# a_stochastic_bundle_method_for_interpolation__4ec52cce.pdf

Journal of Machine Learning Research 23 (2022) 1-57 Submitted 6/20; Revised 04/21; Published 2/22

A Stochastic Bundle Method for Interpolating Networks

Alasdair Paren alasdair.paren@gmail.com Department of Engineering Science University of Oxford Oxford, UK

Leonard Berrada lberrada@robots.ox.ac.uk Department of Engineering Science University of Oxford Oxford, UK

Rudra P. K. Poudel rudra.poudel@crl.toshiba.co.uk Cambridge Research Laboratory, Toshiba Europe Ltd, Cambridge, UK.

M. Pawan Kumar pawan@robots.ox.ac.uk Department of Engineering Science University of Oxford Oxford, UK.

Editor: Julien Mairal

We propose a novel method for training deep neural networks that are capable of interpolation, that is, driving the empirical loss to zero. At each iteration, our method constructs a stochastic approximation of the learning objective. The approximation, known as a bundle, is a pointwise maximum of linear functions. Our bundle contains a constant function that lower bounds the empirical loss. This enables us to compute an automatic adaptive learning rate, thereby providing an accurate solution. In addition, our bundle includes linear approximations computed at the current iterate and other linear estimates of the DNN parameters. The use of these additional approximations makes our method signiﬁcantly more robust to its hyperparameters. Based on its desirable empirical properties, we term our method Bundle Optimisation for Robust and Accurate Training (BORAT). In order to operationalise BORAT, we design a novel algorithm for optimising the bundle approximation eﬃciently at each iteration. We establish the theoretical convergence of BORAT in both convex and non-convex settings. Using standard publicly available data sets, we provide a thorough comparison of BORAT to other single hyperparameter optimisation algorithms. Our experiments demonstrate BORAT matches the state-of-the-art generalisation performance for these methods and is the most robust. Keywords: Bundle Methods, Stochastic Optimisation, Neural Network Optimisation, Interpolation, Adaptive Learning-rate

Authors contributed equally to this work.

2022 Alasdair Paren, Leonard Berrada, Rudra P. K. Poudel and M. Pawan Kumar.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-1248.html.

Paren, Berrada, Poudel and Kumar

1. Introduction

Training a deep neural network (DNN) is a challenging optimization problem: it involves minimizing the average of many high-dimensional non-convex loss functions. In practice, the main algorithms utilised are Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951) and adaptive gradient methods such as Ada Grad (Duchi et al., 2011) or Adam (Kingma and Welling, 2014). It has been observed that SGD tends to provide better generalization performance than adaptive gradient methods (Wilson et al., 2017). However, the downside of SGD is that it requires the manual design of a learning-rate schedule. Many forms of schedule have been proposed in the literature, including piece wise constant (Huang et al., 2017), geometrically decreasing (Szegedy et al., 2015) and warm starts with cosine annealing (Loshchilov and Hutter, 2017). Examples of these schemes are plotted in Figure 1. Consequently, practitioners who wish to use SGD in a novel setting need to select a learning-rate schedule for their learning task. To that end, they ﬁrst need to choose the parameterization of the schedule (e.g. picking one of the examples given above). Then they need to tune the corresponding parameters to get good empirical performance. This typically results in a cross-validation that searches over many critical and sensitive hyper-parameters. For example, a piece wise linear scheme requires a starting learning rate value, a decay factor and a list or metric to determine at which points in training to decay the learning rate. Due to the high dimensionality of this search space performing a grid search can mean training a large number of models. This number increases exponentially in combination with other hyperparameters such as regularisation and batch size. Thus, ﬁnding an SGD learning rate schedule that produces strong generalisation performance for a new task is time and computationally demanding, often requiring hundreds of GPU hours.

0 25 50 75 100 125 150 175 200 Epochs

Learning Rate (ηt)

Piece wise constant Geometric Cyclic with Cosine Annealing

Figure 1: SGD Learning rate schedules proposed in the literature

In this work, we alleviate this issue by presenting a family of algorithms that achieve comparable generalisation performance to SGD with a highly reﬁned learning rate schedule, while requiring far less tuning of hyperparameters. This in turn leads to a reduction in the time, cost and energy required when ﬁnding a highly accurate network for a new task. In more detail, we present a novel family of proximal algorithms for the optimisation of DNNs that are capable of interpolation. A DNN is said to interpolate a data set if it has the ability

A Stochastic Bundle Method for Interpolation

to simultaneously drive the loss of all the training samples in a data set to their minimum value. Thus a lower bound of the objective function is known: for instance, it is close to zero for a model trained with the cross-entropy loss. Our algorithms approximate the loss within each proximal problem by a bundle, that is, a point-wise maximum over linear functions. By including the interpolation lower bound within this bundle, we obtain the following two modelling beneﬁts: (i) our model more closely mimics the true loss function than existing baselines like SGD, and; (ii) the learning rate gets automatically re-scaled at each iteration of the algorithm, thereby providing accurate updates. By increasing the number of linear approximations in the bundle, the true loss is better modelled. As an upshot, we obtain additional stability to optimisation and task-speciﬁc hyperparameters. Based on its highly desirable empirical properties, we term these methods Bundle optimization for Robust and Accurate Training (BORAT). All the variants of BORAT use a single learning rate hyperparameter that requires minimal tuning. In particular, the learning rate hyperparameter is kept constant throughout the training procedure, unlike the learning rate of SGD that needs to be decayed for good generalization. The BORAT family of algorithms obtain state-of-the-art empirical performance for single hyperparameter training of neural networks.

Contributions

We design a family of adaptive algorithms that have a single hyperparameter that does not need any decaying schedule. In contrast, the related APROX (Asi and Duchi, 2019) and L4 (Rolinek and Martius, 2018) use respectively two and four hyperparameters for their learning-rate.

For the deep learning setting we establish a link between stochastic optimisation methods with adaptive learning rates and proximal bundle methods.

We provide convergence rates in various stochastic convex settings and for a class of non-convex problems.

We derive a novel algorithm for solving small quadratic programs where the constraints deﬁne the probability simplex. This algorithm permits a parallel implementation allowing for eﬃcient solution on modern hardware.

We empirically demonstrate the increased stability to hyperparameters when increasing the bundle size. We show this on the CIFAR100 and Tiny Image Net data sets.

We achieve state-of-the-art results for learning a diﬀerentiable neural computer; training variants of residual networks on the SVHN and CIFAR data sets, and training a Bi LSTM on the Stanford Natural Language Inference data set.

A preliminary version of this work appeared in the proceedings of ICML 2020 (Berrada et al., 2019b). While this previous work has considered bundles of size 2 resulting in the ALI-G algorithm detailed with in Section 3.3. This article signiﬁcantly diﬀers from the previous work by (i) considering bundles of size greater than 2; (ii) introducing an novel algorithm to compute the exact solution of each bundle; (iii) investigating the robustness towards hyperparameters; and (iv) showing how the use of large bundles permits easy application of BORAT to challenging non-smooth losses.

Paren, Berrada, Poudel and Kumar

2. Preliminaries

Loss Function. We consider a supervised learning task where the model is parameterized by w Rd. Usually, the objective function can be expressed as an expectation over z Z, a random variable indexing the samples of the training set:

f(w) Ez Z[ℓz(w)], (1)

where each ℓz is the loss function associated with the sample z. We assume that each ℓz admits a known lower bound B. For simplicity we will often assume this lower bound is 0, which is the case for the large majority of loss functions used in machine learning. For instance, suppose that the model is a deep neural network with weights w performing classiﬁcation. Then for each sample z, ℓz(w) can represent the cross-entropy loss, which is always non-negative. Other non-negative loss functions include the structured or multi-class hinge loss, and the ℓ1 or ℓ2 loss functions for regression. Note that it is always possible to subtract a non-zero lower bound B from the loss function to deﬁne a new equivalent problem that satisﬁes the aforementioned assumption.

Interpolation. In this work we consider DNNs that can interpolate the data. Formally, we assume: w : z Z, ℓz(w ) ϵ, (2)

where ϵ is a tolerance on the amount of error in the interpolation assumption. We will often want to make reference to the case when ϵ = 0. Following previous work (Ma et al., 2018b) we will refer to this setting as perfect interpolation. The interpolation property is satisﬁed in many practical cases, since modern neural networks are typically trained in an over-parameterized regime where the parameters of the model far exceed the size of the training data (Li et al., 2020). Additionally most modern DNN architectures can be easily increased in size and depth, allowing them to interpolate all but the largest data sets. Note, the data has to be labelled consistently for this to be possible. For instance it is impossible to interpolate a data set with two instances of the same image that have two diﬀerent labels.

Regularisation. It is often desirable to encourage generalisation by the addition of a regularisation function φ(w) to the objective. Typical choices for φ include λ w 2 and λ w 1 where λ governs the strength of the regularisation. However, in this work we incorporate such regularisation as a constraint on the feasible domain: Ω= w Rd : φ(w) r for some value of r. In the deep learning setting, this will allow us to assume that the objective function can be driven close to zero without unrealistic assumptions about the value of the regularisation term for the ﬁnal set of parameters. Our framework can handle any constraint set Ωon which Euclidean projections are computationally eﬃcient. This includes the feasible set induced by ℓ2 regularization: Ω= w Rd : w 2 2 r , for which the projection is given by a simple rescaling of w. Finally, note that if we do not wish to use any regularization, we deﬁne Ω= Rd and the corresponding projection is the identity.

Problem Formulation. The learning task can be expressed as the problem (P) of ﬁnding a feasible vector of parameters w Ωthat minimizes f:

w argmin w Ω f(w). (P)

We refer to the minimum value of f over Ωas f : f minw Ωf(w).

A Stochastic Bundle Method for Interpolation

Proximal Perspective of Projected Stochastic Gradient Descent. In order to best introduce BORAT, we ﬁrst detail the proximal interpretation of projected stochastic gradient descent (PSGD). The PSGD algorithm can be seen as solving a sequence of proximal problems. Within each proximal problem, a minimisation is performed over an approximate local model of the loss. This approximation is the ﬁrst order Taylor s expansion of ℓzt around the current iterate and a proximal term. At time step t, the PSGD proximal problem has the form:

wt+1 = argmin w Ω

2ηt w wt 2 + ℓzt(wt) + ℓzt(wt) (w wt) . (3)

Here wt is the current iterate, zt is the index of the sample chosen and ηt is the learning rate. For convex Ω, problem (3) can be solved in two steps: ﬁrst solving the unconstrained problem and then using Euclidean projection onto Ω. Setting Ω= Rd removes the need for projection and we recover SGD. When clear from the context we will use SGD to refer to both projected and un-projected variants. To solve the unconstrained problem one only needs to set the gradient to zero to recover the familiar closed form SGD update.

3. The BORAT Algorithm

In this section we detail the BORAT algorithm. We start by introducing BORAT s proximal problem that is exactly solved at each iteration. We explain its advantages and disadvantages in relation to the SGD proximal problem (3). Each BORAT proximal problem is best solved in the dual. Hence we next introduce the dual problem, which permits a far more eﬃcient solution due to its low dimensionality. We then consider a special case of BORAT with minimal bundle size which we call Adaptive Learning-rates for Interpolation with Gradients (ALI-G). ALI-G permits a closed form solution and results in an automatically scaled gradient descent step. Speciﬁcally, it recovers a stochastic variant of the Polyak step size (Polyak, 1969), which oﬀers competitive results in practice. This special case is used extensively in our experiments and in our analysis to establish the convergence rate of BORAT. Lastly, we detail our novel direct method for eﬃciently solving the general dual problem. This algorithm exploits the small size of the bundle to compute the exact optimum by solving a ﬁnite number of linear systems, removing the need for an inner iterative optimisation algorithm.

3.1 Primal Problem

For tackling problems of type (P) we identify two deﬁciencies in the proximal view of SGD (3). First, the approximation of the loss permits negative values even though the loss for (P) is deﬁned to be non-negative. Second, the accuracy of a linear model quickly deteriorates for functions with high curvature away from the site of the approximation. Due to this crude model, the selection of ηt for all t is critical for achieving good performance with SGD. BORAT aims to address these deﬁciencies by using a model composed of a point-wise maximum over N linear approximations to better model the loss. One of the linear approximations is chosen to be a constant function equal to the loss lower bound, that is, 0. By including this linear approximation, we address the ﬁrst deﬁciency. The second deﬁciency is addressed by making use of the remaining N 1 linear approximations. These extra approximations allow us to model variation over the parameter space and positive

Paren, Berrada, Poudel and Kumar

curvature of the loss. A model of this form in combination with a proximal term is commonly known as a bundle of size N. The main disadvantage of bundles is that they require multiple gradient evaluations to be performed and then held in memory. Hence we in this work only consider N 5, except where mentioned otherwise. Each linear approximation of the loss is constructed at a point ˆwn t using a diﬀerent loss function ℓzn t . The subscript t indicates the iteration number, and n indexes over the N linear approximations. With this notation we ﬁrst introduce a bundle of size 1 as:

wt+1 = argmin w Ω

2η w wt 2 + ℓz1 t ( ˆw1 t ) + ℓz1 t ( ˆw1 t ) (w ˆw1 t ) . (4)

If we set ˆw1 t to wt we recover the SGD proximal problem. Thus SGD eﬀectively uses a bundle of size N = 1. We next introduce an expanded expression for a bundle of size N, before showing how to convert this into the compact form of maxn [N] an (w wt) + bn . Within a bundle each linear approximation is formed around a diﬀerent point ˆwn t . Hence in order to write each linear approximation in the aforementioned compact form we split each linear term in two. The ﬁrst piece is a constant term, that does not depend on w, and is a multiple of the distance between the current iterate and the centre of each approximation ˆwn t wt. The second term depends on the distance w wt, for all linear approximations. This gives the following expanded form for a bundle of size N as:

n ℓzn t ( ˆwn t ) ℓzn t ( ˆwn t ) ( ˆwn t wt) + ℓzn t ( ˆwn t ) (w wt) o , (5)

where we use the notation [N] to deﬁne the set of integers {1, ..., N}. If we deﬁne bn t = ℓzn t ( ˆwn t ) ℓzn t ( ˆwn t ) ( ˆwn t wt), we can thus simplify the expression into the desired compact form. We now introduce the BORAT proximal problem at time t with a bundle of size N, which can be stated as:

wt+1 = argmin w Ω

2η w wt 2 + max n [N]

n ℓzn t ( ˆwn t ) (w wt) + bn t o . (6)

For BORAT we always set ˆw1 t = wt. We additionally use the last linear approximation to enforce the lower bound on the loss. This is done by setting ℓzt( ˆw N t ) = [0, ..., 0] , b N t = B = 0. We give details on how we select ˆwn t for n {2, ..., N 1} in Section 3.5. Thus each bundle is composed of N 1 linear approximations of the function, and the lower bound on the loss. These linear approximations of the loss need to be stored in memory during each step. Hence, in order to ﬁt on a single GPU we only consider small bundle sizes in this work (N 5). For clarity we depict a 1D example for a bundle with N = 3 in Figure 2. Unlike SGD, the BORAT proximal problem (6) is not smooth and hence cannot be solved by simply setting the derivatives to zero. Instead we choose to solve each proximal problem in the dual.

A Stochastic Bundle Method for Interpolation

wt = ˆw1 t w ˆw2 t wt+1

Loss function lz1 t Loss function lz2 t Linearization at ˆw1 t Linearization at ˆw2 t Lower Bound B Model of f

Figure 2: Illustration of a BORAT bundle (N = 3) in 1D, shown in green. Two stochastic samples ℓz1 t and ℓz2 t of the loss function f are shown in red and blue (solid lines). The bundle is formed of the max of three linear approximations (dashed lines) and a proximal term. Two of these linear approximations are formed using the loss functions ℓz1 t and ℓz2 t , and the last enforces the known lower bound on the loss. Here the BORAT approximation gives a more accurate model than approximation used by ALI-G, which would only include the linearization at ˆw1 t and the lower bound B. In this simple example this improved accuracy allows for a larger step to be taken towards the minimum.

3.2 Dual Problem

The dual of (6) is a constrained concave quadratic maximisation over N dual variables α1, ..., αN, and can be concisely written as follows (see supplementary material for derivation):

αt = argmax α N D(α), where D(α) = η

2α A t Atα + α bt. (7)

Here At is a N d matrix whose nth row is ℓzt( ˆwn t ). We deﬁne bt = [b1 t , ..., b N t ] , α = [α1, α2, . . . , αN] and N is a probability simplex over the N variables. The dual problem (7) has a number of features that make it more appealing for optimisation than the primal (6). First, the primal problem is deﬁned over the parameters space w Rd, where p is in the order thousands if not millions for modern DNNs. In contrast, the dual variables are of dimension N, where is N typically a small number > 10 due to the memory requirements. Second, the dual is smooth and hence allows for faster convergence with standard optimisation techniques. Furthermore, as will be seen shortly, we use the fact that the dual feasible region is a tractable probability simplex to design a customised algorithm for its solution. We detail this algorithm in this Section 3.6. Once we have found the dual solution αt, we recover the following update from the KKT conditions:

wt+1 = wt ηAtαt. (8)

The form of the update (8) deserves some attention. Since each row of At is either the gradient of the loss ℓzn t or a zero vector, and αt belongs to the probability simplex, the update step moves in the direction of a non-negative linear combination of negative gradients ℓzn t ( ˆwn t ). Due to the deﬁnitions of ℓzt( ˆw N t ) = [0, ..., 0] and b N t = 0, any weight given

Paren, Berrada, Poudel and Kumar

to αN reduces the magnitude of the resulting step. This has the eﬀect that as the loss value gets close to zero BORAT automatically decreases the size of the step taken.

3.3 ALI-G (N=2)

We now consider a special case of BORAT with N = 2. Here the bundle is the point wise maximum over the linear approximation of the loss around the current point and the global lower bound B, which we assume is 0. This special case is worthy of extra attention for the following four reasons. First, this special case only requires one gradient evaluation per step and has a similar time complexity to SGD. Second, it admits a closed form solution. Third, we use this special case extensively in our analysis of the convergence rate of BORAT. Fourth, given the deﬁnitions ˆw1 t = wt and ˆw N t = 0, if we set N = 2 we recover an algorithm that simply scales the SGD learning rate. Speciﬁcally, it automatically scales down a maximal learning rate η by a factor α1 [0, 1] to an appropriate value close to optimality. This is clear from the simpliﬁed version of Equation (8), which has the following form:

wt+1 = wt α1η lzt(wt). (9)

Hence we will call this special case Adaptive Learning-rates for Interpolation with Gradients (ALI-G). For ALI-G the primal problem (6) simpliﬁes to the following:

2η w wt 2 + max{0, ℓzt(wt) + ℓzt(wt) (w wt)} . (10)

Likewise, the dual problem (7) can be reduced to the following:

α1 = argmax α1 [0,1]

2 α1 lzt(wt) 2 + α1lzt(wt) o . (11)

The ALI-G dual is a maximisation over a constrained concave function in one dimension. Hence we can obtain the optimal point by projecting the unconstrained solution on to the feasible region. This results in the following closed form solution:

α1 = min lzt(wt) η lzt(wt) 2 , 1 . (12)

This value of α1 is then used in (9). The ALI-G update can be viewed as a stochastic analog of the Polyak step size (Polyak, 1969) with the addition of a maximal value η. Recall that, from the interpolation assumption, we have f = 0. The ALI-G update is computationally cheap with the evaluation of lzt(wt) 2 being the only extra computation required over SGD. Hence when the interpolation assumption holds we suggest that ALI-G can be easily used in place of SGD.

3.4 The Advantage of Bundles with more than Two Pieces

While ALI-G has many favourable qualities, its local model of the loss is still crude. We next give two motivating examples to help demonstrate why using a more complex model of the loss often increases the robustness to η. Thus it may prove useful to use larger values of N

A Stochastic Bundle Method for Interpolation

in settings that are sensitive to step size η. In the convex setting any function can be perfectly modelled by the point-wise maximum over an inﬁnite number of linear approximations. While intractable, performing a minimisation over this model would recover the true optimum by deﬁnition. With this perfect model any value of η could be used. Setting η to a large enough value would recover the optimum in a single step. This example demonstrates that, at least asymptotically as the accuracy of the local model increases we can expect a reduced dependence on the correct scale of the step size.

f : w 7 w2 |w|3

Linearizations of f

Figure 3: A simple example where the ALI-G step-size oscillates due to non-convexity. For this problem, ALI-G only converges ifs its maximal learning-rate η is less than 10. By contrast for the same example BORAT with N > 2 converges for all values of η. Additionally for η 10 it converges to the optimum in a single update.

Figure 3 provides a non-convex motivating example for use of larger values of N. Here we demonstrate a 1D symmetric function where ALI-G does not converge for large η. Instead it oscillates between the two values w = 3/5 and w = 3/5. However, if we were to use a bundle with a N 3 our model of the loss would include both the linear approximations at w = 3/5 and w = 3/5 simultaneously and hence when minimising over this model we converge to the optimum for any value of η. While this is a carefully constructed synthetic example it highlights why we would expect a more accurate model of the loss can help to reduce the dependence on the step size.

3.5 Selecting Additional Linear Approximations or the Bundle

When constructing bundles of size larger than two, we are faced with two design decisions regarding how to select additional linear approximations to add to the bundle. First, where in parameter space should we construct the additional linear approximations ˆwn t ? And second, should we use the same mini-batch of data when constructing the stochastic linear approximations, or should we sample a new batch to evaluate each linear approximation?

Paren, Berrada, Poudel and Kumar

Selecting ˆwn t . Ideally we would select the location of the linear approximations ˆwn t for n {2, ..., N 1} in order to maximise the progress made towards w at each step. However, without the knowledge of w a priori, due to the high dimensional and non-convex nature of problem (P) this is infeasible. Instead we make use of a a heuristic. Inspired by the work of previous bundle methods for convex problems (Smola et al., 2007, Asi and Duchi, 2019) we select ˆwn t using the following method:

ˆwn t = argmin w Rd

2η w wt 2 + max k [n 1]

n ℓzk t ( ˆwk t ) (w wt) + bk t o . (13)

In other words we construct the bundle by recursively adding linear approximations centred at the current optimum. This method of selecting additional linear approximations is appealing as it requires no extra hyperparameters and helps reﬁne the approximation in the region of parameter space that would be explored by the next update.

Re-sampling z for additional linear approximations. When constructing additional linear approximations we choose to re-sample z. Concretely, we use a new mini-batch of data to construct each stochastic linear approximation. While it is possible to construct all N 1 non-zero linear approximations using the same batch of data we ﬁnd this does not work well in practice. Indeed, such a method behaves similarly to taking multiple consecutive steps of SGD on the same mini-batch, which tends to produce poor optimisation performance.

Summary. We now summarise the bundle construction procedure for N > 2. We construct a bundle around wt by ﬁrst using two linear approximations, one centred at wt and the second given by a known lower bound on the loss. We then sequentially add linear approximations until we have N. These extra linear approximations are constructed one at a time using new batches of data and centred around the point that is the current minimizer of the bundle. We note that each parameter update of BORAT uses N 1 batches of data. Therefore BORAT updates the parameters N 1 fewer times than SGD in an epoch (given the same batch-size). Once the bundle is fully constructed we update wt. At this stage we apply momentum (if enabled) and project back on the feasible set Ω. The construction of the bundle requires solving a minimization problem for each newly added piece when N 2. Therefore it is critical that such a problem gets solved very eﬃciently. We next detail how we do this by solving the corresponding dual problem for N 2.

3.6 Eﬃcient Dual Algorithm to Compute N 2 Linear Pieces

In the general case (N > 2), the dual has more than 2 degrees of freedom and can not be written as a 1D minimisation. Thus the method derived for the case N = 2 is no longer applicable. This means it is not possible to obtain a simple closed form update. We must instead run an inner optimisation to solve (7) at each step. Many methods exist for maximising a concave quadratic objective over a simplex. Two algorithms particularly well suited to problems of the form (7) are (Frank and Wolfe, 1956) or Homotopy Methods (Bach et al., 2011). However, we propose a novel algorithm that exploits the fact that N is small to ﬁnd the maximum directly. This method decomposes the problem of solving (7) into several sub-problems which provides two computational conveniences. First, the BORAT algorithm repeatedly searches for solutions to a bundle with only one newly added linear piece since

A Stochastic Bundle Method for Interpolation

the last search. As one might expect, this task shares a great number of sub-problems with the previous solution and allows for much of the computation to be reused. Second, our dual algorithm allows for a parallel implementation, which makes it very fast on the hardware commonly used for deep learning. To illustrate the eﬃciency of our dual method, the run time of the dual solution, that is ﬁnding αt once we have constructed (7), takes less than 5% of the time spent in the call of the optimiser. Note due to the large size of the the networks and the small size of N the majority of the call time is dominated computing A t At and Atαt. We now formally introduce our dual solution algorithm. Our method uses the observation that at the optimal solution in a simplex the partial derivatives will be equal for all nonzero dimensions. This observation can be formally stated as: Proposition 1 (Simplex Optimally Conditions)

Let F : RN R be a concave function. Let us deﬁne α = argmaxα F(α). Then there exists c R such that:

n [N] such that αn > 0, we have: F(α)

α=α = c. (14)

In other words, the value of the partial derivative is shared among all coordinates of α that are non-zero.

This proposition can be easily proved by contradiction. If the partial derivatives are not equal, then moving in the direction of the largest would result in an increase in function value. Likewise moving in the negative direction would produce a decrease. Hence the current point cannot be optimal. Please see Appendix B for a formal proof of Proposition 1. In the following paragraphs we explain how this proposition can be used to break up the task of solving problem (7) into 2N 1 subproblems. Given a unique subset I of non-zero dimensions of α Each of these subproblems involves ﬁnding the unconstrained optimum and checking if this point lies within the simplex. We now give an example of a single subproblem. To simplify notation, let Q ηA t At. We note that:

α=α = Qα + bt. (15)

If we knew that α had exclusively non-zero coordinates, then by applying Proposition (1) to the dual objective D we can recover a solution α and by solving the following linear system: α c

The ﬁrst N rows of the system would enforce that α satisﬁes the condition given by Proposition 1, and the last row of the linear system would ensure that the coordinates of α sum to one. In the general case, we do not know which coordinates of α are non-zero. However, since typical problems are in low-dimension N, we can enumerate all possibilities of subsets of non-zero coordinates for α . We detail this further. We consider a non-empty subset I [N], for which we deﬁne:

Q[I I] (Qi,j)i I,j I , b[I] (bt,i)i I . (17)

Paren, Berrada, Poudel and Kumar

We then solve the corresponding linear subsystem: φ(I)

Q[I I] 1 1 0

This φ(I) R|I| can then be lifted to RN by setting zeros at coordinates non-contained in I. Formally, we deﬁne ψ(I) RN such that:

i [N], ψ(I) i =

( φ(I) i if i I, 0 otherwise. (19)

Therefore, given I [N], we can generate a candidate solution ψ(I) for problem (7) by solving a linear system in dimension |I|. In the following proposition, we establish that doing so for all possibilities of I guarantees to ﬁnd the correct solution:

Proposition 2 (Problem Equivalence)

We deﬁne the set of feasible solutions reached by the diﬀerent candidates ψ(I):

Ψ = n ψ(I) : I [N], I = o N. (20)

Then we have that: argmax α N D(α) = argmax ψ Ψ D(ψ). (21)

In other words, we can ﬁnd the optimal solution of (7) by enumerating the members of Ψ and picking the one with highest objective value.

This proposition is trivially true because Ψ is simply the intersection between (i) the original feasible set N and (ii) the set of vectors that satisfy the necessary condition of Optimality given by Proposition 1. This insight results in Algorithm 1 which returns a optimal solution to the dual problem. We characterise this claim in the following proposition.

Proposition 3 (Sets of Solutions)

Algorithm 1 returns a solution α that satisﬁes α argmaxα N D(α). This is true even when a the dual does not have a unique solution. Proof given in Appendix C.

Procedurally Algorithm 1 starts by computing Q = ηA t At and bt to form a master system Qx = b. We consider each of the 2N 1 subsystems of Qx = bt deﬁned by an element of the set set I (lines 1-3), where I represents the set of none-zero dimensions of each subsystem. For each subsystem we get a independent subproblem. To ensure the solution to each subproblem will satisfy PN n=1 xn = 1 and all partial derivatives have equal value an extra row-column is introduced to each system (line 2). We then compute the point which satisﬁes the optimality conditions detailed in Proposition 1 for each subsystem, by solving for x. We then check if each of these points is feasible, that is, belongs to N by examining signs xn 0, n {1, ..., n} (line 4). Note, we have by construction PN n=1 xn = 1. Finally, we select α as the feasible point with maximum dual value (lines 7-8). The optimal α is then used to deﬁne the weight update (8). For example, if α = [1, 0, ..., 0] an SGD step wt+1 = wt η ℓzt(wt) will be taken.

A Stochastic Bundle Method for Interpolation

The BORAT algorithm can be viewed as automatically picking the best step out of a maximum of (2N 1) possible options, where each option has a closed form solution. Although the computational complexity of this method is exponential in N, this algorithm is still very eﬃcient in practice for two reasons. First, we only consider small N. Second, as sub problems can be solved independently, it permits a parallel implementation. With a fully parallel implementation the time complexity of this algorithm reduces to O N3 . Empirically with such an implementation, for N 10 we observe approximately a linear relationship between N and time taken per training epoch. See Appendix F for a comparison of training epoch time between BORAT and SGD.

Algorithm 1 Dual Optimisation Algorithm Require: η, N, P = {S : S {1, 2, ..., N}, S = }, Q = ηA t At, bt, dmax = 0.

1: for I P do

2: ˆQ = Q[I I] 1 1 0

, ˆb = b[I] 1

see Equation (17) for deﬁnitions

3: φ[I] = solvex ˆQx = ˆb solve the subsystem, see Equation (18)

4: if φ[I] i 0, i {1, 2, ..., |I|} then check for non negativity of solution

5: ψ(I) = select(φ[I], I) select elements according to Equation (19)

2ψ(I) Qψ(I) + ψ(I) bt compute the dual value

7: if d dmax then save maximum value

8: dmax = d, α = ψ(I)

9: Return α return optimal value

3.7 Computation considerations

While the method of adding linear approximations detailed in (13) requires running Algorithm 1 at each inner loop iteration, when adding an additional element αn if we keep track of the best dual value we need only compute the 2N 1 new subproblems that include non-zero αn. Thus we only need to run Algorithm 1 once for each wt update, that is once per N 1 batches.

3.8 Summary of Algorithm

The full BORAT method is outlined in Algorithm 2. The bundle is constructed in lines 4-6. The update is obtained in lines 7-9 using Algorithm 1. Finally, the updated parameters are projected to the feasible region Ωin line 9. In the majority of our experiments, we accelerate BORAT with Nesterov momentum. We use Nesterov momentum as we ﬁnd this helps produce strong generalise performance. The update step at line 9 of Algorithm 2 is then replaced by (i) a velocity update vt = µvt 1 ηAtα and (ii) a parameter update wt+1 = ΠΩ(wt + µvt).

Paren, Berrada, Poudel and Kumar

Algorithm 2 The BORAT Algorithm Require: maximal learning-rate η, maximum bundle size N 2, initial feasible w0 Ω.

2: while not converged do

3: Get ℓzt( ˆw1 t ), ℓzt( ˆw1 t ) with zt drawn i.i.d.

4: for n = 2, ..., N 1 do sample additional points

5: Sample zt,n Z, ℓzn t ( ˆwn t ), ℓzn t ( ˆwn t )

6: compute ˆwn+1 t according to (13)

7: compute ηA t At and bt 8: α = argmaxα N η

2α A t Atα + α bt see Algorithm 1 for and details

9: wt+1 = ΠΩ(wt ηAtα ) here ΠΩis the projection onto Ω

10: t = t + 1

11: end while

4. Justiﬁcation and Analysis

The interpolation setting gives by deﬁnition, f = 0. However, more subtly, it also allows the updates to rely on the stochastic estimate ℓzt(wt) rather than the exact but expensive f(wt). Intuitively, this is possible because in the interpolation setting, we know the global minimum is achieved for each loss function ℓzt(wt) simultaneously. The following results formalise the convergence guarantee of BORAT in the stochastic setting. Note here we prove these result for BORAT with the minor modiﬁcation, that is, all linear approximations are formed using the same mini-batch of data, ℓzn t = ℓzt for all n {2, ..., N 1}. First, we consider the standard convex setting, where we additionally assume the interpolation assumption is satisﬁed and that each ℓz is Lipschitz continuous. Next we consider an important class of non-convex functions used for analysis in previous works related to interpolation (Vaswani et al., 2019a).

Theorem 1 (Convex and Lipschitz)

Let Ωbe a convex set. We assume that for every z Z, ℓz is convex and C-Lipschitz. Let w be a solution of (P), and assume that we have perfect interpolation: z Z, ℓz(w ) = 0. Then BORAT for N 2 applied to f satisﬁes:

(T + 1) + w0 w 2

η(T + 1) . (22)

Hence BORAT recovers the same asymptotic rate as SGD without the need to reduce the learning rate η. In the Appendix D we show that for convex and β-smooth and the α-strongly convex settings BORAT recovers rates of O(1/T) and O(exp(αT)) respectively. Note the earlier version of this work provides convergence results for ALI-G without perfect interpolation. We follow earlier work (Vaswani et al., 2019a) and provide a convergence rate for BORAT applied to non-convex functions that satisfy the Restricted Secant Inequality (RSI). A function is said to satisfy the RSI condition with constant µ over a the set Ωif the following

A Stochastic Bundle Method for Interpolation

w Ω, ℓz(w), w w µ||w w ||. (23)

Theorem 2 (RSI )

We consider problems of type (P). We assume ℓz satisﬁes the RSI with constant µ, smoothness constant β and perfect interpolation e.g. lz(w ) = 0, z Z. Then if set η ˆη = min{ 1

4µ, µ 2β2 } then in the worst case we have:

f(w T+1) f exp 3

8 ˆηµ T ||w0 w ||2. (24)

In this setting BORAT recovers the same asymptotic rate as SGD.

5. Related work

Bundle Methods. Bundle methods have been primarily proposed for the optimisation of non-smooth convex functions (Lemaréchal et al., 1995, Smola et al., 2007, Auslender, 2009). However, these works do not treat the stochastic case, as they consider small problems where the full gradient can be cheaply evaluated. To our knowledge our work is the ﬁrst to propose a bundle method for the optimisation of stochastic non-convex problems.

Interpolation in Deep Learning. The interpolation property of DNNs was utilised by early eﬀorts to analyse the convergence speed of SGD. These works demonstrate that SGD achieves the convergence rates of full-batch gradient descent in the interpolation setting (Ma et al., 2018a, Vaswani et al., 2019b, Zhou et al., 2019). Such works are complementary to ours in the sense that they provide a convergence analysis of an existing algorithm for deep learning. In a diﬀerent line of work, Liu et al. (2019a) propose to exploit interpolation to prove convergence of a new acceleration method for deep learning. However, their experiments suggest that the method still requires the use of a hand-designed learning-rate schedule.

Adaptive Gradient Methods. Similarly to BORAT, most adaptive gradient methods also rely on tuning a single hyperparameter, thereby providing a more pragmatic alternative to SGD that needs a speciﬁcation of the full learning-rate schedule. While the most popular ones are Adagrad (Duchi et al., 2011), RMSPROP (Tieleman and Hinton, 2012), Adam (Kingma and Welling, 2014) and AMSGrad (Reddi et al., 2018), there have been many other variants (Zeiler, 2012, Orabona and Pál, 2015, Défossez and Bach, 2017, Levy, 2017, Mukkamala and Hein, 2017, Zheng and Kwok, 2017, Bernstein et al., 2018, Chen and Gu, 2018, Shazeer and Stern, 2018, Zaheer et al., 2018, Chen et al., 2019, Loshchilov and Hutter, 2019, Luo et al., 2019). However, as pointed out in (Wilson et al., 2017), adaptive gradient methods tend to give poor generalization in supervised learning. In our experiments, the results provided by BORAT are signiﬁcantly better than those obtained by the most popular adaptive gradient methods. Recently, Liu et al. (2019b) have proposed to rectify Adam with a learning-rate warm up, which partly bridges the gap in generalization performance between Adam and SGD. However, their method still requires a learning-rate schedule, and thus remains diﬃcult to tune on new tasks.

Paren, Berrada, Poudel and Kumar

Adaptive Learning-Rate Algorithms. Vaswani et al. (2019b) show that one can use line search in a stochastic setting for interpolating models while guaranteeing convergence. This work is complementary to ours, as it provides convergence results with weaker assumptions on the loss function, but is less practically useful as it requires up to four hyperparameters, instead of one for BORAT. Less closely related methods, included second-order ones, adaptively compute the learning-rate without using the minimum (Schaul et al., 2013, Martens and Grosse, 2015, Tan et al., 2016, Zhang et al., 2017, Baydin et al., 2018, Wu et al., 2018, Li and Orabona, 2019, Henriques et al., 2019), but do not demonstrate competitive generalization performance against SGD with a well-tuned hand-designed schedule.

L4 Algorithm. The L4 algorithm (Rolinek and Martius, 2018) also uses a modiﬁed version of the Polyak step-size. However, the L4 algorithm computes an online estimate of f rather than relying on a ﬁxed value. This requires three hyperparameters, which are in practice sensitive to noise and crucial for empirical convergence of the method. In addition, L4 does not come with convergence guarantees. In contrast, by utilizing the interpolation property and a single learning rate, our method is able to (i) provide reliable and accurate minimization with only a single hyperparameter, and (ii) oﬀer guarantees of convergence in the stochastic convex setting.

Frank-Wolfe Methods. The proximal interpretation in Equation (10) allows us to draw additional parallels between ALI-G and existing methods. In particular, the formula of the learning-rate α1 may remind the reader of the Frank-Wolfe algorithm (Frank and Wolfe, 1956) in some of its variants (Locatello et al., 2017), or other dual methods (Lacoste-Julien and Jaggi, 2013, Shalev-Shwartz and Zhang, 2016). This is because such methods solve in closed form the dual of problem (10), and problems in the form of (10) naturally appear in dual coordinate ascent methods (Shalev-Shwartz and Zhang, 2016). When no regularization is used, ALI-G and Deep Frank-Wolfe (DFW) (Berrada et al., 2019a) are procedurally identical algorithms. This is because in such a setting, one iteration of DFW also amounts to solving (10) in closed-form more generally, DFW is designed to train deep neural networks by solving proximal linear support vector machine problems approximately. However, we point out the two fundamental advantages of BORAT over DFW: (i) BORAT can handle arbitrary (lower-bounded) loss functions, while DFW can only use convex piece-wise linear loss functions; and (ii) as seen previously, BORAT provides convergence guarantees in the convex setting.

SGD with Polyak s Learning-Rate. Oberman and Prazeres (2019) extend the Polyak step-size to rely on a stochastic estimation of the gradient ℓzt(wt) only, instead of the expensive deterministic gradient f(wt). However, they still require evaluating f(wt), the objective function over the entire training data set, in order to compute its learningrate, which makes the method impractical. In addition, since they do not do exploit the interpolation setting nor the fact that regularization can be expressed as a constraint, they also require the knowledge of the optimal objective function value f . We also refer the interested reader to the recent analysis of Loizou et al. (2020), which provides a set of improved theoretical results.

a Prox Algorithm. Most similar to this work (Asi and Duchi, 2019) have recently introduced the a Prox algorithm, a family of proximal stochastic optimisation algorithms for

A Stochastic Bundle Method for Interpolation

convex problems. The a Prox truncated model and the a Prox relatively accurate models share aspects to ALI-G and BORAT respectively. However, there are four clear advantages of this work over (Asi and Duchi, 2019) in the interpolation setting, in particular for training neural networks. First, our work is the ﬁrst to empirically demonstrate the applicability and usefulness of the algorithm on varied modern non-convex deep learning tasks most of our experiments use several orders of magnitude more data and model parameters than the small-scale convex problems of (Asi and Duchi, 2019). Second, our analysis and insights allow us to make more aggressive choices of learning rate than (Asi and Duchi, 2019). Indeed, (Asi and Duchi, 2019) assume that the maximal learning-rate is exponentially decaying, even in the interpolating convex setting. In contrast, by avoiding the need for an exponential decay, the learning-rate of BORAT requires only one hyperparameters instead of two for a Prox. Third, our analysis proves fast convergence in function space rather than iterate space. Fourth, unlike BORAT Asi and Duchi (2019) do not include the global lower bound in their a Prox relatively accurate models and does not provide details on how to solve each proximal problem eﬃciently.

6. Experiments

We split the experimental results into two sections. The ﬁrst section demonstrates the strong generalisation performance of ALI-G and BORAT on a wide range of tasks. Here we compare ALI-G and BORAT to other single hyperparameter optimisation algorithms. The second section shows for tasks where SGD and ALI-G are sensitive to the learning rate, using larger N increases both the robustness to the learning rate and task regularisation hyperparameter. For experiments we chosen to investigate BORAT with (N = 3) and (N = 5) and refer to the resulting algorithms as BORAT3 and BORAT5 respectively. For N 2 BORAT uses more than one batch of data for each update. In order to give a fair comparison we keep the number of passes through the data constant for all experiments. This has the eﬀect that BORAT3 and BORAT5 respectively make a half and a quarter of the weight updates of SGD and ALIG. The time per epoch of BORAT is very similar to that of all other methods, see Appendix F for more details. Hence all methods approximately have the same time cost per epoch. Consequently, faster convergence in terms of number of epochs translates into faster convergence in terms of wall-clock time. The code to reproduce our results is publicly available . For baselines we use the oﬃcial implementation where available in Py Torch (Paszke et al., 2017). We use our implementation of L4, which we unit-test against the oﬃcial Tensor Flow implementation (Abadi et al., 2015). we employ the oﬃcial implementation of DFW and we re-use their code for the experiments on SNLI and CIFAR.

6.1 Eﬀectiveness of ALI-G and BORAT

We empirically compare ALI-G and BORAT to the optimisation algorithms most commonly used in deep learning using standard loss functions. Where the hyperparameters of each algorithm are cross validated to select a best performing model. We consider a wide range of

https://github.com/oval-group/borat https://github.com/oval-group/dfw

Paren, Berrada, Poudel and Kumar

problems: training a wide residual network on SVHN, training a Bi-LSTM on the Stanford Natural Language Inference data set, training both wide residual networks and densely connected networks on the CIFAR data sets and lastly training a wide residual network on the Tiny Imagenet data set. For these problems, we demonstrate that ALI-G (N = 2) and BORAT for N {3, 5} obtains comparable performance to SGD with a hand-tuned learning rate schedule, and typically outperforms adaptive gradient methods. Finally, we empirically assess the performance of BORAT and its competitors in terms of training objective on CIFAR-100 and Image Net, in order to demonstrate the scalability of BORAT to large-scale settings. Note that the tasks of training wide residual networks on SVHN and CIFAR-100 are part of the Deep OBS benchmark (Schneider et al., 2019), which aims at standardising baselines for deep learning optimisers. In particular, these tasks are among the most diﬃcult ones of the benchmark because the SGD baseline beneﬁts from a manual schedule for the learning rate where as ALI-G and BORAT uses a single ﬁxed value. Despite this, our set of experiments demonstrate that ALI-G and BORAT obtains competitive performance in relation to SGD. In addition, our methods signiﬁcantly outperforms adaptive gradient methods. All Experiments were performed on a single GPU (SVHN, SNLI, CIFAR) or on up to 4 GPUs (Image Net).

6.1.1 Wide Residual Networks on SVHN

Setting. The SVHN data set contains 73k training samples, 26k testing samples and 531k additional easier samples. From the 73k diﬃcult training examples, we select 6k samples for validation; we use all remaining (both diﬃcult and easy) examples for training, for a total of 598k samples. We train a wide residual network 16-4 following (Zagoruyko and Komodakis, 2016).

Method. For SGD, we use the manual schedule for the learning rate of Zagoruyko and Komodakis (2016). For L4Adam and L4Mom, we cross-validate the main learning-rate hyperparameter α to be in {0.0015, 0.015, 0.15} (0.15 is the value recommended by (Rolinek and Martius, 2018)). For other methods, the learning rate hyperparameter is tuned as a power of 10. The ℓ2 regularization is cross-validated in {0.0001, 0.0005} for all methods but BORAT. For BORAT, the regularization is expressed as a constraint on the ℓ2-norm of the parameters, and its maximal value is set to 100. SGD, BORAT and BPGrad use a Nesterov momentum of 0.9. All methods use a dropout rate of 0.4 and a ﬁxed budget of 160 epochs, following (Zagoruyko and Komodakis, 2016). A batch size of 128 is used for all experiments.

Results. A comparison to other methods is presented in Table 1. On this relatively easy task, most methods achieve about 98% test accuracy. Despite the cross-validation, L4Mom does not converge on this task. However, note that L4Adam achieves accurate results. Even though SGD beneﬁts from a hand-designed schedule, BORAT and other adaptive methods obtain comparable performance on this task.

6.1.2 Bi-LSTM on SNLI

Setting. We train a Bi-LSTM of 47M parameters on the Stanford Natural Language Inference (SNLI) data set (Bowman et al., 2015). The SNLI data set consists of 570k pairs of sentences, with each pair labeled as entailment, neutral or contradiction. This large scale

A Stochastic Bundle Method for Interpolation

Test Accuracy on SVHN (%)

Adagrad 98.0 BPGrad 98.1 AMSGrad 97.9 L4Adam 98.2 DFW 98.1 ALI-G 98.1 L4Mom 19.6 BORAT3 98.1 Adam 97.9 BORAT5 98.1

SGD 98.3 SGD 98.4

Table 1: In red, SGD beneﬁts from a hand-designed schedule for its learning-rate. In black, adaptive methods, including ALI-G, have a single hyperparameter for their learning-rate. SGD refers to the performance reported by (Zagoruyko and Komodakis, 2016).

data set is commonly used as a pre-training corpus for transfer learning to many other natural language tasks where labeled data is scarcer (Conneau et al., 2017) much like Image Net is used for pre-training in computer vision. We follow the protocol of (Berrada et al., 2019a) and we re-use their results for the baselines.

Method. For L4Adam and L4Mom, the main hyperparameter α is cross-validated in {0.015, 0.15} compared to the recommended value of 0.15, this helped convergence and considerably improved performance. The SGD algorithm beneﬁts from a hand-designed schedule, where the learning-rate is decreased by 5 when the validation accuracy does not improve. Other methods use adaptive learning-rates and do not require such a schedule. The value of the main hyperparameter η is cross-validated as a power of ten for the BORAT algorithm and for previously reported adaptive methods. Following the implementation by (Conneau et al., 2017), no ℓ2 regularization is used. The algorithms are evaluated with the Cross-Entropy (CE) loss and the multi-class hinge loss (SVM), except for DFW which is designed for use with an SVM loss only. For all optimisation algorithms, the model is trained for 20 epochs, using a batch size of 64, following (Conneau et al., 2017).

Results. Table 2 compares ALI-G and BORAT against the other optimises. Moreover, ALI-G, which requires a single hyperparameter for the learning-rate, outperforms all other methods for both the SVM and the CE loss functions. BORAT3 and BORAT5 achieve the same results to ALI-G for both losses.

6.1.3 Wide Residual Networks and Densely Connected Networks on CIFAR

Setting. We follow the methodology of our previous work (Berrada et al., 2019a). We test two architectures: a Wide Residual Network (WRN) 40-4 (Zagoruyko and Komodakis, 2016) and a bottleneck Dense Net (DN) 40-40 (Huang et al., 2017). We use 45k samples for training and 5k for validation. The images are centred and normalized per channel. We apply standard data augmentation with random horizontal ﬂipping and random crops.

Method. We compare ALI-G and BORAT to other common single hyperparameter optimisers. Here we cross validate the hyperparameters in order to ﬁnd a best performing network for each method. AMSGrad was selected in (Berrada et al., 2019a) because it was the best adaptive method on similar tasks, outperforming in particular Adam and Adagrad.

Paren, Berrada, Poudel and Kumar

Test Accuracy on SNLI (%)

CE SVM CE SVM

Adagrad 83.8 84.6 Adam 84.5 85.0 AMSGrad 84.2 85.1 BPGrad 83.6 84.2 DFW - 85.2 L4Adam 83.3 82.5 L4Mom 83.7 83.2 BORAT3 84.4 85.2 ALI-G 84.8 85.2 BORAT5 84.5 85.2

SGD 84.7 85.2 SGD 84.5 -

Table 2: In red, SGD beneﬁts from a hand-designed schedule for its learning-rate. In black, adaptive methods have a single hyperparameter for their learning-rate. With an SVM loss, DFW and ALI-G are procedurally identical algorithms but in contrast to DFW, ALI-G can also employ the CE loss. Methods in the format X re-use results from (Berrada et al., 2019a). SGD is the result from (Conneau et al., 2017).

In addition to these baselines, we also provide the performance of L4Adam, L4Mom, Adam W (Loshchilov and Hutter, 2019) and Yogi (Zaheer et al., 2018). We follow the cross validation scheme of (Berrada et al., 2019a) restating it here for completeness, All methods. employ the CE loss only, except for the DFW algorithm, which is designed to use an SVM loss. The batch-size is cross-validated in {64, 128, 256} for the DN architecture, and {128, 256, 512} for the WRN architecture. For L4Adam and L4Mom, the learning-rate hyperparameter α is cross-validated in {0.015, 0.15}. For AMSGrad, Adam W, Yogi, DFW, ALI-G and BORAT the learning-rate hyperparameter η is cross-validated as a power of 10 (in practice η {0.1, 1} for BORAT). SGD, DFW and BORAT use a Nesterov momentum of 0.9. For all methods excluding ALI-G, BORAT and Adam W, the ℓ2 regularization is cross-validated in {0.0001, 0.0005} on the WRN architecture, and is set to 0.0001 for the DN architecture. For Adam W, the weight-decay is cross-validated as a power of 10. For ALI-G, BORAT, ℓ2 regularization is expressed as a constraint on the norm of the vector of parameters; and its value is cross validated in {50, 75, 100}. For all optimisation algorithms, the WRN model is trained for 200 epochs and the DN model for 300 epochs, following (Zagoruyko and Komodakis, 2016) and (Huang et al., 2017) respectively.

Results. Table 3 details the results of the comparison of ALI-G and BORAT against other single hyperparameter optimisers for the CE loss only. In this setting, ALI-G and BORAT outperforms AMSGrad, Adam W, Yogi, L4Adam and L4 Mom and constant step size SGD by a large margin. This is true for all values of N. The second best method is DFW which also has the restriction that it can only be used in conjunction with the hinge loss. ALI-G and BORAT produce the test accuracy achievable using SGD with the manual learning rate schedules from (Huang et al., 2017) and (Zagoruyko and Komodakis, 2016) for half of the model data set combinations considered here. For these tasks BORAT provides state of the art results without the need for the learning rate to be manually adapted through out training. For the remaining two combinations; training a DN on Cifar10 and training a WRN on Cifar100, BORAT lags in performance by approximately 0.2% and 2% respectively. With minor variation depending on which version of BORAT is used. Note SGD with a

A Stochastic Bundle Method for Interpolation

hand tuned learning rate schedule provides a reasonable upper limit of the generalisation performance achievable due to the amount of time that has been put into improving the schedule.

Test Accuracy on CIFAR (%)

CIFAR-10 CIFAR-100

WRN DN WRN DN

SGD 91.2 91.5 67.8 67.5 AMSGrad 90.8 91.7 68.7 69.4 Adam W 92.1 92.6 69.6 69.5 Yogi 91.2 92.1 68.7 69.6 DFW 94.2 94.6 76.0 73.2 L4Adam 90.5 90.8 61.7 60.5 L4Mom 91.6 91.9 61.4 62.6 ALI-G 95.4 94.5 76.1 76.2 BORAT3 95.4 94.9 76.0 76.5 BORAT5 95.0 94.9 75.8 75.7

Table 3: Test accuracy of single hyperparameter optimisation methods. Each reported result is an average over three independent runs; the standard deviations and optimal hyperparameters are reported in Appendix F (the standard deviations are at most 0.4 for ALI-G and BORAT).

6.1.4 Wide Residual Networks on Tiny Image Net

Setting. Tiny Image Net contains 200 classes for training where each class has 500 images. The validation set contains 10,000 images. All images are RGB with 64x64 pixels. The test-set contains 10,000 images however ground truth labels are not freely available. We again use the Wide Residual Network (WRN) detailed in Section 6.1.1. The images are centred and normalized per channel. We apply standard data augmentation with random horizontal ﬂipping and random crops.

Method. We investigate using SGD (with a constant step-size), ALI-G and BORAT to train a WRN on Tiny Image Net. We use Adam (Kingma and Ba, 2015) as an indicator of what can be expected from a popular adaptive method applied to the same task. We make use of both the cross entropy (CE) and multi-class hinge (SVM) losses. The learning-rate hyperparameter η is cross validated in powers of 10, a batch-size of 128 and a training time of 200 epochs was used for all experiments. The ℓ2 regularisation is cross validated in powers of 10 for ADAM. For constant step-size SGD, ALI-G and BORAT we make use of the constrained base regularisation detailed in Section 2, cross validating r {50, 100, 150, 200, 250}. Additionally all methods excluding Adam use a Nesterov momentum of 0.9.

Results. The best results for SGD, ALI-G, BORAT and Adam are shown in table 4. For both losses Adam performs worst, only achieving 2.4% validation accuracy when optimising the SVM loss. The best results are achieved by BORAT with N = 5 and N = 3 for the

Paren, Berrada, Poudel and Kumar

CE and SVM losses respectively. These results suggest the added gain of BORAT is more pronounced for larger more challenging data sets.

Validation Accuracy on Tiny Image Net (%)

CE Loss Hinge Loss

Adam 55.0 2.4 SGD 59.4 23.2 ALI-G 61.1 24.9 BORAT3 61.1 44.1 BORAT5 62.1 39.4

Table 4: Validation accuracy of single hyperparameter optimisation methods on the Tiny Image Net data set for cross entropy and hinge loss.

6.1.5 Comparing Training Performance on CIFAR-100

In this section, we empirically assess the performance of ALI-G and its competitors in terms of training objective on CIFAR-100. In order to have comparable objective functions, the ℓ2 regularization is deactivated. The learning-rate is selected as a power of ten for best ﬁnal objective value, and the batch-size is set to its default value. For clarity, we only display the performance of SGD, Adam, Adagrad and BORAT (DFW does not support the cross-entropy loss). The L4 methods diverge in this setting. Here SGD uses a constant learning-rate to demonstrate the need for adaptivity. Therefore all methods use one hyperparameter for their learning-rate. All methods use a ﬁxed budget of 200 epochs for WRN-CIFAR-100 and 300 epochs for DN-CIFAR-100. As can be seen, ALI-G and BORAT provides better training performance than the baseline algorithms on both tasks. Here BORAT3 and BORAT5 are slightly slower than ALI-G due to the fewer number of parameter updates.

0 25 50 75 100 125 150 175 200

Training Objective

WRN-CIFAR100

Adagrad Adam SGD ALI-G BORAT3 BORAT5

0 50 100 150 200 250 300

DN-CIFAR100

Figure 4: Objective function over the epochs on CIFAR-100 (smoothed with a moving average over 5 epochs). ALI-G and BORAT reaches a value that is an order of magnitude better than the baselines.

A Stochastic Bundle Method for Interpolation

6.1.6 Training at Large Scale

We demonstrate the scalability of BORAT by training a Res Net18 (He et al., 2016) on the Image Net data set. In order to satisfy the interpolation assumption, we employ a loss function tailored for top-5 classiﬁcation (Lapin et al., 2016), and we do not use data augmentation. Our focus here is on the training objective and accuracy. ALI-G and BORAT uses the following training setup: a batch-size of 1024 split over 4 GPUs, a ℓ2 maximal norm of 400 for w, a maximal learning-rate of 10 and no momentum. As can be seen in ﬁgure 5, ALI-G and BORAT reaches 99% top-5 accuracy in 12 epochs, and accurately minimises the objective function to 2 10 4 within 90 epochs.

0 10 20 30 40 50 60 70 80 90 10 4

Training Objective

0 10 20 30 40 50 60 70 80 90 40

Training Top-5 Accuracy (%)

ALI-G BORAT3 BORAT5

Figure 5: Training performance of a Res Net-18 learning with diﬀerent versions of BORAT on Image Net. Note, that all versions converge at similar rates even though for larger values of N, BORAT3 and BORAT5 makes far fewer updates.

6.2 Robustness of BORAT

In this section we show that using additional linear approximations increases the robustness of BORAT to its learning rate η and the problem regularisation amount r. BORAT produces high accuracy models for a wider range of values than SGD and ALI-G. Hence on problems where SGD and ALI-G are sensitive to their learning rate BORAT with N 2 produces the best performance. These results demonstrate the advantage of using a larger bundle to model the loss in each proximal update. In order to illustrate this increased robustness we assess the stability of constant step size SGD, ALI-G, BORAT3 and BORAT5 to their hyperparameters. This is done by completing a grid search over η and r for a number of tasks while holding the batch size and epoch budget constant. This allows us to assess the range of values where these algorithms produce high accuracy models. We perform this grid search for six tasks split over the CIFAR100 and Tiny Image Net data sets. We choose these data sets as they are challenging yet a model capable of interpolation can still ﬁt on a single GPU. We compare against SGD with a constant learning rate for two reasons. First, SGD eﬀectively uses a bundle of size 1, composed of only the linearization around the current iterate. Second, constant step size SGD has a single

Paren, Berrada, Poudel and Kumar

learning rate hyperparameter with the same scale as BORAT and also permits the constraint based regularisation described in Section 2.

6.2.1 Wide Residual Networks on CIFAR100 and Tiny Image Net

Setting. For a description of the CIFAR100 and Tiny Imagenet data sets please refer to sections 6.1.3 and 6.1.4 respectively.

Method. We run four separate experiments on the CIFAR100 data set. The ﬁrst two of these experiments examine the robustness to hyperparameters of SGD, ALI-G and BORAT when in combination with the cross entropy (CE) and multi-class hinge (SVM) losses. For the second two experiments on CIFAR100 we investigate the same algorithms in the presents of label noise. We limit our self to the CE loss for these two experiments. Label noise is applied by switching the label of the images in the training set with probability p to a random class label in the same super class. We repeat the experiments without label noise on the more challenging Tiny Image Net data set, resulting in six diﬀerent settings. For all algorithms we train a wide residual network (WRN) detailed in Section 6.1.3 and make use of a Nesterov momentum of 0.9 as we found its use produced superior generalisation performance. For all experiments we use a batch size of 128 and perform a grid search over the 20 hyper parameter combinations given by r {50, 100, 150, 200, 250} and η {0.01, 0.1, 1.0, 10.0}.

A Stochastic Bundle Method for Interpolation

50 100 150 200 250

1e-2 1e-1 1e0 1e1

Step Size (η)

68.9 70.9 69.6 69.8 69.9

55.7 66.0 70.7 71.9 72.5

20.2 43.1 54.7 60.8 64.8

1.0 1.0 1.7 1.0 1.0

SGD (N = 1)

50 100 150 200 250

1e-2 1e-1 1e0 1e1

CE Loss - Test Accuracy (%)

74.4 71.8 69.6 70.2 70.5

56.5 76.3 73.3 73.2 72.5

22.8 43.9 54.4 61.3 64.2

1.0 1.0 1.0 1.0 1.0

ALI-G (N = 2)

50 100 150 200 250

Hyperparameter Controlling the Regularisation (r)

1e-2 1e-1 1e0 1e1

73.4 70.1 67.8 67.9 67.8

63.9 73.8 72.4 72.6 72.5

31.7 54.0 64.4 74.8 72.8

5.8 13.5 26.1 33.8 41.2

BORAT (N = 3)

50 100 150 200 250

1e-2 1e-1 1e0 1e1

71.8 67.5 65.3 65.6 65.2

76.4 72.5 71.0 71.3 71.2

46.8 62.5 74.3 72.5 72.2

15.4 30.3 42.7 52.0 55.7

BORAT (N = 5)

50 100 150 200 250

1e-2 1e-1 1e0 1e1

Step Size (η)

15.3 9.2 7.9 9.8 8.7

43.8 39.4 32.1 32.2 34.5

1.0 2.0 1.6 1.0 1.0

1.0 1.0 1.0 1.0 1.0

50 100 150 200 250

1e-2 1e-1 1e0 1e1

Hinge Loss - Test Accuracy (%)

15.8 10.8 8.5 7.7 7.3

47.7 48.9 38.9 42.0 37.2

1.5 1.3 1.6 1.0 2.2

1.0 1.0 1.0 1.0 1.0

50 100 150 200 250

Hyperparameter Controlling the Regularisation (r)

1e-2 1e-1 1e0 1e1

12.2 5.7 5.3 5.1 5.6

44.0 35.5 34.6 41.8 36.5

1.0 49.1 60.1 69.4 64.6

1.0 1.0 1.4 1.0 1.0

50 100 150 200 250

1e-2 1e-1 1e0 1e1

5.4 4.9 4.4 4.4 5.2

35.5 25.6 23.7 17.3 20.4

41.9 62.3 64.3 59.3 60.0

1.0 1.0 1.0 1.0 1.0

Figure 6: Comparison of SGD and BORAT s robustness to hyperparameters when increasing N for the CE and SVM losses. Experiments are performed on the CIFAR100 data set. Colour represents test performance, where darker colours correspond to higher values. For both losses, increasing N allows for higher learning rates to be used while still producing convergent behaviour. Additionally for the CE loss and η {1.0, 10.0} larger N allows for a smaller r or greater levels of regularisation to be used. Consequently increasing N improves the over all robustness to hyperparameters, while not sacriﬁcing generalisation performance.

Results. Figure 6 details the robustness of SGD, ALI-G and BORAT for the CE and SVM losses without label noise. In these two settings ALI-G exhibits similar robustness to SGD, however it outperforms SGD in terms of the performance of the best model trained. On the CE loss SGD and ALI-G are reasonably robust, providing good results for the majority of hyperparameter combinations with η 1. For the SVM loss all algorithms are sensitive to their learning rate η. SGD and ALI-G only produce an increase in accuracy for small values of η, hence no model achieves high accuracy in the 200 epoch budget. For both losses, increasing N improves the robustness and produces convergent behaviour for larger values of η and smaller values of r. This is particularly pronounced for the SVM loss. Here permitting larger η allows for a high accuracy network to be trained within the 200 epoch budget. For both losses and η = 0.01 BORAT3 and BORAT5 slightly under perform compared to ALI-G and SGD. This is simply a consequence of these methods making signiﬁcantly fewer updates. In spite of this the resultant eﬀect of increasing the bundle size is a larger range of hyperparameters that produce good generalisation performance. The results from the label noise experiments are shown in Figure 7. For p = 0.1 the results closely mirror those where no label noise is used, shown in Figure 6, however here all models achieve roughly 0 5% worse test performance. When the noise is increased to p = 0.5 the test error increases drastically by 0 25%. In both cases p = 0.1 and p = 0.5, increasing N

Paren, Berrada, Poudel and Kumar

permits to obtain good results for a larger range of values of η and r. Interestingly, using N = 3, 5 results in the best performance when p = 0.5. This is somewhat expected since using a larger bundle size means more samples are used to calculate each parameter update and hence helps reduce the eﬀect of the label noise.

50 100 150 200 250

1e-2 1e-1 1e0 1e1

Step Size (η)

63.8 65.4 63.8 64.0 63.0

55.3 58.9 59.9 63.9 66.4

21.4 35.1 42.1 53.0 56.1

1.7 1.0 1.0 1.0 1.0

SGD (N = 1)

50 100 150 200 250

1e-2 1e-1 1e0 1e1

Label Nosie (p = 0.1)

70.9 66.8 65.1 64.6 66.0

41.8 59.1 68.4 67.9 68.3

14.7 33.4 43.4 56.4 58.7

1.0 1.0 1.0 1.0 1.0

ALI-G (N = 2)

50 100 150 200 250

Hyperparameter Controlling the Regularisation (r)

1e-2 1e-1 1e0 1e1

70.0 65.3 63.2 63.2 62.5

59.4 70.4 68.3 67.1 66.9

19.2 47.3 57.1 57.9 67.7

6.5 14.4 24.0 32.9 39.5

BORAT (N = 3)

50 100 150 200 250

1e-2 1e-1 1e0 1e1

67.7 63.2 60.8 60.0 60.7

61.3 67.8 67.0 67.0 65.8

41.4 54.3 68.2 67.7 66.6

14.6 27.6 40.0 48.5 50.2

BORAT (N = 5)

50 100 150 200 250

1e-2 1e-1 1e0 1e1

Step Size (η)

40.3 42.6 41.0 43.0 43.1

47.3 46.3 35.8 40.8 42.1

10.2 28.9 39.5 47.0 42.2

1.0 1.0 1.0 7.9 1.0

50 100 150 200 250

1e-2 1e-1 1e0 1e1

Label Nosie (p = 0.5)

42.3 45.8 44.5 43.7 44.1

45.9 45.2 47.8 48.2 47.2

6.6 26.4 44.5 43.0 43.5

1.0 1.0 1.0 1.0 1.0

50 100 150 200 250

Hyperparameter Controlling the Regularisation (r)

1e-2 1e-1 1e0 1e1

49.0 44.5 42.1 41.7 42.9

56.3 46.8 47.0 46.2 45.4

20.0 40.8 54.8 45.9 37.1

2.5 8.2 13.5 22.0 23.1

50 100 150 200 250

1e-2 1e-1 1e0 1e1

46.6 41.1 39.8 39.0 39.3

52.1 46.4 44.7 46.2 44.5

31.9 51.9 42.8 44.0 45.8

5.2 19.4 26.3 35.0 40.3

Figure 7: Test accuracy of SGD, ALI-G, BORAT3 and BORAT5 robustness to hyperparameters when trained on CIFAR100 with noisy labels. Here we give results with two diﬀerent levels of noise p = 0.1 (upper row) and p = 0.5 (lower row). For readability purposes, the colour encodes test accuracy, where darker colours correspond to higher values. Increasing N allows for higher learning rates and greater levels of regularisation to be used. Additionally, when the level of noise is high (p = 0.5, lower row), BORAT3 and BORAT5 signiﬁcantly outperforms SGD (N = 1) and ALI-G (N = 2).

Figure 8 details the robustness of SGD, ALI-G and BORAT for the Tiny Imagenet experiments. These results show BORAT oﬀers improved robustness of on multiple data sets as we recover similar performance to CIFAR100. For the CE loss increasing N allows for slightly higher learning rates and greater levels of regularisation to be used for the CE loss. For the SVM loss BORAT produces models with reasonable accuracy with η {0.1, 1.0} opposed to SGD and ALI-G that produce poor results for all learning rates excluding η = 1.0. Consequently, when training with the SVM loss for 200 epochs, BORAT produces a drastically better models than SGD and ALI-G. This happens despite BORAT making signiﬁcantly fewer updates of wt in the 200 epochs.

A Stochastic Bundle Method for Interpolation

50 100 150 200 250

1e-2 1e-1 1e0 1e1

Step Size (η)

55.8 59.4 57.0 57.3 57.1

34.7 53.3 54.1 54.6 57.5

8.9 22.9 34.9 43.9 47.3

0.5 0.5 0.5 0.5 0.5

SGD (N = 1)

50 100 150 200 250

1e-2 1e-1 1e0 1e1

CE Loss - Validation Accuracy (%)

55.1 58.9 57.0 57.3 57.7

33.8 53.1 53.4 61.1 60.9

8.7 21.3 34.0 43.7 49.4

0.6 0.5 0.5 3.6 0.5

ALI-G (N = 2)

50 100 150 200 250

Hyperparameter Controlling the Regularisation (r)

1e-2 1e-1 1e0 1e1

54.5 56.8 55.5 55.3 55.1

45.6 55.5 61.1 60.2 59.4

15.4 35.2 47.1 52.6 54.5

2.5 6.1 10.8 15.8 18.7

BORAT (N = 3)

50 100 150 200 250

1e-2 1e-1 1e0 1e1

58.2 54.4 53.2 53.7 53.4

54.4 61.5 59.0 58.2 58.5

29.1 47.3 56.0 56.6 62.1

6.5 15.5 25.1 33.3 39.5

BORAT (N = 5)

50 100 150 200 250

1e-2 1e-1 1e0 1e1

Step Size (η)

6.1 2.4 2.1 1.9 2.2

23.1 23.2 19.3 17.7 17.7

0.5 0.8 0.5 0.5 0.7

0.5 0.5 0.5 0.5 0.5

50 100 150 200 250

1e-2 1e-1 1e0 1e1

Hinge Loss - Validation Accuracy (%)

5.1 2.8 2.3 1.5 2.3

24.9 20.8 16.1 11.0 14.1

0.5 0.5 0.6 0.5 0.5

0.5 0.5 0.5 0.5 0.5

50 100 150 200 250

Hyperparameter Controlling the Regularisation (r)

1e-2 1e-1 1e0 1e1

3.2 1.4 1.9 1.8 1.9

28.5 23.9 17.1 19.4 18.0

0.8 0.5 44.1 38.0 1.0

0.5 0.5 0.5 0.5 0.5

50 100 150 200 250

1e-2 1e-1 1e0 1e1

2.7 1.2 1.4 1.3 1.4

24.3 11.1 7.2 8.0 7.5

25.4 25.9 39.4 38.2 36.5

0.5 0.5 0.5 0.5 0.5

Figure 8: Comparison of SGD, ALI-G and BORAT s robustness to hyperparameters on the Tiny Image Net data set. Here we investigate the aﬀect of increasing the bundle size when in use with the CE and Hinge losses. Colour represents validation performance, where darker colours correspond to higher values. The the CE loss is used BORAT produces an increase in the range of hyperparameters that result in high accuracy models. For the SVM loss BORAT is the only optimiser to produce accurate results

Across all six experiments BORAT consistently produces high accuracy models for a larger range of hyperparameter combinations than SGD or ALI-G. This is particularly pronounced for the SVM loss experiments where the decrease sensitivity of BORAT makes all the diﬀerence between ﬁnding hyperparameters that result in a high accuracy model and not. Consequently, in comparison to its competitors, BORAT is systematically more robust to the choice of learning-rate and regularization hyper-parameter, and also oﬀers better generalization more often than not. Therefore we particularly recommend using BORAT for tasks where hyper-parameter tuning is a highly time consuming task.

7. Discussion

In this work have introduced BORAT a bundle method designed for the optimisation of DNNs capable of interpolation. We detailed a special case of BORAT with a minimal bundle size that we name ALI-G. ALI-G produces a algorithm that automatically decays the steps size as the loss of the iterate approaches its minimal value. By only needing to ﬁnd a good maximal learning rate that is automatically decayed we remove the labour and compute intensive task of ﬁnding a good learning rate schedule. Using standard publicly available data sets, we have shown that a ALI-G is highly eﬀective in a number of settings. When tuning a

Paren, Berrada, Poudel and Kumar

single hyperparameter for the learning-rate, ALI-G achieves state of the art performance while being simple to implement and having minor additional computational cost over SGD. For problems where ALI-G is sensitive to its learning rate its robustness and performance can be enhanced by using BORAT which makes use of a greater bundle size. BORAT produces high accuracy models for a larger range of hyperparameters and is particularly eﬀective when using the multi-class hinge loss or in the presence of label noise. However, BORAT also achieves similar generalisation performance to ALI-G in all settings considered. Our results suggest that BORAT is presently the most robust single hyperparameter optimisation method for DNNs. It may be possible to modify BORAT so that it can make a parameter update after each gradient evaluation by cleverly selecting how linear approximations are added and removed from the bundle. If done correctly this may lead to a further increase in the speed of convergence. However we leave this to future work. When keeping the total number of gradient evaluations constant, a downside of using a larger bundle size for BORAT is that the number of parameter updates decreases. Despite this, in our experiments, BORAT is able to obtain good results within the same budget of passes through the data. Notably, this also means that BORAT has a time per epoch comparable to that of adaptive gradient methods. Another potential criticism of BORAT is that its memory footprint grows linearly with the bundle size. However in practice, our results show that using a bundle size of three, which corresponds to the same memory cost as ADAM, is often suﬃcient to obtain improved robustness. Finally, the applicability of BORAT can be limited by the required assumption of interpolation. While we argue that this interpolation property is satisﬁed in many interesting use cases, it may also not hold true for one or more of the following confounding factors: (i) limited size of the neural network, such as those used in embedded devices; (ii) large size of the training data set, which is becoming increasingly common in machine learning; and (iii) complexity of the loss function, such as in adversarial training. Furthermore, the concept of interpolation itself is ill-deﬁned for unsupervised tasks. Thus, another interesting direction of future work would be to generalise BORAT to non-interpolating settings. This could be achieved by adapting an estimate of the minimum value of the objective function online whilst retaining the desirable quality of minimal hyperparameter tuning.

A Stochastic Bundle Method for Interpolation

Appendix A. Dual Derivation

The dual of the primal problem (6), can be written as follows:

2α A t Atα + α bn t o .

Where At is deﬁned as a d N matrix, whose nth row is ℓzn t ( ˆwn t ), bn t = [b1 t , ..., b N t ]

and α = [α1, α2, . . . , αN] .

Proof : We start from the primal problem:

argmin w Rd

2η ||w wt||2 + max n [N]

ℓzn t ( ˆwn t ) (w wt) + bn t , (25)

We deﬁne w = w wt.

2η || w||2 + max n [N]

ℓzn t ( ˆwn t ) w + bn t ,

2η || w||2 + ζ subject to: ζ ℓzn t ( ˆwn t ) w + bn t , n [N],

min w Rd,ζ sup αn 0, n

( 1 2η || w||2 + ζ

n=1 αn(ζ ℓzn t ( ˆwn t ) w bn t )

sup αn 0, n min w Rd,ζ

( 1 2η || w||2 + ζ

n=1 αn(ζ ℓzn t ( ˆwn t ) w bn t )

, (strong duality)

sup αn 0, n min w Rd,ζ

( 1 2η || w||2 + ζ +

n=1 αn( ℓzn t ( ˆwn t ) w + bn t ζ)

The inner problem is now smooth in w and ζ. We write the KKT conditions:

n=1 αn ℓzn t ( ˆwn t ) = 0

We deﬁne N {α RN : PN n=1 αn = 1, αn 0, n = 1, ..., N} as probability simplex over the elements of α. Thus when we plug in the KKT conditions we obtain:

n=1 αn ℓzn t ( ˆwn t )||2 +

ℓzn t ( ˆwn t ) (η

m=1 αm ℓzm t ( ˆwm t )) + bn t

n=1 αn ℓzn t ( ˆwn t )||2 η

n=1 αn( ℓzn t ( ˆwn t ) N X

m=1 αm ℓzm t ( ˆwm t )) +

n=1 αn ℓzn t ( ˆwn t )||2 η||

n=1 αn ℓzn t ( ˆwn t )||2 +

Paren, Berrada, Poudel and Kumar

n=1 αn ℓzn t ( ˆwn t )||2 +

Using the deﬁnitions of At, bn t and α, we can recover the following compact form for the dual problem:

2α A t Atα + α bn t o .

Appendix B. Proof of Proposition 1

Proposition 1

Let F : RN R be a concave function. Let us deﬁne α = argmaxα F(α). Then there exists c R such that:

n [N] such that αn > 0, we have: F(α)

α=α = c. (26)

In other words, the value of the partial derivative is shared among all coordinates of α that are non-zero.

Proof : We start form the optimally condition for constrained problems:

F(α ), α α 0, α (27)

We consider the point ˆα. Without loss of generality we assume ˆα is equal to α for all but two dimensions i and j. We let ˆαi = 0 and ˆαj = α i + α j. Hence we know ˆα as we have P

i α i = 1 and ˆαi 0, i. Letting α = ˆα in (27) give us:

αi α i + F(α )

αj α i 0. (28)

Rearranging we have:

αj α i F(α )

αi α i . (29)

Hence for any α i = 0, we have:

Noticing, this results holds for any i and j gives us the following and proves the result.

αi = c (31)

A Stochastic Bundle Method for Interpolation

Appendix C. Proof of Proposition 3

Proposition 3

Algorithm 1 returns a solution α that satisﬁes α argmaxα N D(α). This is true even when a the dual does not have a unique solution.

Proof : let x argmaxα N D(α) such that x has a maximal number of zero coordinates. x exists as we know the solution set is empty. Let I be the set of non-zero coordinates of x . We denote by S(I) the set of soultions to the linear system associated with I:

S(I) n x R|I| : Qx = b, x 0 o , where, Q Q[I I] 1 1 0

and, b = b[I] 1

S(I) is a polytope as the intersection intersection between the probability simplex and a linear sub-space. There for it admits and vertex representation:

S(I) = Conv(V(I)) (33)

such that Conv( ) denotes the convex hull operation, and V(I) is a ﬁnite set. Since S(I)

contains x [I], S(I) is non-empty and neither is V(I). let v be and element of V(I). Note, that if v had one or more zero coordinates, it would be a solution after lifting to RN it would also be a optimal solution to the dual as good as x while having more zero coordinates than x*, this is impossible by the deﬁnition of x . Thus we can conclude v has exclusively non-zero coordinates. Since v is a external point of S(I), see section 2.1 of Bertsekas (2009) for a deﬁnition. It follows from proposition 2.1.4b of (Bertsekas, 2009) the columns of Q are independent. Therefore the linear system admin a unique solution x , which is found by Algorithm 1.

Appendix D. Convex Results

Adding additional linear approximations to the bundle of BORAT can never result in a lower maximal dual value. Formally:

Dm(α ) Dl(α ) m > l. (34)

Proof : Any vector of l can be lifted to m by appending m l zeros to it, which does not impact the value of the objective function. The lifted set l is then a subset of m, hence the result (maximisation performed performed over a larger space).

Theorem 3 (Convex)

We assume that Ωis a convex set, and that for every z Z, ℓz is convex and. Let w be a solution of (P), and assume that we have perfect interpolation: z Z, ℓz(w ) = 0. Then BORAT for N 2 applied to f satisﬁes: wt+1 w 2 wt w 2 2η max α N D(α), (35)

where D is deﬁned in (7).

Paren, Berrada, Poudel and Kumar

wt+1 w 2 ΠΩ(wt ηAtαt) w 2, (36)

wt ηAtαt w 2, (ΠΩprojection) (37)

wt w 2 + η2 Atαt 2 2η Atαt, wt w , (expanding) (38)

wt+1 w 2 wt w 2 η2 Atαt 2 2η Atαt, wt w , (rearranging) (39)

= η2 Atαt 2 2η Atαt, wt w (40)

= η2 Atαt 2 2η Atαt, wt ˆwn t 2η Atαt, ˆwn t w , (41)

= η2 Atαt 2 2η Atαt, wt ˆwn t 2η

n=1 αn ℓzt( ˆwn t ) ( ˆwn t w ), (AtαN t = 0) (42)

η2 Atαt 2 2η Atαt, wt ˆwn t 2η

n=1 αn t (ℓzt( ˆwn t ) ℓzt(w )), (convexity) (43)

η2 Atαt 2 2η

n=1 αn ℓzt( ˆwn t ) (wt ˆwn t ) 2η

n=1 αn t (ℓzt( ˆwn t ) ℓzt(w )), (44)

η2 Atαt 2 2η

n=1 αn[ℓzt( ˆwn t ) ℓzt( ˆwn t ) ( ˆwn t wt)] + 2η

n=1 αn t ℓzt(w ), (45)

η2 Atαt 2 2ηαtbt 2η

n=1 αn t ℓzt(w ), (bt deﬁnition) (46)

2ηD(αt) + 2η

n=1 αn t ℓzt(w ), (D deﬁnition) (47)

2ηD(αt) + 2η(1 αN t )ℓzt(w ), (

n=1 αn t = 1) (48)

2ηD(αt), (ℓzt(w ) = 0, perfect interpolation) (49)

2η max α N D(α) (αtdeﬁnition) (50)

A consequence of Lemma 2 is that the convergence rate given by Theorem 5 improves as N increases.

D.1 Convex and C-Lipshits

Theorem 1 (Convex and Lipschitz)

We assume that Ωis a convex set, and that for every z Z, ℓz is convex and CLipschitz. Let w be a solution of (P), and assume that we have perfect interpolation: z Z, ℓz(w ) = 0. Then BORAT for N 2 applied to f satisﬁes:

(T + 1) + w0 w 2

η(T + 1) . (51)

A Stochastic Bundle Method for Interpolation

Proof : We start from (50), hence we have:

wt+1 w 2 wt w 2 2ηD(αt) (52)

From Lemma 2 we additionally have that D2(α ) DN>2(α ) hence consider the N = 2 as this provides the worse rate.

wt+1 w 2 wt w 2 2ηD2(αt) (53)

For N = 2 we have exactly two sub problems, and hence can write the dual in following compact form:

2 gzt 2 + ℓzt(wt), if η gzt 2 ℓzt(wt)

1 2η ℓzt(wt)2

gzt 2 if η gzt 2 ℓzt(wt) (54)

We now introduce IT and JT as follows:

IT t {0, ..., T} : η gzt 2 ℓzt(wt)

JT {0, ..., T} \ IT

Deﬁning 1 to be the indicator function we can write:

wt+1 w 2 wt w 2 + 1(t IT )η η gzt 2 2ℓzt(wt) 1(t JT ) ℓzt(wt)2

From our deﬁnition of IT for all t IT we have η gzt 2 ℓzt(wt) hence we can write:

wt+1 w 2 wt w 2 1(t IT )ηℓzt(wt) 1(t JT ) ℓzt(wt)2

Summing over T:

w T +1 w 2 w0 w 2 η X

t IT ℓzt(wt) X

Using w T +1 w 2 0, we obtain:

t IT ℓzt(wt) + X

w0 w 2. (58)

From ℓzt(wt)2

gzt 2 0, we get:

t IT ℓzt(wt) 1

η w0 w 2. (59)

Likewise, using the observation that ℓz 0, we can write:

gzt 2 w0 w 2. (60)

Paren, Berrada, Poudel and Kumar

Diving by C2: X

t JT ℓzt(wt)2 C2 w0 w 2. (61)

Using the Cauchy-Schwarz inequality, we can further write: X

t JT ℓzt(wt)

t JT ℓzt(wt)2. (62)

Therefore we have: X

t JT ℓzt(wt) C p

|JT | w0 w 2. (63)

We can now put together inequalities (59) and (63) by writing:

t=0 ℓzt(wt) = X

t JT ℓzt(wt) + X

t IT ℓzt(wt) (64)

t=0 ℓzt(wt) 1

η w0 w 2 + C p

|JT | w0 w 2 (65)

t=0 ℓzt(wt) 1

η w0 w 2 + C p

(T + 1) w0 w 2 (66)

Dividing by T + 1 and taking the expectation over zt, we ﬁnally get:

t=0 f(wt) f , (f is convex) (67)

(T + 1) + w0 w 2

η(T + 1) . (68)

D.2 Convex and Smooth Lemma 3

Let z Z. Assume that ℓz is β-smooth and non-negative on Rd. Then we have:

(w) Rd, ℓz(w) 1

Note that we do not assume that ℓz is convex.

Proof : Let w Rd. By Lemma 3.4 of Bubeck (2015), we have:

u Rd, |ℓz(u) ℓz(w) ℓz(w) (u w)| β

Therefore we can write:

u Rd, ℓz(u) ℓz(w) + ℓz(w) (u w)| + β

A Stochastic Bundle Method for Interpolation

And since u, lz(u) 0, we have:

u Rd, 0 ℓz(w) + ℓz(w) (u w)| + β

We now choose u = 1

β lz(w), which yeilds:

u Rd, 0 ℓz(w) 1

β ℓz(w) 2 + β

which gives the desired result.

Theorem 3 (Convex and Smooth - Large Eta)

We assume that Ωis a convex set, and that for every z Z, ℓz is convex and βsmooth. Let w be a solution of (P), and assume that we have prefect interpolation: z Z, ℓz(w ) = 0. Then BORAT for N 2 applied to f with η 1 2β satisﬁes:

f 2β w0 w 2

T + 1 . (69)

Proof : We start from Equation (57) we have:

w T +1 w 2 w0 w 2 η X

t IT ℓzt(wt) X

Now using Lemma 3 we can write:

w T +1 w 2 w0 w 2 η X

t IT ℓzt(wt) 1

t JT ℓzt(wt). (71)

From our assumption on η 1 2β we can write:

w T +1 w 2 w0 w 2 1

t IT ℓzt(wt) 1

t JT ℓzt(wt), (72)

w T +1 w 2 w0 w 2 1

t=0 ℓzt(wt). (73)

using w T +1 w 2 0, we obtain:

t=0 ℓzt(wt) 2β w0 w 2. (74)

Dividing by T + 1 and taking the expectation over zt, we ﬁnally get:

t=0 f(wt) f , (f is convex)

f 2β w0 w 2

Paren, Berrada, Poudel and Kumar

Theorem 4 (Convex and Smooth - Small Eta)

We assume that Ωis a convex set, and that for every z Z, ℓz is convex and βsmooth. Let w be a solution of (P), and assume that we have prefect interpolation: z Z, ℓz(w ) = 0. Then BORAT for N 2 applied to f with η 1 2β satisﬁes:

f 2β w0 w 2

T + 1 . (75)

Proof : now considering the η 1 2β starting from (76)

w T +1 w 2 w0 w 2 η X

t IT ℓzt(wt) 1

t JT ℓzt(wt), (76)

w T +1 w 2 w0 w 2 η X

t IT ℓzt(wt) η X

t JT ℓzt(wt), (77)

w T +1 w 2 w0 w 2 η

t=0 ℓzt(wt). (78)

using w T +1 w 2 0, we obtain:

t=0 ℓzt(wt) 1

η w0 w 2. (79)

Dividing by T + 1 and taking the expectation over zt, we ﬁnally get:

t=0 f(wt) f , (f is convex)

D.3 Strongly Convex

Finally, we consider the α-strongly convex and β-smooth case. Lemma 4

Let z Z. Assume that ℓz is α-strongly convex, non-negative on Rd, and such that inf ℓz = 0. Then we have:

w Rd, ℓz(w) ℓz(w) 2 1

Proof : Let w Rd and suppose that ℓz reaches its minimum at w Rd (this minimum exists because of strong convexity). By deﬁnition of strong convexity, we have that:

ˆw Rd, ℓz( ˆw) ℓz(w) + ℓz(w) ( ˆw w) + α

2 ˆw w 2 (81)

A Stochastic Bundle Method for Interpolation

We minimize the right hand-side over ˆw, which gives:

ˆw Rd, ℓz( ˆw) ℓz(w) + ℓz(w) ( ˆw w) + α

2α ℓz(w) 2 (82)

Thus by choosing ˆw = w and re-ordering, we obtain the following result (a.k.a. the Polyak Lojasiewicz inequality):

ℓz(w) ℓz(w) 1

2α ℓz(w) 2 (83)

However we have ℓz(w) = 0 which concludes the proof.

For any a, b Rd, we have that:

a 2 + b 2 1

2 a b 2. (84)

Proof : This is a simple application of the parallelogram law, but we give the proof here for completeness.

a 2 + b 2 1

2 a b 2 = a 2 + b 2 1

2 b 2 + a b

2 b 2 + a b

Let z Z. Assume that ℓz is α-strongly convex and achieves its (possibly constrained) minimum at w Ω. Then we have:

w Ω, ℓz(w) ℓz(w ) α

2 w w 2 (85)

Proof : By deﬁnition of strong-convexity Bubeck (2015), we have:

w Ω, ℓz(w) ℓz(w ) ℓz(w ) (w w ) α

2 w w 2. (86)

In addition, since w minimizes ℓz, then necessarily:

w Ω, ℓz(w ) (w w ) 0. (87)

Combining the two equations gives the desired result.

Finally, we consider the α-strongly convex and β-smooth case. Again, our proof yields a natural separation betweenη 1 2β and η 1 2β.

Paren, Berrada, Poudel and Kumar

Theorem 5 (Strongly Convex - Large Eta)

We assume that Ωis a convex set, and that for every z Z, ℓz is α-strongly convex and β-smooth. Let w be a solution of (P), and assume that we have prefect interpolation: z Z, ℓz(w ) = 0. Then BORAT for N 2 and applied to f with a η 1 2βsatisﬁes:

E[f(wt+1)] f(w ) β

w0 w 2. (88)

Proof : We start from (56):

wt+1 w 2 wt w 2 1(t IT )ηℓzt(wt) 1(t JT ) ℓzt(wt)2

We next use Lemma 3 write:

wt+1 w 2 wt w 2 1(t IT )ηℓzt(wt) 1(t JT ) 1

ellzt(wt). (91)

Now we use our assumption on η to give:

wt+1 w 2 wt w 2 1

2β ℓzt(wt) (92)

taking expectations:

E[ wt+1 w 2] wt w 2 1

2β f(wt) (93)

Using Lemma 6

E[ wt+1 w 2] wt w 2 α

4β wt w 2 (94)

We use a trivial induction over t and write:

E[ wt+1 w 2] 1 α

wt w 2, (95)

E[ wt+1 w 2] 1 α

t w0 w 2, (96)

Given an arbitrary w Rd, we now wish to relate the distance w w 2 to the function values f(w) f(w ). From smoothness, we have:

f(wt + 1) f(w ) f(w ) (wt + 1 w ) + β

2 wt + 1 w 2. (98)

However we know by deﬁnition f(w ) = 0 hence:

f(wt+1) f(w ) β

2 wt+1 w 2. (99)

A Stochastic Bundle Method for Interpolation

Taking expectations:

E[f(wt+1)] f(w ) β

2 E[ wt+1 w 2]. (100)

Hence we recover the ﬁnal rate:

E[f(wt+1)] f(w ) β

t w0 w 2, (101)

E[f(wt+1)] f(w ) β

w0 w 2. (102)

Theorem 6 (Strongly Convex - Small Eta)

We assume that Ωis a convex set, and that for every z Z, ℓz is α-strongly convex and β-smooth. Let w be a solution of (P), and assume that we have prefect interpolation: z Z, ℓz(w ) = 0. Then BORAT for N 2 and applied to f with a η 1 2βsatisﬁes:

E[f(wt+1)] f(w ) β

w0 w 2. (103)

Proof : We start from (56):

wt+1 w 2 wt w 2 1(t IT )ηℓzt(wt) 1(t JT ) ℓzt(wt)2

We next use Lemma 3 write:

wt+1 w 2 wt w 2 1(t IT )ηℓzt(wt) 1(t JT ) 1

2β ℓzt(wt). (105)

Now we use our assumption on η to give:

wt+1 w 2 wt w 2 ηℓzt(wt) (106)

taking expectations:

E[ wt+1 w 2] wt w 2 ηℓzt(wt) (107)

Using Lemma 6

E[ wt+1 w 2] wt w 2 αη

2 wt w 2 (108)

We use a trivial induction over t and write:

E[ wt+1 w 2] 1 αη

wt w 2, (109)

E[ wt+1 w 2] 1 αη

t w0 w 2, (110)

Given an arbitrary w Rd, we now wish to relate the distance w w 2 to the function values f(w) f(w ). From smoothness, we have:

f(wt + 1) f(w ) f(w ) (wt + 1 w ) + β

2 wt + 1 w 2. (112)

Paren, Berrada, Poudel and Kumar

However we know by deﬁnition f(w ) = 0 hence:

f(wt+1) f(w ) β

2 wt+1 w 2. (113)

Taking expectations:

E[f(wt+1)] f(w ) β

2 E[ wt+1 w 2]. (114)

Hence we recover the ﬁnal rate:

E[f(wt+1)] f(w ) β

t w0 w 2, (115)

E[f(wt+1)] f(w ) β

w0 w 2. (116)

Appendix E. None Convex Results

Here we provide the proof of theorem 2, which we restate for clarity. To simplify our analysis, we consider the BORAT algorithm with N = 3. We prove these result for BORAT with the minor modiﬁcation, that is, all linear approximations are formed using the same mini-batch of data, ℓzn t = ℓzt for all n {2, ..., N 1}.

Theorem 2 (RSI )

We consider problems of type (1), see main paper. We assume lz satisﬁes RSI with constant µ, smoothness constant β and perfect interpolation e.g. lz(w ) = 0, z Z. Then if set η ˆη = min{ 1

4µ, µ 2β2 } then in the worst case we have:

f(w T+1) f exp 3

8 ˆηµ T ||w0 w ||2. (117)

In order to derive the proof for Theorem 2 we ﬁrst give a brief overview of BORAT with N = 3. We detail the (2N 1) possible subproblems (7 in this case), and the resulting values of αt for each. We show for η 1 2β, only a sub-set of the subproblems result in valid solutions with optimal points within the simplex. Finally we derive a rate assuming that for the each of the remaining subproblems is optimal for all t. Lastly by taking the minimum of these rates we construct the worst case rate.

E.0.1 BORAT (N = 3)

With (N = 3) the bundle is made of three linear approximations. These are, the lower bound on the loss and linear approximations made at the current point and at the optimum of the bundle of size two. Hence this third linear approximation is made at the location one would reach after taking a ALI-G step. Note, here we use γt to denote the optimal value of α1 as given by 12. As some of this proofs get quite involved, we make use of the following abbreviations to save space:

ˆw1 t = wt, gzt = ℓzn t (wt), b1 t = ℓzn t (wt),

A Stochastic Bundle Method for Interpolation

ˆw2 t = w t, g zt = ℓzn t (w t), b2 t = ℓzn t (w t) + ηγt gzt, g zt .

Where w t is deﬁned as w t = wt ηγt ℓzn t (wt), where γt = min{1, ℓzn t (wt)

η ℓzn t (wt) 2 η. With this

notation deﬁned, the BORAT primal problem for this special case can be simpliﬁed to:

wt+1 = argmin w Rd

2η||w wt||2 + max gzt, w wt + b1 t , g zt, w wt + b2 t , 0 . (118)

The dual of (118) be written an:

αt = argmax α 3

2||α1gzt + α2tg zt||2 + α1ℓzn t (wt) + α2ℓzn t (w t) + α2ηγt gzt, g zt o . (119)

For N = 3 we have the following parameter update:

wt+1 = wt α1tη ℓzn t (wt) α2tη ℓzn t (w t), (120)

E.0.2 subproblems

Algorithm 1 solves (2N 1) sub problem. Each of these linear systems corresponds to a loci of the simplex, deﬁning the feasible set. We refer to each of these (2N 1) options as subproblems. However, each subproblem can also be interpreted as a sub-system of Qα = b, (see line 2 of Algorithm 1 for deﬁnitions). For (N = 3) the form of Qα = b is detailed in (121).

η||gzt||2 η gzt, g zt 0 1 η gzt, g zt η||g zt||2 0 1 0 0 0 1 1 1 1 0

where c is deﬁned in Theorem 1. Each subproblem is deﬁned by setting a unique subset of the dual variables αn to zero, before solving for the remaining variables. For (N = 3) we have exactly seven subproblems, which we each give a unique label, see table 5. For clarity, we detail a few speciﬁc subproblems. The SGD subproblem corresponds to setting α2, α3 = 0, and recovers the SGD update. Likewise, the ESGD step corresponds to setting α1, α3 = 0 and recovers an update similar to the extra gradient method. Finally, by setting α2 = 0 before solving the system we recover the a ALI-G like step, hence we label this subproblem as the ALI-G step. If a subproblem results in a α 3 we refer to that subproblem as feasible, conversely, if it results in a α / 3 we refer to that subproblem as infeasible. Algorithm 2 computes the dual value for all feasible subproblems and selects the largest. This subproblem s α is then used in (120) to update the parameters. The closed form solutions for α for each of the 7 subproblems are listed in table 5. We use a subscript to show that α belongs to a certain subproblem. For example αSGD = [1, 0, 0].

Paren, Berrada, Poudel and Kumar

α1 α2 α3 label

1 0 0 SGD 0 1 0 SEGD 0 0 1 ZERO 0 b2 η||g zt||2 1 b2 η||g zt||2 EALIG

b1 η||gzt||2 0 1 b1 η||gzt||2 ALIG

η||g zt||2 2η gzt,g zt +b1 b2 η||gzt g zt||2 η||gzt||2+b2 b1

η||gzt g zt||2 0 MAX2

b1||g zt||2 b2g ztg zt η||gzt||2||g zt||2 η||gztg zt||2 b2||gzt||2 b1g zta2 η||gzt||2||g zt||2 η||gztg zt||2 1 α1 α2 MAX3

Table 5: subproblems for N = 3.

E.0.3 Dual Values

The corresponding expressions for the dual values for the seven diﬀerent subproblems are detailed below:

DZERO(α) = 0,

DSGD(α) = η

2||gzt||2 + ℓzn t (wt),

DESGD(α) = η

2||g zt||2 + ℓzn t (w t) + ηγt gzt, g zt ,

DALIG(α) = 1

2η ℓzn t (wt)2

DEALIG(α) = 1

ℓzn t (w t) + ηγt gzt, g zt 2

||g zt||2 ,

DMAX2(α) = 1 2η||gzt g zt||2

(ℓzn t (w t) ℓzn t (wt))2 + 2η ℓzn t (w t)||gzt||2 + ℓzn t (wt)||g zt||2

4ηℓzn t (wt) gzt, g zt + 2η2||gzt||2 gzt, g zt η2||gzt||2||g zt||2 ,

DMAX3(α) = 1

ℓzn t (w t)gzt + ηγt gzt, g zt gzt ℓzn t (wt)g zt 2

η||gzt||2||g zt||2 η gzt, g zt 2 .

Due to spatial constraints we state these results without proof. The dual value for each subproblem is simply derived by inserting the closed form expression for α for each subproblem detailed in table 5 into (7). These dual values observe a tree like hierarchy,

DMAX3 DMAX2, DALIG, DEALIG,

DMAX2 DSGD, DESGD,

DALIG DSGD, DZERO,

DEALIG DESGD, DZERO.

A Stochastic Bundle Method for Interpolation

This hierarchy is a consequence of the subproblems maximising the dual over progressively larger spaces, n 1 n.

E.0.4 Feasible Subproblems

To give a worst case rate on the convergence of BORAT with N = 3, we ﬁrst prove for smooth functions and small η, only certain subproblems will be feasible. We start with a useful lemma, before proving this result. Lemma 7

Let z Z. Assume that ℓz is β-smooth. If we deﬁne w = w η ℓz(w) and η 1

β then we have:

(w) Rd, ℓz(w), ℓz(w ) 0.

Note that we do not assume that ℓz is convex.

2 ℓz(w), ℓz(w ) = || ℓz(w) ℓz(w )||2 + || ℓz(w)||2 + || ℓz(w )||2

2 ℓz(w), ℓz(w ) β2||w w ||2 + || ℓz(w)||2 + || ℓz(w )||2, (smoothness)

2 ℓz(w), ℓz(w ) β2η2|| ℓz(w)||2 + || ℓz(w)||2 + || ℓz(w )||2, (w deﬁnition)

ℓz(w), ℓz(w ) 1

2(1 β2η2)|| ℓz(w)||2 + 1

2|| ℓz(w )||2,

ℓz(w), ℓz(w ) 0. ( 1

Lemma 8 (Feasible Subproblems)

If ℓzn t has smoothness constant β and we set η 1 2β for the BORAT algorithm with (N = 3) detailed in Section E.0.1, the ALIG, EALIG and MAX3 subproblems will always be infeasible.

Proof : We start by showing the ALIG step is infeasible. From Lemma 3 we have:

ℓz(w) ||gzt||2 1

For the ALI-G step to be feasible we require α3 ALIG > 0 or 1 ℓzn t (wt) η||gzt||2 > 0. Rearranging, we have:

η > ℓzn t (wt) ||gzt||2 .

Combining these two inequalities gives:

Hence for any 1 2β η, η > ℓzn t (wt) ||gzt||2 cannot hold. We now use a similar argument to show

that the EALIG subproblem is infeasible. For EALIG to be feasible we require α3t EALIG > 0,

Paren, Berrada, Poudel and Kumar

plugging in the closed form solution for α3 EALIG gives:

η > ℓzn t (w t) + ηγt gzt, g zt ||g zt||2 = ℓzn t (w t) + η gzt, g zt ||g zt||2 = ℓzn t (w t) ||g zt||2 + η gzt, g zt ||g zt||2 ℓzn t (w t) ||g zt||2 1

where the penultimate inequality makes use of gzt, g zt 0, which is a direct application of Lemma 7. The ﬁnal inequality is a direct application of Lemma 3. Again, it is clear that the

condition η > ℓzn t (w t) ||g zt||2 cannot be satisﬁed for η 1 2β .

We show that the MAX3 step is never taken for η 1 2β . First, we show that the dual value for the MAX3 step can be written as DMAX3(α) = 1

2 ℓzn t (wt)α1t + ℓzn t (w t)α2t + ηγtα2t gzt, g zt . We start from the dual value stated in Section E.0.3.

DMAX3(α) = 1

ℓzn t (w t)gzt + ηγt gzt, g zt gzt ℓzn t (wt)g zt 2

η||gzt||2||g zt||2 η gzt, g zt 2 ,

DMAX3(α) = 1

2ℓzn t (w t)

ℓzn t (w t)gzt + ηγt gzt, g zt gzt ℓzn t (wt)g zt

η||gzt||2||g zt||2 η gzt, g zt 2 gzt | {z } =α2

2ηγt gzt, g zt

ℓzn t (w t)gzt + ηγt gzt, g zt gzt ℓzn t (wt)g zt

η||gzt||2||g zt||2 η gzt, g zt 2 gzt | {z } =α2

2ℓzn t (wt)

ℓzn t (w t)gzt + ηγt gzt, g zt gzt ℓzn t (wt)g zt

η||gzt||2||g zt||2 η gzt, g zt 2 g zt. | {z } = α1

Using the deﬁnitions of α1 MAX3 and α2 MAX3 we recover the following expression for the MAX3 subproblem s dual function:

DMAX3(α) = 1

2 ℓzn t (wt)α1t + ℓzn t (w t)α2t + ηγtα2t gzt, g zt .

With the dual function in this form it is easy to upper bound the feasible dual value for the MAX3 subproblem as DMAX3 1

2 max ℓzn t (wt), ℓzn t (w t) + ηγt gzt, g zt . This is a consequence of the fact that α must hold for feasible steps. However, from the hierarchy of dual values we have the lower bounds DMAX3 DSGD and DMAX3 DESGD, on the MAX3 dual value, see Section E.0.3. If either of these lower bounds have larger value than the feasible dual value s upper bound, the MAX3 step will not be feasible. We now show that this is always the case for η 1 2β . In order to do this we consider two cases.

We ﬁrst assume ℓzn t (wt) ℓzn t (w t) + ηγt gzt, g zt . Hence we know the maximum feasible value for DMAX3 = 1

2ℓzn t (wt), if either DSGD or DESGD have larger dual value we can conclude the MAX3 step is unfeasible.

DSGD(α) = η

2||gzt||2 + ℓzn t (wt),

Hence if the following condition holds we know the MAX3 step will be unfeasible:

1 2ℓzn t (wt) η

2||gzt||2 + ℓzn t (wt).

A Stochastic Bundle Method for Interpolation

Thus, the converse must hold for the MAX3 step to be feasible:

1 2ℓzn t (wt) η

2||gzt||2 + ℓzn t (wt),

which is equivalent to,

η ℓzn t (wt) ||gzt||2 .

Using the same logic as stated for the ALI-G step we know this condition is never satisﬁed for η 1 2β .

We now assume ℓzn t (wt) ℓzn t (w t) + ηγt gzt, g zt and thus we know the max feasible value of DMAX3 1

2ℓzn t (w t) + 1

2ηγt gzt, g zt , again if either DSGD or DESGD have larger values, we know the MAX3 subproblem is unfeasible:

DESGD(α) = η

2||g zt||2 + ℓzn t (w t) + ηγt gzt, g zt .

Hence for the MAX3 step to be valid we must have:

1 2ℓzn t (w t) + ηγt gzt, g zt η

2||g zt||2 + ℓzn t (w t) + ηγt gzt, g zt ,

which is equivalent to,

η ℓzn t (w t) + ηγt gzt, g zt ||g zt||2 .

Again, we have the same condition as for the EALIG step, which we have already proven can never be feasible for η 1 2β . Hence the MAX3 subproblem is never feasible for η 1 2β .

For any set of vectors a, b, c then, the following inequality holds:

2||a b||2 ||a c||2 + 2||b c||2.

Proof : First consider two vectors x and y.

0 ||x y||2,

0 ||x||2 + ||y||2 2 x, y

2 x, y ||x||2 + ||y||2,

||x + y||2 = ||x||2 + ||y||2 + 2 x, y ,

||x + y||2 2||x||2 + 2||y||2,

2||x||2 2||y||2 ||x + y||2.

Setting x = a b and y = b c gives the desired result.

Paren, Berrada, Poudel and Kumar

Let z Z. Assume that ℓz is β-smooth and non-negative on Rd. Then we have:

(w) Rd, ℓz(w) 1

2β || ℓz(w)||2

Note that we do not assume that ℓz is convex.

Proof : Let w Rd. By Lemma 3.4 of Bubeck (2015), we have:

u Rd, |ℓz(u) ℓz(w) ℓz(w) (u w)| β

2 ||u w||2.

Therefore we can write:

u Rd, ℓz(u) ℓz(w) + ℓz(w) (u w)| + β

2 ||u w||2.

And since u, lz(u) 0, we have:

u Rd, 0 ℓz(w) + ℓz(w) (u w)| + β

2 ||u w||2.

We now choose u = 1

β lz(w), which yeilds:

u Rd, 0 ℓz(w) 1

β || ℓz(w)||2 + β

2 || ℓz(w)||2,

which gives the desired result.

In this section we derive the rate for each of the remaining feasible steps, that is, SGD, ESGD and MAX2.

E.1 SGD Subproblem

We assume that Ω= Rd, for every z Z, ℓz(w) is β and satisﬁes the RSI condition with constant µ. Let w be a solution of f(w). We assume z Z, ℓzn t (w ) = 0. Then, if we apply BORAT with η ˆη = min{ 1

β2 } and we take the step resulting from the SGD subproblem for all t we have:

E[||wt+1 w ||2] (1 ˆηµ)||wt w ||2.

||wt+1 w ||2 ||ΠΩ(wt ηg zt) w ||2, (122)

||wt ηgzt w ||2, (123)

= ||wt w ||2 + η2||gzt||2 2η gzt, wt w , (124)

||wt w ||2 + η2||gzt||2 2ηµ||wt w ||2. (125)

A Stochastic Bundle Method for Interpolation

We have ||gzt||2 2βℓzn t (wt) from Lemma 3 and ℓzn t (wt) β

2 ||wt w ||2 from smoothness giving ||gzt||2 β2||wt w ||2. We can now upper bound the r.h.s producing:

||wt+1 w ||2 ||wt w ||2 + η2β2||wt w ||2 2ηµ||wt w ||2, (126)

||wt+1 w ||2 (1 2ηµ + η2β2)||wt w ||2, (127)

||wt+1 w ||2 1 η(2µ ηβ2) ||wt w ||2. (128)

Now, if we select η ˆη = min{ 1

β2 } in the worst case we get:

||wt+1 w ||2 (1 ˆηµ)||wt w ||2. (129)

Taking expectations with respect to zt:

E[||wt+1 w ||2] E[(1 ˆηµ)||wt w ||2]. (130)

Noting that wt does not depend on zt, and neither does w due to the interpolation property:

E[||wt+1 w ||2] (1 ˆηµ)||wt w ||2. (131)

E.2 ESGD Subproblem

Let z Z. We assume that ℓz is β-smooth and non-negative on Rd. Additionally we

assume that η 1 2β. If we deﬁne γt .= min{1, ℓzn t (wt)

η||gzt||2 }, then we have:

Proof : From our assumption on η and Lemma 3 we have:

2β ℓzn t (wt) ||gzt||2 .

Rearranging gives:

1 ℓzn t (wt) η||gzt||2 .

Plugging in to the the deﬁnition of γt, gives the desired result.

We assume that Ω= Rd, for every z Z, ℓz(w) is β and satisﬁes the RSI condition with constant µ. Let w be a solution of f(w). We assume z Z, ℓzn t (w ) = 0. Then, if we apply BORAT with η ˆη = min{ 1

β2 } and we take the step resulting from the ESGD subproblem for all t we have:

E[||wt+1 w ||2] (1 ˆηµ)||wt w ||2.

Paren, Berrada, Poudel and Kumar

Proof : This proof loosely follows work by Vaswani et al. (2019a).

||wt+1 w ||2 ||ΠΩ(wt ηg zt) w ||2, (132)

||wt ηg zt w ||2, (133)

= ||wt w ||2 + η2||g zt||2 2η g zt, wt w , (134)

= ||wt w ||2 + η2||g zt||2 2η g zt, w t + ηγtgzt w , (135)

= ||wt w ||2 + η2||g zt||2 2η g zt, w t + ηgzt w , (lemma 12) (136)

= ||wt w ||2 + η2||g zt||2 2η g zt, w t w 2η2 g zt, gzt . (137)

Using the RSI condition:

||wt+1 w ||2 ||wt w ||2 + η2||g zt||2 2ηµ||w t w ||2 2η2 g zt, gzt . (138)

Using lemma (9) to upper bound ||w t w ||2,

= ||wt w ||2 + η2||g zt||2 ηµ||w wt||2 + 2ηµ||wt w t||2 2η2 g zt, gzt , (139)

= (1 ηµ)||wt w ||2 + η2||g zt||2 + 2ηµ||wt w t||2 2η2 g zt, gzt , (140)

= (1 ηµ)||wt w ||2 + η2||g zt gzt||2 η2||gzt||2 + 2ηµ||wt w t||2, (141)

= (1 ηµ)||wt w ||2 + η2||g zt gzt||2 η2||gzt||2 + 2η3µ||gzt||2, (142)

(1 ηµ)||wt w ||2 + η2β2||w t wt||2 η2||gzt||2 + 2η3µ||gzt||2, (smoothness) (143)

= (1 ηµ)||wt w ||2 + η4β2||gzt||2 η2||gzt||2 + 2η3µ||gzt||2, (144)

= (1 ηµ)||wt w ||2 + η2 η2β2 1 + 2ηµ ||gzt||2, (145)

Taking expectations with respect to zt:

E[||wt+1 w ||2] = E[(1 ηµ)||wt w ||2 + η2 η2β2 1 + 2ηµ ||gzt||2, (146)

Noting that wt does not depend on zt, and neither does w due to the interpolation property:

E[||wt+1 w ||2] = (1 ηµ)||wt w ||2 + η2 η2β2 1 + 2ηµ E[||gzt||2], (147)

If we set η ˆη = min{ 1

β2 } then we have η2β2 1

2,hence η2β2 1 + 2ηµ 0.

E[||wt+1 w ||2] (1 ηµ)||wt w ||2. (148)

Hence if we insert the chosen value for η then we have:

E[||wt+1 w ||2] (1 ˆηµ)||wt w ||2. (149)

E.3 MAX2 Subproblem

We assume that Ω= Rd, for every z Z, lz(w) is β and satisﬁes the RSI condition with constant µ. Let w be a solution of f(w). We assume z Z, lzt(w ) = 0. Then, if we apply BORAT with η ˆη = min{ 1

β2 } and we take the step resulting from the

A Stochastic Bundle Method for Interpolation

MAX2 subproblem for all t we have:

E[||wt+1 w ||2] 1 3

8 ˆηµ t ||w0 w ||2.

Proof : Note we assume γt = 1 as proved in Lemma 12.

||wt+1 w ||2 ||ΠΩ(wt ηα1tgzt ηα2tg zt) w ||2, (150)

||wt ηα1tgzt ηα2tg zt w ||2, (151)

= ||wt w ||2 + η2||α1tgzt + α2tg zt||2

2η α1tgzt + α2tg zt, wt w , (152)

= ||wt w ||2 + η2||α1tgzt + α2tg zt||2

2ηα1t gzt, wt w 2ηα2t g zt, wt w , (153)

= ||wt w ||2 + η2||α1tgzt + α2tg zt||2 2ηα1t gzt, wt w

2ηα2t g zt, w t + ηgzt w , (154)

= ||wt w ||2 + η2||α1tgzt + α2tg zt||2 2ηα1t gzt, wt w

2ηα2t g zt, w t w 2η2α2t g zt, gzt , (155)

= ||wt w ||2 + η2||α1tgzt + α2tg zt||2 2η2α2t g zt, gzt

2ηα1t gzt, wt w 2ηα2t g zt, w t w . (156)

We now make use of gzt, wt w µ||w wt||2 (RSI condition),

||wt+1 w ||2 ||wt w ||2 + η2||α1tgzt + α2tg zt||2

2η2α2t g zt, gzt 2ηα1tµ||wt w ||2 2η α2tg zt, w t w , (157)

||wt+1 w ||2 (1 2ηα1tµ)||wt w ||2 + η2||α1tgzt + α2tg zt||2

2η2α2t g zt, gzt 2ηα2t g zt, w t w . (158)

Similarly using g zt, w t w µ||w w t||2 (RSI condition),

||wt+1 w ||2 (1 2ηα1tµ)||wt w ||2 + η2||α1tgzt + α2tg zt||2

2η2α2t g zt, gzt 2ηα2tµ||w t w ||2. (159)

We now upper bound ||w t w ||2, using lemma (9):

||wt+1 w ||2 (1 2ηα1tµ)||wt w ||2 + η2||α1tgzt + α2tg zt||2

2η2α2t g zt, gzt α2tηµ||wt w ||2 + 2α2tηµ||wt w t||2. (160)

This gives the following general form:

||wt+1 w ||2 (1 2ηα1tµ α2tηµ)||wt w ||2 + η2||α1tgzt + α2tg zt||2

2η2α2t g zt, gzt + 2α2tηµ||wt w t||2. (161)

We now use the inequality ||wt w t|| = η2||gzt|| η2β2||wt w || to upper bound the ﬁnal term, see SGD proof for derivation of inequality:

||wt+1 w ||2 (1 2ηα1tµ α2tηµ)||wt w ||2 + η2||α1tgzt + α2tg zt||2

2η2α2t g zt, gzt + 2η3β2µα2t||wt w ||, (162)

Paren, Berrada, Poudel and Kumar

||wt+1 w ||2 (1 2ηα1tµ α2tηµ + 2η3α2tβ2µ)||wt w ||2

+ η2||α1tgzt + α2tg zt||2 2η2α2t g zt, gzt . (163)

We now simplify the last two terms, starting with the ﬁrst:

η2||α1tgzt + α2tg zt||2 = η2((α1t)2||gzt||2 + 2α1tα2t gzt, g zt + (α2t)2||g zt||2). (164)

Plugging in the expressions for α1t, α2t,γt = 1, grouping like terms and simplifying gives the following. Note, we have excluded a few steps due to spatial constraints:

η2||α1tgzt + α2tg zt||2

= (lzt(w t) lzt(wt))2 + 2η(lzt(w t) lzt(wt)) gzt, g zt + η2||gzt||2||g zt||2

||gzt g zt||2 (165)

Plugging in α2t into the remaining term gives the following expressions:

2η2α2t g zt, gzt = 2η2||gzt||2 g zt, gzt 2η(lzt(w t) lzt(wt)) g zt, gzt ||gzt g zt||2 . (166)

Putting these together,

η2||α1tgzt + α2tg zt||2 2η2α2t g zt, gzt

= (lzt(w t) lzt(wt))2 + 2η(lzt(w t) lzt(wt)) gzt, g zt ||gzt g zt||2

+ η2||gzt||2||g zt||2 2η2||gzt||2 g zt, gzt 2η(lzt(w t) lzt(wt)) g zt, gzt ||gzt g zt||2 .

Cancelling terms gives,

η2||α1tgzt + α2tg zt||2 2η2α2t g zt, gzt

= (lzt(w t) lzt(wt))2 + η2||gzt||2||g zt||2 2η2||gzt||2 g zt, gzt ||gzt g zt||2 . (168)

From α2t 0 we have η||gzt||2 lzt(wt) lzt(w t) hence we can upper bound (lzt(w t) lzt(wt))2 by η2||g zt||4:

η2||α1tgzt + α2tg zt||2 2η2α2t g zt, gzt

η2||g zt||4 + η2||gzt||2||g zt||2 2η2||gzt||2 g zt, gzt ||gzt g zt||2 η2||gzt||2. (169)

Again we use the inequality ||wt w t|| = η2||gzt|| η2β2||wt w ||:

η2||α1tgzt + α2tg zt||2 2η2α2t g zt, gzt η2||gzt||2 η2β2||wt+1 w ||2 (170)

Hence we get the following expression:

||wt+1 w ||2 (1 2ηα1tµ α2tηµ + 2η3α2tβ2µ)||wt w ||2 + η2β2||wt+1 w ||2 , (171)

||wt+1 w ||2 (1 2ηα1tµ α2tηµ + 2η3α2tβ2µ + η2β2)||wt w ||2, (172)

||wt+1 w ||2 (1 ηµ α1tηµ + 2η3β2µ 2η3α1tβ2µ + η2β2)||wt w ||2. (173)

A Stochastic Bundle Method for Interpolation

Upper bounding α1t by 0,

||wt+1 w ||2 (1 ηµ + 2η3β2µ + η2β2)||wt w ||2. (174)

Taking expectations with respect to zt:

E[||wt+1 w ||2] E[(1 ηµ + 2η3β2µ + η2β2)||wt w ||2]. (175)

Noting that wk does not depend on zt, and neither does w due to the interpolation property:

E[||wt+1 w ||2] (1 ηµ + 2η3β2µ + η2β2)||wt w ||2. (176)

For the this step to be convergent we need the following condition to hold 2η3β2µ+η2β2 ηµ 0. However, for η ˆη = min{ 1

4µ, µ 2β2 } we have:

2η3β2µ + η2β2 ηµ 1

Hence, we recover the rate:

E[||wt+1 w ||2] 1 3

8 ˆηµ t ||w0 w ||2. (178)

E.4 Worst Case Rate

It is clear by inspection that the worst case rate derived corresponds to the MAX2 subproblem. Hence in the worst case this step is taken for all t, and thus a trivial induction gives the result of Theorem 2.

Appendix F. Additional Results

F.1 Cifar Hyperparameters and Variance

Here we detail the hyperparameters and variance for the ALI-G and BORAT results reported in table 6. For other optimisation methods please refer to Appendix E of Berrada et al. (2019b).

F.2 Empirical Run Time

Here we detail the eﬀect of increasing N on the run time of BORAT. Due to each update requiring N 1 gradient evaluations BORAT with N 2 take signiﬁcantly longer between updates than other methods. However, BORAT achieves good empirical convergence rates taking N 1 fewer parameter updates than other methods, as shown in the results section. Hence we consider the epoch time and show for each pass through the data BORAT has a similar run time to SGD. Increasing N both increases the run time of Algorithm 1 but additionally BORAT must compute extra dot products when calculating Q. A naive implementation of Algorithm 1 has a time complexity of O PN k=1 N! k!(N k)!k3 . However if we exploit the parallel nature of this algorithm where the sub problems are solved simultaneously, the time complexity reduces to

Paren, Berrada, Poudel and Kumar

Data Set Model Hyperparameters Test Accuracy

N η r batchzise Mean std

WRN 2 0.1 50 128 95.4 0.13 3 1 100 128 95.4 0.05 5 1 75 128 95.0 0.08

DN 2 0.1 100 64 94.5 0.09 3 1 75 256 94.9 0.13 5 1 75 128 94.9 0.13

WRN 2 0.1 50 512 76.1 0.21 3 0.1 50 256 76.0 0.16 5 0.1 50 128 75.8 0.22

DN 2 0.1 75 256 76.2 0.14 3 0.1 75 128 76.5 0.38 5 0.1 75 64 75.7 0.03

Table 6: Cifar Hyperparameters (BORAT)

O N3 as discussed in Section 3.6. Additionally we need only run Algorithm 1 once every N 1 batches so the per epoch time complexity is O N2 . Of course in practice Algorithm 1 is only responsible for a fraction of the run time, where its contribution is determined by the relative size of the model and N. Table 7 show with a parallel implementation the eﬀect of the this extra computation on the training epoch time isn t signiﬁcant and the time complexity scales approximately linearly with N. Moreover, Table 7 shows for large scale learning problems, such as Image Net the extra run time when increasing N is negligible.

Optimiser SGD BORAT BORAT BORAT BORAT

N 1 2 3 4 5 Time (s) 51.0 55.6 68.2 69.2 74.3

Optimiser BORAT BORAT BORAT BORAT BORAT

N 6 7 8 9 10 Time (s) 77.8 82.8 88.7 94.1 99.5

Table 7: Average BORAT training epoch time for CIFAR100 data set, shown for varying N. Time quoted using a batch size of 128, CIFAR100, CE loss, a Wide Res Net 40-4, and a parallel implantation of BORAT. All Optimiser had access to 3 CPU cores, and one TITAN Xp GPU.

A Stochastic Bundle Method for Interpolation

Optimiser BORAT BORAT BORAT

N 2 3 5 Time (s) 885.50 910.49 934.79

Table 8: Average BORAT training epoch time for Image Net data set, shown for varying N. Time quoted using a batch size of 1024, Image Net, CE loss, a Res Net18, and a parallel implantation of BORAT. All Optimiser had access to 12 CPU cores, and 4 TITAN Xp GPUs.

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeﬀrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoﬀrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelioné Man, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernandaégas Vi, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensor Flow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software available from tensorﬂow.org.

Hilal Asi and John C Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM Journal on Optimization, 2019.

Alfred Auslender. Bundle methods for machine learning. Numerical Methods for Nondiﬀerentiable Convex Optimization. Mathematical Programming Study, 2009.

Francis R. Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Co RR, 2011.

Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. International Conference on Learning Representations, 2018.

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signsgd: Compressed optimisation for non-convex problems. International Conference on Machine Learning, 2018.

Leonard Berrada, Andrew Zisserman, and M Pawan Kumar. Deep Frank-Wolfe for neural network optimization. International Conference on Learning Representations, 2019a.

Leonard Berrada, Andrew Zisserman, and M Pawan Kumar. Training neural networks for and by interpolation. International Conference on Machine Learning, 2019b.

Dimitri P. Bertsekas. Convex Optimization Theory. Athena Scientiﬁc, 2009.

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. Conference on Empirical Methods in Natural Language Processing, 2015.

Paren, Berrada, Poudel and Kumar

Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 2015.

Jinghui Chen and Quanquan Gu. Padam: Closing the generalization gap of adaptive gradient methods in training deep neural networks. ar Xiv preprint, 2018.

Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of adam-type algorithms for non-convex optimization. International Conference on Learning Representations, 2019.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. Conference on Empirical Methods in Natural Language Processing, 2017.

Alexandre Défossez and Francis Bach. Adabatch: Eﬃcient gradient aggregation rules for sequential and parallel stochastic gradient methods. ar Xiv preprint, 2017.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.

Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition, 2016.

João F Henriques, Sebastien Ehrhardt, Samuel Albanie, and Andrea Vedaldi. Small steps and giant leaps: Minimal newton solvers for deep learning. International Conference on Computer Vision, 2019.

Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. Conference on Computer Vision and Pattern Recognition, 2017.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.

Simon Lacoste-Julien and Martin Jaggi. Block-coordinate Frank-Wolfe optimization for structural SVMs. International Conference on Machine Learning, 2013.

Maksim Lapin, Matthias Hein, and Bernt Schiele. Loss functions for top-k error: Analysis and insights. Conference on Computer Vision and Pattern Recognition, 2016.

Claude Lemaréchal, Arkadi Nemirovski, and Yurii Nesterov. New variants of bundle methods. Math. Program, 1995.

Kﬁr Levy. Online to oﬄine conversions, universality and adaptive minibatch sizes. Neural Information Processing Systems, 2017.

A Stochastic Bundle Method for Interpolation

Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. Journal of Machine Learning Research, 2020.

Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. International Conference on Artiﬁcial Intelligence and Statistics, 2019.

Changliu Liu, Tomer Arnon, Christopher Lazarus, Clark Barrett, and Mykel J Kochenderfer. Algorithms for verifying deep neural networks. ar Xiv:1903.06758, 2019a.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. ar Xiv preprint, 2019b.

Francesco Locatello, Rajiv Khanna, Michael Tschannen, and Martin Jaggi. A uniﬁed optimization view on generalized matching pursuit and frank-wolfe. International Conference on Artiﬁcial Intelligence and Statistics, 2017.

Nicolas Loizou, Sharan Vaswani, Issam Laradji, and Simon Lacoste-Julien. Stochastic polyak step-size for sgd: An adaptive learning rate for fast convergence. ar Xiv preprint, 2020.

Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. International Conference on Learning Representations, 2017.

Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. International Conference on Learning Representations, 2019.

Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. International Conference on Learning Representations, 2019.

Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the eﬀectiveness of sgd in modern over-parametrized learning. International Conference on Machine Learning, 2018a.

Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. International Conference on Learning Representations, 2018b.

James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. International Conference on Machine Learning, 2015.

Mahesh Chandra Mukkamala and Matthias Hein. Variants of rmsprop and adagrad with logarithmic regret bounds. International Conference on Machine Learning, 2017.

Adam M Oberman and Mariana Prazeres. Stochastic gradient descent with polyak s learning rate. ar Xiv preprint, 2019.

Paren, Berrada, Poudel and Kumar

Francesco Orabona and Dávid Pál. Scale-free algorithms for online linear optimization. International Conference on Algorithmic Learning Theory, 2015.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic diﬀerentiation in pytorch. NIPS AutodiﬀWorkshop, 2017.

Boris Teodorovich Polyak. Minimization of unsmooth functionals. USSR Computational Mathematics and Mathematical Physics, 1969.

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. International Conference on Learning Representations, 2018.

Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, 1951.

Michal Rolinek and Georg Martius. L4: Practical loss-based stepsize adaptation for deep learning. Neural Information Processing Systems, 2018.

Tom Schaul, Sixin Zhang, and Yann Le Cun. No more pesky learning rates. International Conference on Machine Learning, 2013.

Frank Schneider, Lukas Balles, and Philipp Hennig. Deep OBS: A deep learning optimizer benchmark suite. International Conference on Learning Representations, 2019.

Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 2016.

Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. International Conference on Machine Learning, 2018.

Alexander J. Smola, S. V. N. Vishwanathan, and Quoc V. Le. Bundle methods for machine learning. Neural Information Processing Systems, 2007.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. Conference on Computer Vision and Pattern Recognition, 2015.

Conghui Tan, Shiqian Ma, Yu-Hong Dai, and Yuqiu Qian. Barzilai-borwein step size for stochastic gradient descent. Neural Information Processing Systems, 2016.

Tijmen Tieleman and Geoﬀrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 2012.

Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. International Conference on Artiﬁcial Intelligence and Statistics, 2019a.

A Stochastic Bundle Method for Interpolation

Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. ar Xiv preprint, 2019b.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. Neural Information Processing Systems, 2017.

Xiaoxia Wu, Rachel Ward, and Léon Bottou. WNGrad: Learn the learning rate in gradient descent. ar Xiv preprint, 2018.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. British Machine Vision Conference, 2016.

Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. Neural Information Processing Systems, 2018.

Matthew Zeiler. ADADELTA: an adaptive learning rate method. ar Xiv preprint, 2012.

Ziming Zhang, Yuanwei Wu, and Guanghui Wang. Bpgrad: Towards global optimality in deep learning via branch and pruning. Conference on Computer Vision and Pattern Recognition, 2017.

Shuai Zheng and James T Kwok. Follow the moving leader in deep learning. International Conference on Machine Learning, 2017.

Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. Sgd converges to global minimum in deep learning via star-convex path. International Conference on Learning Representations, 2019.