Using Statistics to Automate Stochastic Optimization

Hunter Lang, Pengchuan Zhang, Lin Xiao
Microsoft Research AI, Redmond, WA 98052, USA
{hunter.lang, penzhan, lin.xiao}@microsoft.com

Abstract. Despite the development of numerous adaptive optimizers, tuning the learning rate of stochastic gradient methods remains a major roadblock to obtaining good practical performance in machine learning. Rather than changing the learning rate at each iteration, we propose an approach that automates the most common hand-tuning heuristic: use a constant learning rate until "progress stops," then drop. We design an explicit statistical test that determines when the dynamics of stochastic gradient descent reach a stationary distribution. This test can be performed easily during training, and when it fires, we decrease the learning rate by a constant multiplicative factor. Our experiments on several deep learning tasks demonstrate that this statistical adaptive stochastic approximation (SASA) method can automatically find good learning rate schedules and match the performance of hand-tuned methods using default settings of its parameters. The statistical testing helps to control the variance of this procedure and improves its robustness.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Introduction

Stochastic approximation methods, including stochastic gradient descent (SGD) and its many variants, serve as the workhorses of machine learning with big data. Many tasks in machine learning can be formulated as the stochastic optimization problem

$$\text{minimize}_{x \in \mathbb{R}^n} \; F(x) := \mathbb{E}_\xi[f(x, \xi)],$$

where $\xi$ is a random variable representing data sampled from some (unknown) probability distribution, $x \in \mathbb{R}^n$ represents the parameters of the model (e.g., the weight matrices in a neural network), and $f$ is a loss function. In this paper, we focus on the following variant of SGD with momentum:

$$d_{k+1} = (1-\beta_k)\,g_k + \beta_k\,d_k, \qquad x_{k+1} = x_k - \alpha_k\,d_{k+1}, \qquad (1)$$

where $g_k = \nabla_x f(x_k, \xi_k)$ is a stochastic gradient, $\alpha_k > 0$ is the learning rate, and $\beta_k \in [0, 1)$ is the momentum coefficient. This approach can be viewed as an extension of the heavy-ball method (Polyak, 1964) to the stochastic setting.¹ To distinguish it from the classical SGD, we refer to the method (1) as SGM (Stochastic Gradient with Momentum).

¹ For fixed values of $\alpha$ and $\beta$, this normalized update formula is equivalent to the more common update $d_{k+1} = g_k + \beta d_k$, $x_{k+1} = x_k - \alpha' d_{k+1}$ under the reparametrization $\alpha = \alpha'/(1-\beta)$.

Theoretical conditions for the convergence of stochastic approximation methods are well established (see, e.g., Wasan, 1969; Kushner and Yin, 2003, and references therein). Unfortunately, these asymptotic conditions are insufficient in practice. For example, the classical schedule $\alpha_k = a/(k+b)^c$, where $a, b > 0$ and $1/2 < c \le 1$, often gives poor performance even when $a$, $b$, and $c$ are hand-tuned. Additionally, despite the advent of numerous adaptive variants of SGD and SGM (e.g., Duchi et al., 2011; Tieleman and Hinton, 2012; Kingma and Ba, 2014), achieving good performance in practice often still requires considerable hand-tuning (Wilson et al., 2017).

Figure 1 (panels: (a) training loss, (b) test accuracy, (c) learning rate): Smoothed training loss, test accuracy, and (global) learning rate schedule for an 18-layer ResNet model (He et al., 2016) trained on the CIFAR-10 dataset using four different methods (with constant momentum $\beta = 0.9$).
Adam: $\alpha = 0.0001$; SGM-const: $\alpha = 1.0$; SGM-poly: $a = 1.0$, $b = 1$, $c = 0.5$; SGM-hand: $\alpha = 1.0$, drop by 10 every 50 epochs.

Algorithm 1: General SASA method

```
Input: {x0, α0, M, β, ζ}
 1: for j ∈ {0, 1, ...} do
 2:     for k ∈ {jM, ..., (j+1)M − 1} do
 3:         Sample ξk.
 4:         Compute gk = ∇x f(xk, ξk).
 5:         dk+1 = (1 − β) gk + β dk
 6:         xk+1 = xk − α dk+1
 7:         // collect statistics
 8:     end for
 9:     if test(statistics) then
10:         α ← ζα
11:         // reset statistics
```

Figure 1 shows the training loss and test accuracy of a typical deep learning task using four different methods: SGM with constant step size (SGM-const), SGM with diminishing $O(1/\sqrt{k})$ step size (SGM-poly), Adam (Kingma and Ba, 2014), and hand-tuned SGM with learning rate scheduling (SGM-hand). For the last method, we decrease the step size by a multiplicative factor after a suitably long number of epochs ("constant-and-cut"). The relative performance depicted in Figure 1 is typical of many tasks in deep learning. In particular, SGM with a large momentum and a constant-and-cut step-size schedule often achieves the best performance. Many former and current state-of-the-art results use constant-and-cut schedules during training, such as those in image classification (Huang et al., 2018), object detection (Szegedy et al., 2015), machine translation (Gehring et al., 2017), and speech recognition (Amodei et al., 2016). Additionally, recent theoretical evidence indicates that in some (strongly convex) scenarios, the constant-and-cut scheme has better finite-time last-iterate convergence than other schedules (Ge et al., 2019).

Inspired by the success of the constant-and-cut scheduling approach, we develop an algorithm that can automatically decide "when to drop" $\alpha$. Most common heuristics try to identify when training progress has stalled. We formalize stalled progress as the event that the SGM dynamics (1), run with constant values of $\alpha$ and $\beta$, reach a stationary distribution. The existence of such a distribution seems to match many empirical results (e.g., Figure 1), though it need not exist in general.

Since SGM generates a rich set of information as it runs (namely $\{x_0, g_0, \ldots, x_k, g_k\}$), a natural approach is to collect statistics from this information and perform tests on them to decide whether the process (1) has reached a stationary distribution. We call this general method SASA: statistical adaptive stochastic approximation. Algorithm 1 summarizes it. The method performs the SGM updates (1) in phases of $M$ iterations, potentially computing additional statistics in each iteration. Once the $M$ iterations are complete, it performs a statistical test to decide whether to drop the learning rate by a factor $\zeta < 1$. Dropping after a fixed number of epochs and dropping based on the loss on a held-out validation set correspond to heuristic versions of Algorithm 1. In the rest of this work, we detail how to perform the test procedure and evaluate SASA on a wide range of deep learning tasks.
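To make the phase structure concrete, below is a minimal Python sketch of Algorithm 1's outer loop on a generic stochastic objective. It is an illustration only: the names (`sasa_outline`, `grad_fn`, `test_fn`) are ours, the collected statistic is a placeholder, and Sections 2 and 3 derive the statistic and test that SASA actually uses.

```python
import numpy as np

def sasa_outline(grad_fn, x0, alpha0=1.0, beta=0.9, M=1000, zeta=0.1,
                 test_fn=None, num_phases=5):
    """Sketch of Algorithm 1: run SGM (1) in phases of M iterations and,
    after each phase, consult a statistical test to decide whether to cut
    the learning rate by the factor zeta."""
    x = np.array(x0, dtype=float)
    d = np.zeros_like(x)
    alpha = alpha0
    stats = []  # per-iteration statistics collected during the current phase
    for _ in range(num_phases):
        for _ in range(M):
            g = grad_fn(x)                    # stochastic gradient g_k
            d = (1.0 - beta) * g + beta * d   # d_{k+1} = (1 - beta) g_k + beta d_k
            x = x - alpha * d                 # x_{k+1} = x_k - alpha d_{k+1}
            stats.append(float(g @ d))        # placeholder statistic; Section 2
                                              # derives the one SASA actually uses
        if test_fn is not None and test_fn(stats):
            alpha *= zeta                     # drop the learning rate
            stats = []                        # reset statistics
    return x, alpha
```

For a quick experiment, `grad_fn` could be, e.g., `lambda x: A @ x + 0.1 * np.random.randn(len(x))` for a noisy quadratic, with `test_fn` left as `None` to recover constant-rate SGM.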
1.1 Related Work and Contributions

The idea of using statistical testing to augment stochastic optimization methods goes back at least to Pflug (1983), who derived a stationarity condition for the dynamics of SGD on quadratic functions and designed a heuristic test to determine when the dynamics had reached stationarity. He used this test to schedule a fixed-factor learning rate drop. Chee and Toulis (2018) recently re-investigated Pflug's method for general convex functions.

Pflug's stationarity condition relies heavily on a quadratic approximation to $F$ and on limiting noise assumptions, as do several other recent works that derive stationarity conditions (e.g., Mandt et al., 2017; Chee and Toulis, 2018). Additionally, Pflug's test assumes that there is no correlation between consecutive samples in the optimization trajectory. Neither assumption holds in practice, which we show in Appendix C.2 can lead to poor predictivity of his condition.

Yaida (2018) derived a very general stationarity condition that does not depend on any assumption about the underlying function $F$ and applies under general noise conditions, regardless of the size of $\alpha$. Like Pflug (1983), Yaida (2018) used this condition to determine when to decrease $\alpha$, and showed good performance compared to hand-tuned SGM on two deep learning tasks with small models. However, Yaida's method does not account for the variance of the terms involved in the test, which we show can cause large variance in the resulting learning rate schedules in some cases. This variance can in turn cause poor empirical performance.

In this work, we show how to more rigorously perform statistical hypothesis testing on samples collected from the dynamics of SGM. We combine this statistical testing with Yaida's stationarity condition to develop an adaptive constant-and-cut optimizer (SASA) that we show is more robust than existing methods. Finally, we conduct large-scale experiments on a variety of deep learning tasks to demonstrate that SASA is competitive with the best hand-tuned and validation-tuned methods without requiring additional tuning.

2 Stationary Conditions

To design a statistical test that fires when SGM reaches a stationary distribution, we first need a condition that holds at stationarity and consists of terms we can estimate during training. To derive one, we analyze the long-run behavior of SGM with constant learning rate and momentum parameter:

$$d_{k+1} = (1-\beta)\,g_k + \beta\,d_k, \qquad x_{k+1} = x_k - \alpha\,d_{k+1}, \qquad (2)$$

where $\alpha > 0$ and $0 \le \beta < 1$. This process starts with $d_0 = 0$ and arbitrary $x_0$. Since $\alpha$ and $\beta$ are constant, the sequence $\{x_k\}$ does not converge to a local minimum, but the distribution of $\{x_k\}$ may converge to a stationary distribution. Let $\mathcal{F}_k$ denote the $\sigma$-algebra defined by the history of the process (2) up to time $k$, i.e., $\mathcal{F}_k = \sigma(d_0, \ldots, d_k;\, x_0, \ldots, x_k)$, and denote by $\mathbb{E}_k[\cdot] := \mathbb{E}[\,\cdot \mid \mathcal{F}_k]$ the expectation conditioned on that history. Assuming that $g_k$ is Markovian and unbiased, i.e.,

$$\mathbb{P}[g_k \mid \mathcal{F}_k] = \mathbb{P}[g_k \mid d_k, x_k], \qquad \mathbb{E}[g_k \mid d_k, x_k] = \nabla F(x_k), \qquad (3)$$

the SGM dynamics (2) form a homogeneous² Markov chain (Bach and Moulines, 2013; Dieuleveut et al., 2017) with continuous state $(d_k, x_k, g_k) \in \mathbb{R}^{3n}$. These assumptions are always satisfied when $g_k = \nabla_x f(x_k, \xi_k)$ for an i.i.d. sample $\xi_k$. We further assume that the SGM process converges to a stationary distribution, denoted $\pi(d, x, g)$.³

² Homogeneous means that the transition kernel is time-independent.

³ As stated in Section 1, this need not be true in general, but it seems to often be the case in practice.

With this notation, we need a relationship $\mathbb{E}_\pi[X] = \mathbb{E}_\pi[Y]$ for certain functions $X$ and $Y$ of $(x_k, d_k, g_k)$ that we can compute during training. If we further assume that the Markov chain is ergodic, then

$$\bar{z}_N := \frac{1}{N}\sum_{i=0}^{N-1}\big[X(x_i, d_i, g_i) - Y(x_i, d_i, g_i)\big] \;\longrightarrow\; \mathbb{E}_\pi[X] - \mathbb{E}_\pi[Y] = 0. \qquad (4)$$

We can then check the magnitude of the time-average $\bar{z}_N$ to see how close the dynamics are to reaching their stationary distribution. Next, we consider two different stationarity conditions.
2.1 Pflug's condition

Assuming $F(x) = \frac{1}{2}x^{\mathsf{T}}Ax$, where $A$ is positive definite with maximum eigenvalue $L$, and that the stochastic gradient satisfies $g_k = \nabla F(x_k) + r_k$ with $\mathbb{E}[r_k] = 0$ and $r_k$ independent of $x_k$, Pflug (1983) derived a stationarity condition for the SGD dynamics. His condition can be extended to the SGM dynamics: for appropriate $\alpha$ and $\beta$, the generalized Pflug stationarity condition says

$$\mathbb{E}_\pi[\langle g, d\rangle] = -\frac{\alpha(1-\beta)}{2(1+\beta)}\,\mathrm{tr}(A\Sigma_r) + O(\alpha^2), \qquad (5)$$

where $\Sigma_r$ is the covariance of the noise $r$. One can estimate the left-hand side during training by computing the inner product $\langle g_k, d_k\rangle$ in each iteration. Pflug (1983) also designed a clever estimator for the right-hand side, so it is possible to compute estimators for both sides of (5). However, the Taylor expansion in $\alpha$ used to derive (5) means that the relationship may only be accurate for small $\alpha$, whereas $\alpha$ is typically large in the first phase of training. This, together with the other assumptions required for Pflug's condition, is too strong for (5) to be useful in practice.

2.2 Yaida's condition

Yaida (2018) showed that as long as the stationary distribution exists, the following relationship holds exactly:

$$\mathbb{E}_\pi[\langle x, \nabla F(x)\rangle] = \frac{\alpha}{2}\cdot\frac{1+\beta}{1-\beta}\,\mathbb{E}_\pi[\langle d, d\rangle].$$

In particular, this holds for general functions $F$ and arbitrary values of $\alpha$. Because the stochastic gradients $g_k$ are unbiased, one can further show that

$$\mathbb{E}_\pi[\langle x, g\rangle] = \frac{\alpha}{2}\cdot\frac{1+\beta}{1-\beta}\,\mathbb{E}_\pi[\langle d, d\rangle]. \qquad (6)$$

In the quadratic, i.i.d.-noise setting of Section 2.1, the left-hand side of (6) is simply $\mathbb{E}_\pi[x^{\mathsf{T}}Ax]$, twice the average loss value at stationarity. This condition can therefore be viewed as a generalization of testing whether the loss is stationary.
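To see where (6) comes from, the following is a short derivation sketch; it is our reconstruction using only the stationarity of second moments under (2) (Yaida (2018) gives the general argument):

```latex
% At stationarity, E||x_{k+1}||^2 = E||x_k||^2. Expanding x_{k+1} = x_k - \alpha d_{k+1}
% and cancelling gives
\mathbb{E}_\pi\langle x_k, d_{k+1}\rangle = \tfrac{\alpha}{2}\,\mathbb{E}_\pi\|d\|^2 .
% Likewise, E<x_{k+1}, d_{k+1}> = E<x_k, d_k> at stationarity gives
\mathbb{E}_\pi\langle x, d\rangle
  = \mathbb{E}_\pi\langle x_k, d_{k+1}\rangle - \alpha\,\mathbb{E}_\pi\|d\|^2
  = -\tfrac{\alpha}{2}\,\mathbb{E}_\pi\|d\|^2 .
% Substituting d_{k+1} = (1-\beta) g_k + \beta d_k into the first identity:
\tfrac{\alpha}{2}\,\mathbb{E}_\pi\|d\|^2
  = (1-\beta)\,\mathbb{E}_\pi\langle x, g\rangle - \beta\,\tfrac{\alpha}{2}\,\mathbb{E}_\pi\|d\|^2
\;\Longrightarrow\;
\mathbb{E}_\pi\langle x, g\rangle = \frac{\alpha}{2}\cdot\frac{1+\beta}{1-\beta}\,\mathbb{E}_\pi\|d\|^2 .
```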
We can estimate both sides of (6) by computing $\langle x_k, g_k\rangle$ and $\langle d_k, d_k\rangle = \|d_k\|^2$ at each iteration and updating the running mean $\bar{z}_N$ with their difference. That is, we let

$$z_k = \langle x_k, g_k\rangle - \frac{\alpha}{2}\cdot\frac{1+\beta}{1-\beta}\,\langle d_k, d_k\rangle, \qquad \bar{z}_N = \frac{1}{N-B}\sum_{i=B+1}^{N} z_i. \qquad (7)$$

Here $B$ is the number of samples discarded as part of a burn-in phase to reduce the bias caused by starting far away from the stationary distribution; we typically take $B = N/2$, so that we use the most recent $N/2$ samples.

Yaida's condition has two key advantages over Pflug's: it holds with no approximation for arbitrary functions $F$ and any learning rate $\alpha$, and both of its sides can be estimated with negligible cost. In Appendix C.2, we show in Figure 14 that even on a strongly convex function, the error term in (5) is large, whereas $\bar{z}_N$ in (7) quickly converges to zero. Given these advantages, in the next section we focus on how to test (6), i.e., whether $\bar{z}_N$ defined in (7) is approximately zero.

3 Testing for Stationarity

By the Markov chain law of large numbers, we know that $\bar{z}_N \to 0$ as $N$ grows, but there are multiple ways to determine whether $\bar{z}_N$ is close enough to zero that we should drop the learning rate.

Deterministic test. If, in addition to $\bar{z}_N$ in (7), we keep track of

$$\bar{v}_N = \frac{1}{N-B}\sum_{i=B+1}^{N} \frac{\alpha}{2}\cdot\frac{1+\beta}{1-\beta}\,\langle d_i, d_i\rangle, \qquad (8)$$

a natural idea is to test

$$|\bar{z}_N| < \delta\,\bar{v}_N, \quad\text{or equivalently}\quad |\bar{z}_N / \bar{v}_N| < \delta, \qquad (9)$$

to detect stationarity, where $\delta > 0$ is an error tolerance. The $\bar{v}_N$ term is introduced to make the tolerance $\delta$ relative to the scale of $z$ and $v$ ($\bar{v}_N$ is always nonnegative). If $\bar{z}_N$ satisfies (9), then the dynamics (2) are deemed close to stationarity. This is precisely the method used by Yaida (2018). However, because $\bar{z}_N$ is a random variable, there is some potential for error in this procedure due to its variance, which is unaccounted for by (9). Especially when we aim to make a critical decision based on the outcome of this test (i.e., dropping the learning rate), it seems important to account for this variance more directly. To do so, we appeal to statistical hypothesis testing.

I.i.d. t-test. The simplest approach to accounting for the variance of $\bar{z}_N$ is to assume that each sample $z_i$ is drawn i.i.d. from the same distribution. Then, by the central limit theorem, $\sqrt{N}\,\bar{z}_N \to \mathcal{N}(0, \sigma_z^2)$, and moreover the sample variance $\hat{\sigma}_N^2 = \frac{1}{N-1}\sum_{i=1}^{N}(z_i - \bar{z}_N)^2$ converges to $\sigma_z^2$ for large $N$. So we can estimate the variance of $\bar{z}_N$'s sampling distribution using the sample variance of the $z_i$'s. Using this estimate, we can form the $(1-\gamma)$ confidence interval

$$\Big(\bar{z}_N - t^*_{1-\gamma/2}\,\frac{\hat{\sigma}_N}{\sqrt{N}},\;\; \bar{z}_N + t^*_{1-\gamma/2}\,\frac{\hat{\sigma}_N}{\sqrt{N}}\Big),$$

where $t^*_{1-\gamma/2}$ is the $(1-\gamma/2)$ quantile of the Student's t-distribution with $N-1$ degrees of freedom. Then we can check whether

$$\Big(\bar{z}_N - t^*_{1-\gamma/2}\,\frac{\hat{\sigma}_N}{\sqrt{N}},\;\; \bar{z}_N + t^*_{1-\gamma/2}\,\frac{\hat{\sigma}_N}{\sqrt{N}}\Big) \subset \big(-\delta\,\bar{v}_N,\; \delta\,\bar{v}_N\big). \qquad (10)$$

If so, we can be confident that $\bar{z}_N$ is close to zero. The method of Pflug (1983, Algorithm 4.2) is also a kind of i.i.d. test that tries to account for the variance of $\bar{z}_N$, but in a more heuristic way than (10). The procedure (10) can be thought of as a relative equivalence test in statistical hypothesis testing (e.g., Streiner, 2003). When $\hat{\sigma}_N = 0$ (no variance) or $\gamma = 1$ ($t^*_{1-\gamma/2} = 0$, no confidence), it recovers (9). Unfortunately, in our case samples $z_i$ evaluated at nearby points are highly correlated (due to the underlying Markov dynamics), which makes this procedure inappropriate. To deal with correlated samples, we appeal to a stronger Markov chain result than the law of large numbers (4).

Markov chain t-test. Under suitable conditions, Markov chains admit the following analogue of the central limit theorem.

Theorem 1 (Markov chain CLT, informal; Jones et al., 2006). Let $X = \{X_0, X_1, \ldots\}$ be a Harris ergodic Markov chain with state space $\mathsf{X}$ and stationary distribution $\pi$ that satisfies any one of a number of additional ergodicity criteria (see Jones et al. (2006), page 6). For suitable functions $z : \mathsf{X} \to \mathbb{R}$, we have

$$\sqrt{N}\,(\bar{z}_N - \mathbb{E}_\pi z) \;\to\; \mathcal{N}(0, \sigma_z^2),$$

where $\bar{z}_N = \frac{1}{N}\sum_{i=0}^{N-1} z(X_i)$ is the running mean of $z(X_i)$ over time, and $\sigma_z^2 \neq \mathrm{var}_\pi\,z$ in general, due to correlations in the Markov chain.

This shows that in the presence of correlation, the sample variance is not the correct estimator for the variance of $\bar{z}_N$'s sampling distribution. In light of Theorem 1, given a consistent estimator of $\sigma_z^2$, we can properly perform the test (10). All that remains is to construct such an estimator.

Batch means variance estimator. Methods for estimating the asymptotic variance $\sigma_z^2$ of the history-average estimator $\bar{z}_N$ in (7) on a Markov chain are well studied in the MCMC (Markov chain Monte Carlo) literature. They can be used to set a stopping time for an MCMC simulation and to determine the simulation's random error (Jones et al., 2006). We present one of the simplest such estimators, the batch means estimator. Given $N$ samples $\{z_i\}$, divide them into $b$ batches each of size $m$, and compute the batch means $\bar{z}_j = \frac{1}{m}\sum_{i=jm}^{(j+1)m-1} z_i$ for each batch $j$. Then let

$$\hat{\sigma}_N^2 = \frac{m}{b-1}\sum_{j=0}^{b-1} (\bar{z}_j - \bar{z}_N)^2. \qquad (11)$$

That is, $\hat{\sigma}_N^2$ is simply the (scaled) variance of the batch means around the full mean $\bar{z}_N$. When used in the test (10), it has $b-1$ degrees of freedom. Intuitively, when $b$ and $m$ are both large enough, the batch means are roughly independent because of the mixing of the Markov chain, so their sample variance gives a good estimator of $\sigma_z^2$. Jones et al. (2006) survey the formal conditions under which $\hat{\sigma}_N^2$ is a strongly consistent estimator of $\sigma_z^2$, and suggest taking $b = m = \lfloor\sqrt{N}\rfloor$ (the theoretically correct sizes of $b$ and $m$ depend on the mixing of the Markov chain). Flegal and Jones (2010) prove strong consistency for a related method called overlapping batch means (OLBM) that has better asymptotic variance. The OLBM estimator is similar to (11), but uses $N - b + 1$ overlapping batches of size $b$ and has $N - b$ degrees of freedom.
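Before turning to the full algorithm, here is a minimal NumPy/SciPy sketch of this Markov chain t-test with the batch means estimator (11); the function name and array interface are ours, and Algorithms 2 and 3 below give the corresponding pseudocode.

```python
import numpy as np
from scipy import stats

def sasa_test(z, v, delta=0.02, gamma=0.2):
    """Sketch of the test (10) with the batch means estimator (11).
    `z` and `v` hold the post-burn-in samples z_i and v_i; returns True
    if the confidence interval for the mean of z lies inside the
    equivalence region (-delta * vbar, delta * vbar)."""
    z = np.asarray(z, dtype=float)
    N = len(z)
    zbar, vbar = z.mean(), float(np.mean(v))
    b = m = int(np.sqrt(N))                  # b = m = floor(sqrt(N)) batches
    if b < 2:
        return False                         # too few samples to estimate variance
    batch_means = z[: b * m].reshape(b, m).mean(axis=1)
    sigma2 = m / (b - 1) * np.sum((batch_means - zbar) ** 2)   # eq. (11)
    # (1 - gamma) confidence interval with b - 1 degrees of freedom
    half_width = stats.t.ppf(1 - gamma / 2, b - 1) * np.sqrt(sigma2 / N)
    lo, hi = zbar - half_width, zbar + half_width
    return lo > -delta * vbar and hi < delta * vbar
```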
Algorithm 2: SASA

```
Input: {x0, α0, M, β, δ, γ, ζ}
 1: zQ = HalfQueue()
 2: vQ = HalfQueue()
 3: for j ∈ {0, 1, 2, ...} do
 4:     for k ∈ {jM, ..., (j+1)M − 1} do
 5:         Sample ξk and compute gk = ∇x f(xk, ξk)
 6:         dk+1 = (1 − β) gk + β dk
 7:         xk+1 = xk − α dk+1
 8:         zQ.push(⟨xk, gk⟩ − (α/2)·((1+β)/(1−β))·‖dk+1‖²)
 9:         vQ.push((α/2)·((1+β)/(1−β))·‖dk+1‖²)
10:     end for
11:     if test(zQ, vQ, δ, γ) then
12:         α ← ζα
13:         zQ.reset()
14:         vQ.reset()
```

Algorithm 3: Test

```
Input: {zQ, vQ, δ, γ}
Output: boolean (whether to drop)
 1: z̄N = (1/zQ.N) Σt zQ[t]
 2: v̄N = (1/vQ.N) Σt vQ[t]
 3: m = b = ⌊√(zQ.N)⌋
 4: for i ∈ {0, ..., b − 1} do
 5:     z̄i = (1/m) Σ_{t=im}^{(i+1)m−1} zQ[t]
 6: end for
 7: σ̂²N = (m/(b−1)) Σ_{i=0}^{b−1} (z̄i − z̄N)²
 8: L = z̄N − t*_{1−γ/2} · σ̂N / √(zQ.N)
 9: U = z̄N + t*_{1−γ/2} · σ̂N / √(zQ.N)
10: return [L, U] ⊂ (−δ v̄N, δ v̄N)
```

3.1 Statistical adaptive stochastic approximation (SASA)

Finally, we turn the above analysis into an adaptive algorithm for detecting stationarity of SGM and decreasing $\alpha$, and discuss implementation details. Algorithm 2 describes the full SASA algorithm. To diminish the effect of initialization bias from starting outside the stationary distribution, we only keep the latter half of the samples $z_i$ and $v_i$: if $N$ total iterations of SGM have been run, the HalfQueues zQ and vQ contain the most recent $N/2$ values of $z_i$ and $v_i$ (these queues pop every other time they push). If we decrease the learning rate, we empty the queues; otherwise, we keep collecting samples. Computing the batch means estimator requires $O(N)$ space, but in deep learning the total number of training iterations (the worst-case size of these queues) is usually small compared to the number of parameters of the model. Collecting the samples $z_i$ and $v_i$ requires only two more inner products per iteration than SGM.

The test procedure (Algorithm 3) follows the Markov chain t-test discussed above. Lines 1-2 compute the running means $\bar{z}_N$ and $\bar{v}_N$; lines 3-7 compute the variance estimator $\hat{\sigma}_N^2$ according to (11); and lines 8-10 determine whether the corresponding confidence interval for $\bar{z}_N$ lies within the acceptable interval $(-\delta\bar{v}_N, \delta\bar{v}_N)$. Like the sample collection, the test procedure is computationally efficient: the batch means and overlapping batch means estimators can both be computed with a 1D convolution. For all experiments, we use the default values $\delta = 0.02$ and $\gamma = 0.2$. In equivalence testing, $\gamma$ is typically taken larger than usual to increase the power of the test (Streiner, 2003). We discuss the apparent multiple-testing problem of this sequential testing procedure in Appendix D.
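For concreteness, the "pop every other push" behavior of the HalfQueue can be sketched as follows; the class and attribute names follow Algorithm 2's pseudocode, but the implementation details are our assumptions.

```python
from collections import deque

class HalfQueue:
    """Sketch of the HalfQueue in Algorithm 2: it keeps (roughly) the most
    recent half of everything pushed by popping the oldest element on every
    other push. reset() empties it after a learning rate drop."""
    def __init__(self):
        self._buf = deque()
        self._pushes = 0

    def push(self, value):
        self._buf.append(value)
        self._pushes += 1
        if self._pushes % 2 == 0:   # pop every other push: keep the newest half
            self._buf.popleft()

    def reset(self):
        self._buf.clear()
        self._pushes = 0

    @property
    def N(self):
        return len(self._buf)       # number of retained samples

    def __getitem__(self, i):
        return self._buf[i]
```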
4 Experiments

To evaluate the performance of SASA, we run Algorithm 2 on several models from deep learning and compare it to tuned versions of Adam and SGM. Many adaptive optimizers do not compare against SGM with hand-tuned step-size scheduling (e.g., Schaul et al., 2013; Zhang and Mitliagkas, 2017; Baydin et al., 2018), and instead compare against SGM with a fixed $\alpha$ or with tuned polynomial decay. As detailed in Section 1, tuned constant-and-cut schedules are typically a stronger baseline. Throughout this section, we do not tune the SASA parameters $\delta$, $\gamma$, $M$; we use the default settings $\delta = 0.02$ and $\gamma = 0.2$ and set $M$ to one epoch (we test the statistics once per epoch). In each experiment, we use the same $\alpha_0$ and $\zeta$ as for the best SGM baseline. We stress that SASA is not fully automatic: it requires choices of $\alpha_0$ and $\zeta$, but we show in Appendix A that SASA achieves good performance for different values of $\zeta$.

We use weight decay in every experiment: without weight decay, there are simple examples, such as logistic regression on separable data, where the process (2) does not converge to a stationary distribution. While weight decay does not guarantee convergence to a stationary distribution, it at least rules out this simple case. Finally, we conduct an experiment on CIFAR-10 showing that directly accounting for the variance of the test statistic, as in (10), improves the robustness of this procedure compared to (9).

Figure 2: Training loss, test accuracy, and learning rate schedule for SASA, SGM, and Adam on different datasets. Top: ResNet18 on CIFAR-10. Middle: ResNet18 on ImageNet. Bottom: RNN model on WikiText-2. In all cases, starting with the same $\alpha_0$, SASA achieves similar performance to the best hand-tuned or validation-tuned SGM result. Across three independent runs, the variance of each optimizer's best test accuracy was never larger than 1%, and the relative orderings between optimizers held for every run. Figure 5 studies the variance of SASA in a semi-synthetic setting.

Figure 3 (panels (a)-(d)): Evolution of the different statistics for SASA over the course of training ResNet18 on CIFAR-10 using the default parameters $\delta = 0.02$, $\gamma = 0.2$, $\zeta = 0.1$. Panel (a) shows the raw data for both sides of condition (6), i.e., the values of $\langle x_k, g_k\rangle$ and $\frac{\alpha}{2}\frac{1+\beta}{1-\beta}\langle d_k, d_k\rangle$ at each iteration. Panel (b) shows $\bar{z}_N$ with its lower and upper confidence bounds [lci, uci] and the right-hand-side (rhs) bounds $(-\delta\bar{v}_N, \delta\bar{v}_N)$ (see Eq. (10)). Panel (c) is a zoomed-in version of (b) showing the drop points in more detail. Panel (d) depicts the different variance estimators (i.i.d., batch means, overlapping batch means) over the course of training. The i.i.d. variance (green) is a poor estimate of $\sigma_z^2$.

For hand-tuned SGM (SGM-hand), we searched over constant-and-cut schemes for each experiment by tuning $\alpha_0$, the drop frequency, and the drop amount with grid search. In all experiments, SASA and SGM use a constant $\beta = 0.9$. For Adam, we tuned the initial global learning rate as in Wilson et al. (2017) and used $\beta_1 = 0.9$, $\beta_2 = 0.999$. We also allowed Adam a warmup phase to prevent it from decreasing the learning rate too quickly: to warm up Adam, we initialize it with the parameters obtained after running SGM with constant $\alpha_0$ for a tuned number of iterations. While the warmup phase improves Adam's performance, it still does not match SASA or SGM on the tasks we tried. Appendix A contains a full list of the hyperparameters used in each experiment, additional results for object detection, sensitivity analysis for $\delta$ and $\gamma$, and plots of the different estimators for the variance $\sigma_z^2$.

Figure 4: Smoothed training loss, test accuracy, and learning rate schedule for ResNet18 trained on ImageNet using SASA with different values of $\zeta$. SASA automatically adapts the drop frequency.

CIFAR-10. We trained an 18-layer ResNet model⁴ (He et al., 2016) on CIFAR-10 (Krizhevsky and Hinton, 2009) with random cropping and random horizontal flipping for data augmentation, and weight decay 0.0005. Row 1 of Figure 2 compares the best performance of each method. Here SGM-hand uses $\alpha_0 = 1.0$ and $\beta = 0.9$ and drops $\alpha$ by a factor of 10 ($\zeta = 0.1$) every 50 epochs. SASA uses $\gamma = 0.2$ and $\delta = 0.02$, as always. Adam has a tuned global learning rate $\alpha_0 = 0.0001$ and a tuned warmup phase of 50 epochs, but is unable to match SASA and tuned SGM.

⁴ In our experiments on CIFAR-10, we used the slightly modified ResNet model of https://github.com/kuangliu/pytorch-cifar, which we found to give a small performance gain over the model of He et al. (2016) for all optimizers we tested. The first convolutional layer of this model uses filter size 3, stride 1, and padding 1, rather than 7, 2, and 3, respectively.
Evolution of statistics. Figure 3 shows the evolution of SASA's different statistics over the course of training the ResNet18 model on CIFAR-10 using the default parameter settings $\delta = 0.02$, $\gamma = 0.2$, $\zeta = 0.1$. In each phase, the running average of the difference between the statistics, $\bar{z}_N$, decays toward zero. The learning rate drops once $\bar{z}_N$ and its confidence interval are contained in $(-\delta\bar{v}_N, \delta\bar{v}_N)$; see Eq. (10). After the drop, the statistics increase in value and enter another phase of convergence. The batch means (BM) and overlapping batch means (OLBM) estimators give very similar estimates of the variance, while the i.i.d. variance estimator, as expected, gives quite different values.

ImageNet. Unlike CIFAR-10, reaching a good performance level on ImageNet (Deng et al., 2009) seems to require more gradual annealing. Even when tuned and allowed a long warmup phase, Adam failed to match the generalization performance of SGM. On the other hand, SASA matched the performance of hand-tuned SGM using the default values of its parameters. We again used an 18-layer ResNet model with random cropping, random flipping, normalization, and weight decay 0.0001. Row 2 of Figure 2 shows the performance of the different optimizers.

RNN. We also evaluate SASA on a language modeling task using an RNN. In particular, we train the PyTorch word-level language model example (2019) on the WikiText-2 dataset (Merity et al., 2016). We used 600-dimensional embeddings, 600 hidden units, tied weights, dropout 0.65, and gradient clipping with threshold 2.0 (note that this model is not state-of-the-art for WikiText-2). We compare against SGM and Adam with (global) learning rate tuned using a validation set. These baselines drop the learning rate by a factor of 4 when the validation loss stops improving. Row 3 of Figure 2 shows that, without using the validation set, SASA is competitive with these baselines.

Adaptation to the drop factor. At first glance, the choice of the drop factor $\zeta$ seems critical. However, Figure 4 shows that SASA automatically adapts to different values of $\zeta$. When $\zeta$ is larger, so that $\alpha$ decreases more slowly, the dynamics converge more quickly to the stationary distribution, so the overall rate of decrease stays roughly constant across different values of $\zeta$. Aside from the different choices of $\zeta$, all other hyperparameters were the same as in the ImageNet experiment of Figure 2.

Variance. Figure 5 shows the variance in learning rate schedule and training loss for the two tests (9) (top row) and (10) (bottom row) with a fixed testing frequency of $M = 400$ iterations, across five independent runs. The model is ResNet18 trained on CIFAR-10 using the same procedure as
in the previous CIFAR experiment, but with different batch sizes. The left two columns use batch size four, and the right two use batch size eight. With the same testing frequency and the same value of $\delta = 0.02$, the test (9) is much more sensitive to the level of noise in these small-batch examples. When the batch size is four, only one of the training runs using the test (9) achieves training loss on the same scale as the others. Appendix B contains additional discussion comparing these two tests.

Figure 5: Variance in learning rate schedule and training loss for the two tests (9) (top row) and (10) (bottom row) with fixed testing frequency $M$, across five independent runs. The left two columns use batch size four, and the right two use batch size eight. With the same testing frequency and the same value of $\delta$ (0.02), the test (9) is much more sensitive to the level of noise. In row 1, column 2, only one of the five runs (plotted in red) achieves a low training loss, because of the high variance in the schedules (row 1, column 1).

5 Conclusion

We provide a theoretically grounded statistical procedure for automatically determining when to decrease the learning rate in constant-and-cut methods. On the tasks we tried, SASA was competitive with the best hand-tuned schedules for SGM, and it came close to the performance of SGM and Adam when they were tuned using a validation set. The statistical testing procedure controls the variance of the method and makes it more robust than other, more heuristic tests. Our experiments across several different tasks and datasets did not require any adjustment to the parameters $\gamma$, $\delta$, or $M$. We believe these practical results indicate that automatic constant-and-cut algorithms are a promising direction for future research in adaptive optimization.

We used a simple statistical test to check Yaida's stationarity condition (6). However, there may be better tests that more properly control the false discovery rate (Blanchard and Roquain, 2009; Lindquist and Mejia, 2015), or more sophisticated conditions that also account for non-stationary dynamics like overfitting or limit cycles (Yaida, 2018). Such techniques could make the SASA approach more broadly useful.

References

PyTorch word language model. https://github.com/pytorch/examples/tree/master/word_language_model, 2019.

Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pages 173-182, 2016.

Francis Bach and Eric Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems, pages 773-781, 2013.

Atilim Günes Baydin, Robert Cornish, David Martínez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In Proceedings of the Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada, 2018.

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289-300, 1995.

Gilles Blanchard and Étienne Roquain. Adaptive false discovery rate control under independence and dependence. Journal of Machine Learning Research, 10(Dec):2837-2871, 2009.

Jerry Chee and Panos Toulis. Convergence diagnostics for stochastic gradient descent with constant learning rate. In International Conference on Artificial Intelligence and Statistics, pages 1476-1485, 2018.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Aymeric Dieuleveut, Alain Durmus, and Francis Bach.
Bridging the gap between constant step size stochastic gradient descent and Markov chains. arXiv preprint arXiv:1707.06386, 2017.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121-2159, 2011.

James M. Flegal and Galin L. Jones. Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics, 38(2):1034-1070, 2010.

Rong Ge, Sham M. Kakade, Rahul Kidambi, and Praneeth Netrapalli. The step decay schedule: A near optimal, geometrically decaying learning rate procedure. arXiv preprint arXiv:1904.12838, 2019.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, pages 1243-1252, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961-2969, 2017.

Yanping Huang, Youlong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. GPipe: Efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1811.06965, 2018.

Galin L. Jones, Murali Haran, Brian S. Caffo, and Ronald Neath. Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101(476):1537-1547, 2006.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Harold J. Kushner and G. George Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2003.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014.

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117-2125, 2017.

Martin A. Lindquist and Amanda Mejia. Zen and the art of multiple comparisons. Psychosomatic Medicine, 77(2):114, 2015.

Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. The Journal of Machine Learning Research, 18(1):4873-4907, 2017.

Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of instance segmentation and object detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018.

John H. McDonald. Handbook of Biological Statistics, volume 2. 2009.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

Georg Ch. Pflug. On the determination of the step size in stochastic quasigradient methods. Collaborative Paper CP-83-025, International Institute for Applied Systems Analysis (IIASA), Laxenburg, Austria, 1983.
Georg Ch. Pflug. Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size. Monatshefte für Mathematik, 110:297-314, 1990.

Boris T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.

Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In International Conference on Machine Learning, pages 343-351, 2013.

David L. Streiner. Unicorns do exist: A tutorial on proving the null hypothesis. The Canadian Journal of Psychiatry, 48(11):756-761, 2003.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26-31, 2012.

M. T. Wasan. Stochastic Approximation. Cambridge University Press, 1969.

Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pages 4148-4158, 2017.

Sho Yaida. Fluctuation-dissipation relations for stochastic gradient descent. arXiv preprint arXiv:1810.00004, 2018.

Jian Zhang and Ioannis Mitliagkas. YellowFin and the art of momentum tuning. arXiv preprint arXiv:1706.03471, 2017.