Random Function Descent

Felix Benning (University of Mannheim, felix.benning@uni-mannheim.de)
Leif Döring (University of Mannheim, leif.doering@uni-mannheim.de)

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Classical worst-case optimization theory neither explains the success of optimization in machine learning, nor does it help with step size selection. In this paper we demonstrate the viability and advantages of replacing the classical convex function framework with a random function framework. With complexity $O(n^3 d^3)$, where $n$ is the number of steps and $d$ the number of dimensions, Bayesian optimization with gradients has not been viable in large dimension so far. By bridging the gap between Bayesian optimization (i.e. random function optimization theory) and classical optimization we establish viability. Specifically, we use a stochastic Taylor approximation to rediscover gradient descent, which is scalable in high dimension due to $O(nd)$ complexity. This rediscovery yields a specific step size schedule we call Random Function Descent (RFD). The advantage of this random function framework is that RFD is scale invariant and that it provides a theoretical foundation for common step size heuristics such as gradient clipping and gradual learning rate warmup.

1 Introduction

Cost function minimization is one of the most fundamental mathematical problems in machine learning. Gradient-based methods, popular for this task, require a step size, typically chosen using established heuristics. This article aims to deepen the theoretical understanding of these heuristics and proposes a new algorithm based on this insight.

Classical optimization theory uses L-smoothness, which limits the rate of change of the gradient by $L$, to provide convergence guarantees for learning rates smaller than $1/L$ [e.g. 38]. As this theory is based on an upper bound (the worst case), the learning rate $1/L$ is naturally much more conservative than necessary on average. Even if $L$ were known, this learning rate would therefore be impractical. Since line search algorithms typically require access to full cost function evaluations, the field of machine learning (ML) relies heavily on step size heuristics [e.g. 48, 49, 42, 20]. To investigate these heuristics, we introduce new ideas based on a random function perspective. While automatic step size selection in the convex function framework is possible [11], convexity is generally only satisfied asymptotically and locally. So the understanding of the initial stages of optimization, which includes the warmup heuristic [20], greatly benefits from a framework which also admits non-convex functions. This objective is achieved by the random function framework we investigate.

Many successful algorithms in computer science are significantly slower in the worst case than in the average case based on a probabilistic framework (e.g. Quicksort [23] or the simplex algorithm [e.g. 6]). On random quadratic functions the average case behavior of first order optimizers is already being investigated by the ML community [e.g. 58, 43, 33, 12, 9, 40, 41]. Interested in the landscape of high dimensional random functions as a model for spin glasses, the physics community independently started studying the average case of optimization as well [e.g. 4, 15, 37, 51, 24], albeit not geared towards ML algorithms.

Average case analysis fundamentally requires a prior distribution over possible cost functions.
The evaluations seen so far then result in a posterior over the cost at other parameter inputs. Using this posterior for optimization is called Bayesian optimization (BO) [e.g. 32, 47, 16, 2], which is best known in the ML community in the context of low dimensional optimization (e.g. hyperparameter tuning). BO is treated like a zero order method for low dimensional problems due to the $O(n^3)$ complexity of the covariance matrix inversion over the $n$ evaluations seen so far, which increases to $O(n^3 d^3)$ when gradient information is included [e.g. 35, 55], where $d$ is the input dimension of our cost function. This limits classic BO to relatively small dimensions even under sparsity considerations [e.g. 45, 39]. While the BO algorithms developed in the random function framework might not have been viable in high dimension so far, due to their computational complexity, this framework is already used to explain the high relative frequency of saddle points in high dimension [10] and to explain the highly predictable progress optimizers make on high dimensional cost functions [5].

In this work we bridge the gap between BO and (computationally viable) gradient based methods, derived from the first Taylor approximation, with the introduction of a stochastic Taylor approximation based on a forgetful BO posterior. The optimization method Random Function Descent (RFD), resulting from the minimization of this stochastic Taylor approximation, coincides with a specific form of gradient descent, which establishes its viability in high dimension. The advantages of its BO heritage are scale invariance and an explicit step size schedule, which illuminates the inner workings of step size heuristics such as gradient clipping [42] and gradual learning rate warmup [20].

Our contributions and outline. The main goal of this paper is to demonstrate the viability and advantages of replacing the classical convex function framework with a random function framework. Theorem 4.2 is the main theoretical result establishing viability (computability and scalable complexity) for a given covariance model. Section 6 is concerned with practical estimation of the covariance model, and viability is demonstrated with a practical example in the MNIST case study (Section 7). The advantages of this approach are scale invariance (Advantage 2.3) and an explicit step size schedule, which does not require expensive tuning and explains existing ML heuristics such as warmup (cf. Section 5.2). This explanation of the initial stage of optimization could never be delivered by the convex framework, because the convexity assumption is not fulfilled initially, so it can at best explain asymptotic behavior.

Sec. 2 We motivate a stochastic Taylor approximation and RFD and prove its scale invariance.
Sec. 3 We briefly motivate and discuss the common distributional assumptions in BO.
Sec. 4 We establish the connection between RFD and gradient descent.
Sec. 5 We investigate the step size schedule suggested by RFD. In particular we
  0. calculate explicit formulas for the step size schedules resulting from common covariance models (Table 1, Sec. C),
  1. analyze the general asymptotic behavior (Sec. 5.1),
  2. discuss how RFD explains gradient clipping and learning rate warmup (Sec. 5.2).
Sec. 6 We develop a non-parametric variance estimation method, which is robust with respect to the choice of covariance kernel. Finally, we present an extension of RFD to mini-batch losses.
Sec. 7 We conduct a case study on the MNIST dataset.
Sec. 8 We discuss extensions (see also Sec. E) and limitations.
2 The random function descent algorithm

The classic derivation of gradient descent [e.g. 38, p. 29] adds an L-smoothness based trust bound to the first Taylor approximation $T[\mathbf{J}(\theta) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)]$ of the cost function $\mathbf{J}$ around $w$, resulting in the gradient step
$$ w - \tfrac{1}{L}\nabla\mathbf{J}(w) = \operatorname*{argmin}_\theta \Big\{ T[\mathbf{J}(\theta) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)] + \tfrac{L}{2}\|\theta - w\|^2 \Big\}. $$
Our unusual notation for the Taylor approximation $T[\mathbf{J}(\theta) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)]$ is meant to highlight the connection to the stochastic Taylor approximation we define below.

Definition 2.1 (Stochastic Taylor approximation). We define the first order stochastic Taylor approximation of a random (cost) function¹ $\mathbf{J}$ around $w$ by the conditional expectation $\mathbb{E}[\mathbf{J}(\theta) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)]$. This is the best $L^2$ approximation [30, Cor. 8.17] of $\mathbf{J}(\theta)$ provided first order knowledge of $\mathbf{J}$ at $w$.

¹ Remark on terminology: stochastic process [e.g. 53], random field [e.g. 1] and random function [e.g. 36] are all synonyms. However, the latter seems most descriptive of random variables in the set of functions. Gaussian processes are naturally Gaussian stochastic processes, i.e. Gaussian random functions. To better distinguish random functions from deterministic functions, we use bold letters to denote random functions (as the usual convention of capitalizing random variables often clashes with other conventions for functions).

Figure 1: The stochastic Taylor approximation naturally contains a trust bound, in contrast to the classical one. Here $\mathbf{J}$ is a Gaussian random function (with covariance as in Equation (11), with length scale $s = 2$ and variance $\sigma^2 = 1$). The ribbon represents two conditional standard deviations around the conditional expectation.

We call this the stochastic Taylor approximation because this approximation only makes use of derivatives in a single point. While the standard Taylor approximation is a polynomial approximation, the stochastic Taylor approximation is the best approximation in an $L^2$ sense and already mean-reverting by itself, i.e. it naturally incorporates covariance-based trust (cf. Figure 1). While L-smoothness-based trust guarantees that the gradient still points in the direction we are going (for learning rates smaller than $1/L$), covariance-based trust tells us whether the derivative is still negative on average. Minimizing the stochastic Taylor approximation is therefore optimized for the average case. Since convergence proofs for gradient descent typically rely on an improvement guarantee, proving convergence is significantly harder in the average case and we answer this question only partially in Corollary 5.3.

Definition 2.2 (Random Function Descent, RFD). Select $w_{n+1}$ as the minimizer² of the first order stochastic Taylor approximation
$$ w_{n+1} := \operatorname*{argmin}_w \mathbb{E}[\mathbf{J}(w) \mid \mathbf{J}(w_n), \nabla\mathbf{J}(w_n)]. $$

² We ignore throughout the main body that the argmin could be set-valued and that the $w_n$ would be random variables (cf. Section D.1.1 for a formal approach).

Properties of RFD. Before we make RFD more explicit in Section 4, we discuss some properties which are easier to see in the abstract form. First, observe that RFD is greedy and forgetful in the same way gradient descent is greedy and forgetful when derived as the minimizer of the regularized first Taylor approximation, or the Newton method as the minimizer of the second Taylor approximation. This is because the Taylor approximation only uses derivatives from the last point $w_n$ (forgetful), and we minimize this approximation (greedy). Since momentum methods retain some information about past gradients, they are not as forgetful. We therefore expect a similar improvement could be made for RFD in the future.

Second, it is well known that classical gradient descent with exogenous step sizes (and most other first order methods) lacks the scale invariance property of the Newton method [e.g. 21, 13]. Scale invariance means that scaling the input parameters $w$ or the cost itself (e.g. by switching from the mean squared error to the sum squared error) does not change the points selected by the optimization method.

Advantage 2.3 (Scale invariance). RFD is invariant to additive shifts and positive scaling of the cost $\mathbf{J}$. RFD is also invariant with respect to transformations of the parameter input of $\mathbf{J}$ by differentiable bijections whose Jacobian is invertible everywhere (e.g. invertible linear maps).

While invariance to bijections of inputs is much stronger than the affine invariance offered by the Newton method, non-linear bijections will typically break the isotropy assumption of the following section, which makes RFD explicit. This invariance should therefore be viewed as an opportunity to look for the bijection of inputs which ensures isotropy (e.g. a whitening transformation). The discussion of geometric anisotropy in Section E.1 is conducive to building an understanding of this.
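To make Definition 2.1 concrete, here is a minimal numerical sketch (not the paper's implementation) of the first order stochastic Taylor approximation for a one-dimensional Gaussian random function with squared exponential covariance, as in Figure 1; the parameter values, the evaluation grid and the observed values `j0`, `g0` are illustrative assumptions.

```python
import numpy as np

# Illustrative parameters (not from the paper's code): prior mean mu,
# squared exponential covariance with variance sigma2 and length scale s.
mu, sigma2, s = 0.0, 1.0, 2.0
w, j0, g0 = 0.0, -0.3, 0.8          # observed cost and derivative at w (1d example)

theta = np.linspace(-6, 6, 401)      # where we evaluate the stochastic Taylor approximation

# Covariances implied by Cov(J(x), J(y)) = C(|x-y|^2/2) with C(h) = sigma2*exp(-h/s^2):
k_val  = sigma2 * np.exp(-(theta - w) ** 2 / (2 * s**2))   # Cov(J(theta), J(w))
k_grad = k_val * (theta - w) / s**2                        # Cov(J(theta), J'(w))
var_val, var_grad = sigma2, sigma2 / s**2                  # Var(J(w)), Var(J'(w))

# J(w) and J'(w) are uncorrelated, so the conditional expectation splits into two terms.
cond_mean = mu + k_val / var_val * (j0 - mu) + k_grad / var_grad * g0
cond_var  = sigma2 - k_val**2 / var_val - k_grad**2 / var_grad

print(cond_mean[:3], cond_mean[-3:])   # far from w, the approximation reverts to mu
print(theta[np.argmin(cond_mean)])     # the (one-dimensional) RFD step of Definition 2.2
```

The mean reversion towards $\mu$ away from $w$ is exactly the covariance-based trust discussed above, and the argmin of the conditional expectation over the grid is the one-dimensional analogue of the RFD step of Definition 2.2.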
3 A distribution over cost functions

It is impossible to make average case analysis explicit without a distribution over functions, so we use the canonical distributional assumption of Bayesian optimization [e.g. 16, 55, 44]: isotropic Gaussian random functions. This assumption was also used in the high dimensional setting by Dauphin et al. [10] to argue that saddle points are much more common than minima in high dimension, which is often cited to explain why second order methods are uncommon in machine learning.

To motivate isotropy, we note that in average case analysis the uniform distribution is popular, since it weighs all problem instances equally (e.g. all possible permutations in sorting). Isotropy is such a uniformity assumption, which essentially requires
$$ \text{"}P(\mathbf{J} = J) = P(\mathbf{J} = J \circ \varphi)\text{"} \qquad \text{for all isometries } \varphi. $$
In other words, the probability that our cost function is equal to $J$ is equal to the probability that it is equal to a shifted and turned version of $J$, given by $J \circ \varphi$. Since the probability of any single realization of a cost function $J$ is zero, the equation we put in quotes is mathematically unsound. The formal definition follows below.

Definition 3.1 (Isotropy). A random function $\mathbf{J}$ is called isotropic if its distribution stays the same under isometric transformations of its input, i.e. for any isometry $\varphi$ we have $\mathbf{J} \circ \varphi \overset{d}{=} \mathbf{J}$.

If $\mathbf{J}$ is Gaussian, isotropy is well known [e.g. 44, 1] to be equivalent to the condition that there exist $\mu \in \mathbb{R}$ and a function $C : [0, \infty) \to \mathbb{R}$ such that for all $w, \tilde w \in \mathbb{R}^d$ the expectation and covariance are
$$ \mathbb{E}[\mathbf{J}(w)] = \mu, \qquad \operatorname{Cov}\big(\mathbf{J}(w), \mathbf{J}(\tilde w)\big) = C\Big(\tfrac{\|w - \tilde w\|^2}{2}\Big). $$
For these isotropic Gaussian random functions we use the notation $\mathbf{J} \sim \mathcal{N}(\mu, C)$. We discuss generalizations of isotropy in Sections F and E.1, but for ease of exposition we retain the (stationary) isotropy assumption throughout the main body. Note that the Gaussian assumption can be statistically tested in practice (cf. Figure 4), but it is also straightforward to reproduce our results with the best linear unbiased estimator (BLUE) (Section E.3) in place of the conditional expectation to remove the Gaussian assumption.

We finally want to highlight that, in contrast to the uniformity assumption on finite sets, isotropic Gaussian random functions leave us with a family of plausible distributions. It is therefore necessary to estimate $\mu$ and $C$, which is the topic of Section 6.
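As a small illustration of Definition 3.1, the following sketch (with arbitrary toy parameters, not the paper's code) draws one realization of an isotropic Gaussian random function $\mathbf{J} \sim \mathcal{N}(\mu, C)$ on a grid directly from the covariance kernel $C(\|w - \tilde w\|^2/2)$.

```python
import numpy as np

# Minimal sketch: sample one cost landscape J ~ N(mu, C) on a 2d grid, where
# Cov(J(w), J(w')) = C(||w - w'||^2 / 2) with the squared exponential C(h) = sigma2*exp(-h/s^2).
rng = np.random.default_rng(0)
mu, sigma2, s = 1.0, 1.0, 0.5

xs = np.linspace(-2, 2, 30)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)      # 900 input points w

sq_dist = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)    # ||w - w'||^2
K = sigma2 * np.exp(-sq_dist / (2 * s**2))                        # C(||w - w'||^2 / 2)

L = np.linalg.cholesky(K + 1e-9 * np.eye(len(grid)))              # jitter for stability
sample = mu + L @ rng.standard_normal(len(grid))                  # one realization of J
print(sample.mean(), sample.std())                                # roughly mu and sqrt(sigma2)
```

Because the covariance matrix only depends on pairwise distances, any rotation or shift of the grid leaves the distribution of the sample unchanged, which is exactly the isometry invariance required by Definition 3.1.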
4 Relation to gradient descent

While we were able to define RFD abstractly without any assumptions on the distribution $P_{\mathbf{J}}$ of the random cost $\mathbf{J}$, an explicit calculation requires distributional assumptions, and we have motivated isotropic Gaussian random functions in Section 3 for this purpose. The assumption of isotropy allows for an explicit version of the stochastic Taylor approximation, which then immediately leads to an explicit version of RFD.

Lemma 4.1 (Explicit first order stochastic Taylor approximation). For $\mathbf{J} \sim \mathcal{N}(\mu, C)$, the first order stochastic Taylor approximation is given by
$$ \mathbb{E}[\mathbf{J}(w - d) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)] = \mu + \frac{C(\|d\|^2/2)}{C(0)}\,\big(\mathbf{J}(w) - \mu\big) - \frac{C'(\|d\|^2/2)}{C'(0)}\,\langle d, \nabla\mathbf{J}(w)\rangle. $$

The explicit version of RFD follows by fixing the step size $\eta = \|d\|$ and optimizing over the direction first.

Theorem 4.2 (Explicit RFD). Let $\mathbf{J} \sim \mathcal{N}(\mu, C)$; then RFD coincides with gradient descent
$$ w_{n+1} = w_n - \eta^*_n \frac{\nabla\mathbf{J}(w_n)}{\|\nabla\mathbf{J}(w_n)\|}, $$
where the RFD step sizes are given by
$$ \eta^*_n := \operatorname*{argmin}_{\eta \in \mathbb{R}} \; \frac{C(\eta^2/2)}{C(0)}\,\big(\mathbf{J}(w_n) - \mu\big) - \eta\,\frac{C'(\eta^2/2)}{C'(0)}\,\|\nabla\mathbf{J}(w_n)\|. \tag{1} $$

Table 1: RFD step size (cf. Figure 2 and Eq. (11), (13), (14) for the formal definitions of the models). In particular, $s$ is the length scale in all covariance models.

| Model | RFD step size $\eta^*$ (general case, $\Theta := \frac{\|\nabla\mathbf{J}(w)\|}{\mu - \mathbf{J}(w)}$) | $\mathbf{J}(w) = \mu$ ($\Theta = \infty$) | A-RFD ($\Theta \to 0$) |
|---|---|---|---|
| Matérn ($\nu = 3/2$) | $\frac{s^2\Theta}{3 + \sqrt{3}\, s\Theta}$ | $\frac{s}{\sqrt{3}} \approx 0.58\, s$ | $\frac{1}{3} s^2\Theta$ |
| Matérn ($\nu = 5/2$) | $\frac{s}{\sqrt{5}} \cdot \frac{(1-\zeta) + \sqrt{4 + (1+\zeta)^2}}{2(1+\zeta)}$ with $\zeta := \frac{\sqrt{5}}{3 s\Theta}$ | $\approx 0.72\, s$ | $\frac{3}{5} s^2\Theta$ |
| Squared exponential | $\frac{s^2 \|\nabla\mathbf{J}(w)\|}{\sqrt{\big(\frac{\mu - \mathbf{J}(w)}{2}\big)^2 + s^2\|\nabla\mathbf{J}(w)\|^2} \;+\; \frac{\mu - \mathbf{J}(w)}{2}}$ | $s$ | $s^2\Theta$ |
| Rational quadratic | $\sqrt{\beta}\, s\, \tilde\eta^*$, with $\tilde\eta^*$ the unique positive root of $\frac{\sqrt{\beta}}{s\Theta}\tilde\eta(1+\tilde\eta^2) + (1+\beta)\tilde\eta^2 - 1$ (cf. Thm. C.4) | $s\sqrt{\tfrac{\beta}{1+\beta}}$ | $s^2\Theta$ |

While the descent direction is a universal property for all isotropic Gaussian random functions, it follows from (1) that the step sizes depend much more on the specific covariance structure. In particular they depend on the decay rate of the covariance acting as the trust bound.

Remark 4.3 (Scalable complexity). While Bayesian optimization typically has computational complexity $O(n^3 d^3)$ in the number of steps $n$ and dimensions $d$ [55, 45], RFD under the isotropy assumption has the same computational complexity as gradient descent (i.e. $O(nd)$).

Remark 4.4 (Step until the given information is no longer informative). While L-smoothness-based trust prescribes step sizes that guarantee the slope to point downwards over the entire step, RFD prescribes steps which are exactly large enough that the gradient is no longer correlated to the previously observed evaluations. This is because the first order condition at the minimizer $w = w_{n+1}$ demands
$$ 0 \overset{!}{=} \nabla_w\, \mathbb{E}[\mathbf{J}(w) \mid \mathbf{J}(w_n), \nabla\mathbf{J}(w_n)] = \mathbb{E}[\nabla\mathbf{J}(w) \mid \mathbf{J}(w_n), \nabla\mathbf{J}(w_n)]. $$
And for measurable functions $\phi : \mathbb{R}^{d+1} \to \mathbb{R}$ such that $\Phi = \phi(\mathbf{J}(w_n), \nabla\mathbf{J}(w_n))$ is sufficiently integrable, $\Phi$ is then uncorrelated from $\partial_i \mathbf{J}(w_{n+1})$ by the first order condition
$$ \operatorname{Cov}\big(\partial_i \mathbf{J}(w_{n+1}), \Phi\big) = \mathbb{E}\Big[ \underbrace{\mathbb{E}[\partial_i \mathbf{J}(w_{n+1}) \mid \mathbf{J}(w_n), \nabla\mathbf{J}(w_n)]}_{=0}\,\big(\Phi - \mathbb{E}[\Phi]\big) \Big] = 0. $$

5 The RFD step size schedule

While classical theory leads to learning rates, RFD suggests step sizes applied to normalized gradients, representing the actual length of the step. In the following we thus make the distinction
$$ w_{n+1} = w_n - \underbrace{h_n}_{\text{learning rate}} \nabla\mathbf{J}(w_n) = w_n - \underbrace{\eta_n}_{\text{step size}} \frac{\nabla\mathbf{J}(w_n)}{\|\nabla\mathbf{J}(w_n)\|}. $$

To get a better feel for the step sizes suggested by RFD, it is enlightening to divide (1) by $\mu - \mathbf{J}(w_n)$, which results in a minimization problem
$$ \eta^* := \eta^*(\Theta) := \operatorname*{argmin}_\eta q_\Theta(\eta) \quad \text{for} \quad q_\Theta(\eta) := -\frac{C(\eta^2/2)}{C(0)} - \eta\,\frac{C'(\eta^2/2)}{C'(0)}\,\Theta, \tag{2} $$
which is parametrized only by the gradient cost quotient $\Theta_n := \frac{\|\nabla\mathbf{J}(w_n)\|}{\mu - \mathbf{J}(w_n)}$, i.e. $\eta^*_n = \eta^*(\Theta_n)$. This minimization problem can be solved explicitly for the most common [44, ch. 4] differentiable isotropic covariance models, see Table 1, Figure 2 and Appendix C for details.
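The following minimal sketch applies the explicit RFD update of Theorem 4.2 with the squared exponential step size from Table 1 to a toy quadratic cost; the cost function and the values of `mu` and `s` are illustrative assumptions (in practice $\mu$, $C(0)$ and $C'(0)$ are estimated as in Section 6).

```python
import numpy as np

# Minimal sketch of the explicit RFD update (Theorem 4.2) with the squared
# exponential step size of Table 1. The toy cost and the values of mu and s are
# illustrative assumptions, not the paper's implementation.
def rfd_step_size_se(cost, grad_norm, mu, s):
    gap = mu - cost                              # mu - J(w); roughly 0 at initialization
    return s**2 * grad_norm / (np.sqrt(gap**2 / 4 + s**2 * grad_norm**2) + gap / 2)

def toy_cost_and_grad(w):                        # stand-in cost, not from the paper
    return 0.5 * np.dot(w, w), w

mu, s = 21.0, 1.0                                # assumed mean cost (above J(w_0)) and scale
w = np.full(10, 2.0)
for _ in range(50):
    cost, grad = toy_cost_and_grad(w)
    gnorm = np.linalg.norm(grad)
    eta = rfd_step_size_se(cost, gnorm, mu, s)   # step length from Table 1
    w = w - eta * grad / gnorm                   # normalized gradient step (Theorem 4.2)
print(toy_cost_and_grad(w)[0])                   # cost after 50 RFD steps
```

The update only needs the current cost, the gradient and its norm, which is where the $O(nd)$ complexity of Remark 4.3 comes from.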
Figure 2: RFD step sizes as a function of $\Theta = \frac{\|\nabla\mathbf{J}(w)\|}{\mu - \mathbf{J}(w)}$, assuming scale $s = 1$ (cf. Table 1). A-RFD (Definition 5.1) is plotted as dashed lines. A-RFD of the rational quadratic coincides with A-RFD of the squared exponential covariance.

Figure 2 can be interpreted as follows: At the start of optimization, the cost $\mathbf{J}(w)$ should be roughly equal to the average cost $\mu$, so the gradient cost quotient $\Theta$ is infinite and the step sizes are therefore given by $\eta^*(\infty)$ (also listed in its own column in Table 1). As we start minimizing, the difference $\mu - \mathbf{J}(w)$ becomes positive. Towards the end of minimization this difference no longer changes as the cost no longer decreases, i.e. towards the end the gradient cost quotient $\Theta$ is roughly linear in the gradient norm $\|\nabla\mathbf{J}(w)\|$. The derivative $\frac{d}{d\Theta}\eta^*(0)$ of $\eta^*(\Theta)$ at zero then effectively results in a constant asymptotic learning rate.

5.1 Asymptotic learning rate

To explain the claim above, note that the gradient cost quotient $\Theta$ converges to zero towards the end of optimization, because the gradient norm converges to zero. A first order Taylor expansion of $\eta^*$ would therefore imply
$$ \eta^*(\Theta) \approx \eta^*(0) + \tfrac{d}{d\Theta}\eta^*(0)\,\Theta = \underbrace{\frac{\tfrac{d}{d\Theta}\eta^*(0)}{\mu - \mathbf{J}(w)}}_{\text{asymptotic learning rate}}\,\|\nabla\mathbf{J}(w)\|, $$
assuming $\eta^*(0) = 0$ and differentiability of $\eta^*$, which is a reasonable educated guess based on the examples in Figure 2. But since the RFD step sizes $\eta^*$ are abstractly defined as an argmin, it is necessary to formalize this intuition for general covariance models. First, we define asymptotic step sizes as an object towards which we can prove convergence. Then we prove convergence, showing they are well defined. In addition, we obtain a more explicit formula for the asymptotic learning rate.

Definition 5.1 (A-RFD). We define the step sizes of asymptotic RFD (A-RFD) to be the minimizer of the second order Taylor approximation $T_2 q_\Theta$ of $q_\Theta$ around zero
$$ \hat\eta(\Theta) := \operatorname*{argmin}_\eta T_2 q_\Theta(\eta) = -\frac{C(0)}{C'(0)}\,\Theta = \underbrace{\frac{C(0)}{C'(0)\,(\mathbf{J}(w) - \mu)}}_{\text{asymptotic learning rate}}\,\|\nabla\mathbf{J}(w)\|. $$

In the following we prove that these are truly asymptotically equal to the step sizes $\eta^*$ of RFD.

Proposition 5.2 (A-RFD is well defined). Let $\mathbf{J} \sim \mathcal{N}(\mu, C)$ and assume there exists $\eta_0 > 0$ such that the correlation for larger distances $\eta \ge \eta_0$ is bounded away from 1, i.e. $\frac{C(\eta^2/2)}{C(0)} \le \rho$ for some $\rho \in (0, 1)$. Then the step sizes of RFD are asymptotically equal to the step sizes of A-RFD, i.e. $\hat\eta(\Theta) \sim \eta^*(\Theta)$ as $\Theta \to 0$.

Note that the assumption is essentially always satisfied, since the Cauchy-Schwarz inequality implies
$$ C\Big(\tfrac{\|w - \tilde w\|^2}{2}\Big) = \operatorname{Cov}\big(\mathbf{J}(w), \mathbf{J}(\tilde w)\big) \le \sqrt{\operatorname{Var}(\mathbf{J}(w))\,\operatorname{Var}(\mathbf{J}(\tilde w))} = C(0), $$
where equality requires the random variables to be almost surely equal [30]. If the random function is not periodic or constant, this inequality will generally be strict. In the proof, this requirement is only needed to ensure that $\eta^*$ is not very large. The smallest local minimum of $q_\Theta$ is always close to $\hat\eta$ even without this assumption (which ensures it is a global minimum). Figure 2 illustrates that $\eta^* \to 0$ should imply $\Theta \to 0$, resulting in a weak convergence guarantee.

Corollary 5.3. Assume $\eta^* \to 0$ implies $\Theta \to 0$, the cost $\mathbf{J}$ is bounded, has continuous gradients and RFD converges to some point $w_\infty$. Then $w_\infty$ is a critical point and the RFD step sizes $\eta^*$ are asymptotically equal to $\hat\eta$.

For the squared exponential covariance model we formally prove that $\eta^*$ is strictly monotonically increasing in $\Theta$ and thus $\eta^* \to 0$ implies $\Theta \to 0$ (Prop. C.3). The bounded and continuous gradients assumptions are almost surely satisfied for all sufficiently smooth covariance functions [cf. 1], where three times differentiable is more than enough smoothness.
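A quick numerical check of Proposition 5.2 for the squared exponential model (the values of $s$ and $\Theta$ are arbitrary illustrative choices):

```python
import numpy as np

# The RFD step size of Table 1 (written in terms of Theta) approaches the
# A-RFD step size s^2 * Theta as the gradient cost quotient Theta -> 0.
s = 2.0
for theta in [10.0, 1.0, 0.1, 0.01, 0.001]:
    eta_rfd = s**2 * theta / (np.sqrt(0.25 + (s * theta) ** 2) + 0.5)  # eta*(Theta)
    eta_arfd = s**2 * theta                                            # A-RFD (Definition 5.1)
    print(theta, eta_rfd, eta_arfd, eta_rfd / eta_arfd)                # ratio -> 1
```

The ratio approaches 1 as $\Theta \to 0$, while for large $\Theta$ the RFD step size plateaus near $\eta^*(\infty) = s$ instead of growing linearly like A-RFD.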
5.2 RFD step sizes explain common step size heuristics

Asymptotically, RFD suggests constant learning rates, similar to the classical L-smooth setting. We thus define these asymptotic learning rates (as the limit of the learning rates $h_n$ of iteration $n$) to be
$$ h_\infty := \frac{C(0)}{C'(0)\,\big(\mathbf{J}(w_\infty) - \mu\big)}, \tag{3} $$
where $\mathbf{J}(w_\infty)$ is the cost we reach in the limit. If we used these asymptotic learning rates from the start, step sizes would become too large for large gradients, as RFD step sizes exhibit a plateau (cf. Figure 2). To emulate the behavior of RFD with a piecewise linear function, we could introduce a cutoff whenever our step size exceeds the initial step size $\eta^*(\infty)$, i.e.
$$ w_{n+1} = w_n - \min\Big\{ h_\infty,\ \frac{\eta^*(\infty)}{\|\nabla\mathbf{J}(w_n)\|} \Big\}\,\nabla\mathbf{J}(w_n). \qquad \text{(gradient clipping)} $$
At this point we have rediscovered gradient clipping [42]. Since the rational quadratic covariance has the same asymptotic learning rate $h_\infty$ for every $\beta$, its parameter $\beta$ controls the step size bound $\eta^*(\infty)$ of gradient clipping (cf. Table 1, Figure 2).

Pascanu et al. [42] motivated gradient clipping with the geometric interpretation of movement towards a wall placed behind the minimum. This suggests that clipping should happen towards the end of training. This stands in contrast to a more recent step size heuristic, (linear) warmup [20], which suggests smaller learning rates at the start (i.e. $h_0 = \eta^*(\infty)/\|\nabla\mathbf{J}(w_0)\|$) and a gradual ramp-up to the asymptotic learning rate $h_\infty$. In other words, gradients are not clipped due to some wall next to the minimum, but because the step sizes would otherwise be too large at the start. Goyal et al. [20] further observe that constant warmup (i.e. a step change of learning rates akin to gradient clipping) performs worse than gradual warmup. Since RFD step sizes suggest this gradual increase, we argue that they may have discovered RFD step sizes empirically (also cf. Figure 3).
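The following sketch contrasts the piecewise linear gradient clipping emulation above with the exact squared exponential RFD step size from Table 1; all numbers are illustrative assumptions rather than measured values.

```python
import numpy as np

# Compare the clipped step length min(h_inf, eta*(inf)/|grad|)*|grad| with the
# exact RFD step length for the squared exponential model (Section 5.2, Table 1).
mu, s, cost_limit = 1.0, 1.0, 0.0                  # prior mean, length scale, J(w_inf)
h_inf = s**2 / (mu - cost_limit)                   # asymptotic learning rate, eq. (3) for SE
eta_init = s                                       # eta*(infinity) for SE, cf. Table 1

for grad_norm in [10.0, 3.0, 1.0, 0.3, 0.1]:
    clipped = min(h_inf, eta_init / grad_norm) * grad_norm           # clipped step length
    theta = grad_norm / (mu - cost_limit)                            # gradient cost quotient
    rfd = s**2 * theta / (np.sqrt(0.25 + (s * theta) ** 2) + 0.5)    # exact RFD step length
    print(f"|grad|={grad_norm:5.1f}  clipped step={clipped:.3f}  RFD step={rfd:.3f}")
```

For large gradients both rules take steps of roughly the initial length $\eta^*(\infty)$, while for small gradients both reduce to the asymptotic learning rate $h_\infty$ applied to the gradient; RFD simply interpolates smoothly between the two regimes, which is the gradual warmup behavior discussed above.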
6 Mini-batch loss and covariance estimation

Since we do not have access to evaluations of the cost $\mathbf{J}$ in practice, we need to prove some results about stochastic losses $\boldsymbol\ell_i$ before we can apply RFD in practice. For this, assume that we have independent identically distributed (iid) data $X_i$, independent of the true relationship $\mathbf{f}$ drawn from $P_{\mathbf{f}}$, resulting in labels $Y_i = \mathbf{f}(X_i) + \varsigma_i$, where we have added independent iid noise $\varsigma_i$, resulting in loss and cost
$$ \boldsymbol\ell_i(w) := \ell\big(w, (X_i, Y_i)\big) \quad \text{and} \quad \mathbf{J}(w) := \mathbb{E}[\boldsymbol\ell_i(w) \mid \mathbf{f}]. $$
In this setting we confirm (cf. Lemma D.9) that the stochastic approximation errors $\boldsymbol\epsilon_i(w) := \boldsymbol\ell_i(w) - \mathbf{J}(w)$ are independent conditional on the true relationship $\mathbf{f}$. In particular they (and all their derivatives) are uncorrelated and also uncorrelated from $\mathbf{J}$. It follows that mini-batch losses
$$ \mathbf{L}_b(w) := \frac{1}{b}\sum_{i=1}^b \boldsymbol\ell_i(w) = \mathbf{J}(w) + \frac{1}{b}\sum_{i=1}^b \boldsymbol\epsilon_i(w) \tag{4} $$
have variance
$$ \operatorname{Var}(\mathbf{L}_b(w)) = \operatorname{Var}(\mathbf{J}(w)) + \tfrac{1}{b}\operatorname{Var}(\boldsymbol\epsilon_1(w)) \overset{\text{isotropy}}{=} C(0) + \tfrac{1}{b} C_\epsilon(0), \tag{5} $$
where we assume $\mathbf{J} \sim \mathcal{N}(\mu, C)$ and $\boldsymbol\epsilon_i \sim \mathcal{N}(0, C_\epsilon)$ in the last equation for simplicity. But this step does not yet require the distributional Gaussian assumption beyond the mean and variance.

6.1 Variance estimation

Recall that the asymptotic learning rate $h_\infty$ in Equation (3) only depends on $C(0)$ and $C'(0)$. So if we estimate these values, we are certain to get the right RFD step sizes asymptotically without knowing the entire covariance kernel $C$. Equation (5) reveals that for $Z_b := (\mathbf{L}_b(w) - \mu)^2$ we have
$$ \mathbb{E}[Z_b] = \beta_0 + \tfrac{1}{b}\beta_1, \quad \text{i.e.} \quad Z_b = \beta_0 + \tfrac{1}{b}\beta_1 + \text{noise}, $$
with bias $\beta_0 = C(0)$ and slope $\beta_1 = C_\epsilon(0)$. So a linear regression on samples $(\tfrac{1}{b_k}, Z_{b_k})_{k \le n}$ allows for the estimation of $\beta_0$ and $\beta_1$.

Using the Gaussian assumption from (5), the variance of $Z_b$ is the centered fourth moment of $\mathbf{L}_b$, which is given by
$$ \sigma_b^2 := \operatorname{Var}(Z_b) = \mathbb{E}\big[(\mathbf{L}_b(w) - \mu)^4\big] - \mathbb{E}\big[(\mathbf{L}_b(w) - \mu)^2\big]^2 = 2\operatorname{Var}(\mathbf{L}_b(w))^2 = 2\big(\beta_0 + \tfrac{1}{b}\beta_1\big)^2. $$
In particular the variance of $Z_b$ depends on the batch size $b$: the linear regression is heteroskedastic. Weighted least squares (WLS) [e.g. 28, Theorem 4.2] is designed to handle this case, but its application requires the variance of $Z_b$. Since $\beta_0, \beta_1$ are the parameters we wish to estimate, we find ourselves in the paradoxical situation that we need $\beta$ to obtain $\beta$. Our solution to this problem is to start with a guess of $\beta_0, \beta_1$, apply WLS to obtain a better estimate and repeat this bootstrapping procedure until convergence. Since all $Z_b$ have the same underlying cost $\mathbf{J}$, we sample the parameters $w$ randomly to reduce their covariance (details in Sec. B).

The same procedure can be applied to obtain $C'(0)$, where the counterpart of Equation (5) is given by
$$ \operatorname{Var}(\partial_i \mathbf{L}_b(w)) = \operatorname{Var}(\partial_i \mathbf{J}(w)) + \tfrac{1}{b}\operatorname{Var}(\partial_i \boldsymbol\epsilon_1(w)) \overset{\text{isotropy}}{=} -\Big(C'(0) + \tfrac{1}{b} C_\epsilon'(0)\Big). $$

Remark 6.1. Under the isotropy assumption the partial derivatives are iid, so the expectation of $\|\nabla\mathbf{L}_b(w)\|^2 = \sum_{i=1}^d (\partial_i \mathbf{L}_b(w))^2$ is this variance scaled by $d$. In particular the variance needs to scale with $\tfrac{1}{d}$ to keep the gradient norms (and thus the Lipschitz constant of $\mathbf{J}$) stable. This observation is closely related to isoperimetry [e.g. 7]; for details see [5]. Removing the isotropy assumption and estimating the variance component-wise is most likely how adaptive step sizes [e.g. 14, 29], like the ones used by Adam, work (cf. Sec. E.1).

Batch size distribution. Before we can apply linear regression to the samples $(\tfrac{1}{b_k}, Z_{b_k})_{k \le n}$, it is necessary to choose the batch sizes $b_k$. As this choice is left to us, we calculate the variance of our estimator $\hat\beta_0$ of $\beta_0$ explicitly (Lemma B.2), in order to minimize this variance over the selection of batch sizes subject to a sample budget $\alpha$:
$$ \min_{n, b_1, \dots, b_n} \operatorname{Var}(\hat\beta_0) \quad \text{s.t.} \quad \underbrace{\sum_{k=1}^n b_k}_{\text{samples used}} \le \alpha. \tag{6} $$
Since this optimization problem is very difficult to solve, we rephrase it in terms of the empirical distribution of batch sizes $\nu_n = \tfrac{1}{n}\sum_{i=1}^n \delta_{b_i}$. Optimizing over distributions is still difficult, but we explain in Section B.1 how to heuristically arrive at the parametrization
$$ \nu(b) \propto \exp\Big( -\lambda_1 \tfrac{1}{\sigma_b^2} - \lambda_2 b \Big), \qquad b \in \mathbb{N}, $$
where the parameters $\lambda_1, \lambda_2 \ge 0$ can then be used to optimize (6). Due to our usage of $\sigma_b^2$ this has to be bootstrapped.

Covariance estimation. While the variance estimates above ensure correct asymptotic learning rates, we motivated in Section 5.2 that asymptotic learning rates alone would result in too large step sizes at the beginning. We therefore use the estimates of $C(0)$ and $C'(0)$ to fit a covariance model, effectively acting as a gradient clipper while retaining the asymptotic guarantees. Note that covariance models with no more than two parameters are generally fully determined by these values.

6.2 Stochastic RFD (S-RFD)

It is reasonable to ask whether there is a "stochastic gradient descent"-like counterpart to the "gradient descent"-like RFD. The answer is yes, and we already have all the required machinery.

Extension 6.2 (S-RFD). For cost $\mathbf{J} \sim \mathcal{N}(\mu, C)$ and stochastic errors $\boldsymbol\epsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, C_\epsilon)$ we have
$$ \operatorname*{argmin}_d\, \mathbb{E}[\mathbf{J}(w - d) \mid \mathbf{L}_b(w), \nabla\mathbf{L}_b(w)] = \eta^*(\tilde\Theta)\, \frac{\nabla\mathbf{L}_b(w)}{\|\nabla\mathbf{L}_b(w)\|}, $$
with the same step size function $\eta^*$ as for RFD, but a modified gradient cost quotient
$$ \tilde\Theta = \frac{C'(0)}{C'(0) + \tfrac{1}{b} C_\epsilon'(0)} \cdot \frac{C(0) + \tfrac{1}{b} C_\epsilon(0)}{C(0)} \cdot \frac{\|\nabla\mathbf{L}_b(w)\|}{\mu - \mathbf{L}_b(w)}. $$
Note that our non-parametric covariance estimation already provides us with estimates of $C_\epsilon(0)$ and $C_\epsilon'(0)$, so no further adaptations are needed. The resulting asymptotic learning rate is given by
$$ h_\infty = \frac{C(0) + \tfrac{1}{b} C_\epsilon(0)}{\big(C'(0) + \tfrac{1}{b} C_\epsilon'(0)\big)\big(\mathbf{L}_b(w_\infty) - \mu\big)}. \tag{7} $$
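To make the bootstrapped WLS procedure of Section 6.1 concrete, here is a minimal sketch on synthetic data; the synthetic loss samples, the batch size range and the stopping rule are illustrative assumptions and only stand in for real mini-batch losses $\mathbf{L}_b(w)$ evaluated at randomly drawn parameters $w$.

```python
import numpy as np

# Sketch of the bootstrapped weighted least squares regression of Section 6.1:
# regress Z_b = (L_b(w) - mu)^2 on 1/b to estimate beta0 = C(0) and beta1 = C_eps(0),
# iterating the weights 1/sigma_b^2 with sigma_b^2 = 2*(beta0 + beta1/b)^2.
rng = np.random.default_rng(0)
true_b0, true_b1, mu = 0.5, 3.0, 2.7                  # synthetic ground truth, illustrative
batch_sizes = rng.integers(16, 2048, size=400)
var_b = true_b0 + true_b1 / batch_sizes
z = (rng.normal(mu, np.sqrt(var_b)) - mu) ** 2        # one loss sample Z_b per batch size

X = np.column_stack([np.ones_like(var_b), 1.0 / batch_sizes])   # design matrix [1, 1/b]
beta = np.array([np.mean(z), 0.0])                               # crude initial guess
for _ in range(20):                                               # bootstrap WLS until convergence
    fitted = np.clip(X @ beta, 1e-12, None)                       # current beta0 + beta1/b
    weights = 1.0 / (2.0 * fitted**2)                             # 1/sigma_b^2 from current beta
    W = np.diag(weights)
    beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
    if np.allclose(beta_new, beta, rtol=1e-6):
        break
    beta = beta_new
print(beta)   # approximately (C(0), C_eps(0)) = (0.5, 3.0) up to sampling noise
```

Each pass recomputes the weights $1/\sigma_b^2 = 1/(2(\beta_0 + \beta_1/b)^2)$ from the current estimate, which resolves the "need $\beta$ to obtain $\beta$" circularity described above.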
7 MNIST case study

For our case study we use the negative log likelihood loss to train a neural network [3, M7] on the MNIST dataset [34]. We chose this model as one of the simplest state-of-the-art models at the time of selection, consisting only of convolutional layers with ReLU activation interspersed with batch normalization layers and a single dense layer at the end with softmax activation. Assuming isotropy, we estimate $\mu$, $C(0)$ and $C'(0)$ as described in Section 6.1 and deduce the parameters $\sigma^2$ and $s$ of the respective covariance model (more details in Section B). We then use the step sizes listed in Table 1 for the squared exponential and rational quadratic covariance in our RFD algorithm.

In Figure 3, RFD is benchmarked against step size tuned Adam [29] and stochastic gradient descent (SGD). Even with early stopping, their tuning would typically require more than one epoch worth of samples, in contrast to RFD (Section A.1.1). We highlight that A-RFD performs significantly worse than either of the RFD versions, which effectively implement some form of learning rate warmup. This is despite the RFD learning rates converging to the asymptotic one within one epoch (ca. 30 out of 60 steps per epoch). The step sizes, on the other hand, are (up to noise) monotonically decreasing. This stands in contrast to the "wall next to the minimum" motivation of gradient clipping.

Code availability: Our implementation of RFD can be found at https://github.com/FelixBenning/pyrfd and the package can also be installed from PyPI via pip install pyrfd.

8 Limitations and extensions

To cover the vast amount of ground that lies between the formulation of a general average case optimization problem and the prototype of a working optimizer with theoretical backing,
1. we used the common [16, 52, 55, 10] isotropic and Gaussian distributional assumption for $\mathbf{J}$,
2. we used very simple covariance models for the actual implementation,
3. we used WLS in our variance estimation procedure despite the violation of independence.

Since RFD is defined as the minimizer of an average instead of an upper bound (making it more risk-affine), it naturally loses the improvement guarantee driving classical convergence proofs. It is therefore impossible to extend classical optimization proofs and new mathematical theory must be developed. This risk-affinity can also be observed in its comparatively large step sizes (cf. Fig. 3 and Sec. A). On CIFAR-100 [31], the step sizes were too large and it is an open question whether assumptions were violated or whether RFD is simply too risk-affine. But since the variance of random functions vanishes asymptotically in high dimension [5] we highly suspect the former (cf. Remark E.5). Future work will therefore have to target these assumptions. Some of the assumptions were already simplifications for the sake of exposition, and we deferred their relaxation to the appendix. The Gaussian assumption can be relaxed with a generalization to the BLUE (Sec. E.3), isotropy can be generalized to geometric anisotropies (Sec. E.1), and the risk-affinity of RFD can be reduced with confidence intervals (Sec. E.2). Since simple random linear models already violate stationary isotropy (Sec. F.1), we believe that stationarity is the most important assumption to attack in future work.
9 Conclusion

In this paper we have demonstrated the viability (computability and scalable complexity) and advantages (scale invariance, an explainable step size schedule which does not require expensive tuning) of replacing the classical convex function framework with the random function framework. Along the way we bridged the gap between Bayesian optimization (not scalable so far) and classical optimization methods (scalable). This theoretical framework not only sheds light on existing step size heuristics, but can also be used to develop future heuristics.

Figure 3: Training on the MNIST dataset (batch size 1024). Ribbons describe the range between the 10% and 90% quantile of 20 repeated experiments while lines represent their mean. SE stands for the squared exponential (11) and RQ for the rational quadratic (13) covariance. The validation loss uses the test data set, which provides a small advantage to Adam and SGD, as we also use it for tuning. [Panels: validation loss per epoch, final validation loss over learning rates, and learning rates/step sizes per step, for RFD(RQ(beta=1)), A-RFD, SGD(lr=1) and Adam(lr=1e-2).]

We envision the following improvements to RFD in the future:
1. The reliability of RFD can be improved by generalizing the distributional assumptions to cover more real world scenarios. In particular we are interested in the generalization to non-stationary isotropy, because we suspect that regularization such as weight and batch normalization [46, 25] is used to patch violations of stationarity (cf. Section F).
2. The performance of RFD can also be improved. Since RFD is forgetful while momentum methods retain some information, it is likely fruitful to relax the full forgetfulness. Furthermore, we suspect that adaptive learning rates [e.g. 14, 29], such as those used by Adam, can be incorporated with geometric anisotropies (cf. Sec. E.1). Performance could also be further improved by estimating the covariance (locally) online instead of globally at the start. Finally, the implementation itself can be made more performant.

Acknowledgements. We extend our sincere gratitude to our colleagues at the University of Mannheim, with special thanks to Rainer Gemulla and Julie Naegelen for insightful discussions and invaluable feedback. The experiments in this work were partially carried out on the compute cluster of the state of Baden-Württemberg (bwHPC).

References

[1] Robert J. Adler and Jonathan E. Taylor. Random Fields and Geometry. Springer Monographs in Mathematics. New York, NY: Springer New York, 2007. ISBN: 978-0-387-48112-8. DOI: 10.1007/978-0-387-48116-6.
[2] Apoorv Agnihotri and Nipun Batra. "Exploring Bayesian Optimization". In: Distill 5.5 (2020-05-05), e26. ISSN: 2476-0757. DOI: 10.23915/distill.00026.
[3] Sanghyeon An et al. An Ensemble of Simple Convolutional Neural Network Models for MNIST Digit Recognition. 2020-10-04. DOI: 10.48550/arXiv.2008.10400. Pre-published.
[4] Antonio Auffinger and Qiang Zeng. "Complexity of Gaussian Random Fields with Isotropic Increments". In: Communications in Mathematical Physics 402.1 (2023-08-01), pp. 951-993. ISSN: 1432-0916. DOI: 10.1007/s00220-023-04739-0.
[5] Felix Benning and Leif Döring. Gradient Span Algorithms Make Predictable Progress in High Dimension. 2024-10-13. DOI: 10.48550/arXiv.2410.09973. Pre-published.
[6] Karl Heinz Borgwardt. The Simplex Method: A Probabilistic Analysis.
Softcover reprint of the original 1st ed. 1987 edition. Berlin Heidelberg: Springer, 1986-11-01. 282 pp. ISBN: 978-3-540-17096-9. [7] Sebastien Bubeck and Mark Sellke. A Universal Law of Robustness via Isoperimetry . In: Advances in Neural Information Processing Systems. Vol. 34. Virtual Event: Curran Associates, Inc., 2021, pp. 28811 28822. ar Xiv: 2105 . 12806 [cs, stat]. URL: https : / / proceedings . neurips . cc / paper / 2021 / hash / f197002b9a0853eca5e046d9ca4663d5-Abstract.html (visited on 2023-09-22). [8] Youngmin Cho and Lawrence Saul. Kernel Methods for Deep Learning . In: Advances in Neural Information Processing Systems. Vol. 22. Curran Associates, Inc., 2009. URL: https : / / proceedings . neurips . cc / paper / 2009 / hash / 5751ec3e9a4feab575962e78e006250d-Abstract.html (visited on 2023-04-03). [9] Leonardo Cunha et al. Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime . In: Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, 2022-06-28, pp. 4474 4491. URL: https://proceedings.mlr.press/v162/cunha22a.html (visited on 2023-11-09). [10] Yann N Dauphin et al. Identifying and Attacking the Saddle Point Problem in High Dimensional Non-Convex Optimization . In: Advances in Neural Information Processing Systems. Vol. 27. Montréal, Canada: Curran Associates, Inc., 2014. URL: https://proceedings. neurips.cc/paper/2014/hash/17e23e50bedc63b4095e3d8204ce063b-Abstract. html (visited on 2022-06-10). [11] Aaron Defazio and Konstantin Mishchenko. Learning-Rate-Free Learning by D-Adaptation . In: Proceedings of the 40th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, 2023-07-03, pp. 7449 7479. ar Xiv: 2301.07733 [cs.LG]. URL: https://proceedings.mlr.press/v202/defazio23a.html (visited on 2024-03-31). [12] Percy Deift and Thomas Trogdon. The Conjugate Gradient Algorithm on Well-Conditioned Wishart Matrices Is Almost Deterministic . In: Quarterly of Applied Mathematics 79.1 (202103), pp. 125 161. ISSN: 0033-569X, 1552-4485. DOI: 10.1090/qam/1574. ar Xiv: 1901. 09007 [cs, math]. [13] P. Deuflhard and G. Heindl. Affine Invariant Convergence Theorems for Newton s Method and Extensions to Related Methods . In: SIAM Journal on Numerical Analysis 16.1 (1979-02), pp. 1 10. ISSN: 0036-1429. DOI: 10.1137/0716001. [14] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization . In: The Journal of Machine Learning Research 12 (2011-07-01), pp. 2121 2159. ISSN: 1532-4435. [15] Ahmed El Alaoui, Andrea Montanari, and Mark Sellke. Optimization of Mean-Field Spin Glasses . In: The Annals of Probability 49.6 (2021-11), pp. 2922 2960. DOI: 10.1214/21AOP1519. [16] Peter I. Frazier. Bayesian Optimization . In: Recent Advances in Optimization and Modeling of Contemporary Problems. INFORMS Tut ORials in Operations Research. Phoenix, Arizona, USA: INFORMS, 2018-10, pp. 255 278. ISBN: 978-0-9906153-2-3. DOI: 10.1287/educ. 2018.0188. ar Xiv: 1807.02811 [cs, math, stat]. [17] Fuchang Gao and Lixing Han. Implementing the Nelder-Mead Simplex Algorithm with Adaptive Parameters . In: Computational Optimization and Applications 51.1 (2012-01-01), pp. 259 277. ISSN: 1573-2894. DOI: 10.1007/s10589-010-9329-3. [18] Xavier Glorot and Yoshua Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks . 
In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Sardinia, Italy: JMLR Workshop and Conference Proceedings, 2010-03-31, pp. 249 256. URL: https://proceedings.mlr.press/v9/ glorot10a.html (visited on 2023-04-11). [19] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016-11-10. 801 pp. ISBN: 978-0-262-33737-3. Google Books: omiv DQAAQBAJ. [20] Priya Goyal et al. Accurate, Large Minibatch SGD: Training Image Net in 1 Hour. ar Xiv, 201804-30. ar Xiv: 1706.02677 [cs]. URL: http://arxiv.org/abs/1706.02677 (visited on 2024-04-02). [21] Anders Hansson. Optimization for Learning and Control. Hoboken, New Jersey: John Wiley & Sons, Inc., 2023. ISBN: 978-1-119-80914-2. [22] Geoffrey Hinton. Neural Networks for Machine Learning . Massive Open Online Course (Coursera). 2012. URL: https://www.cs.toronto.edu/~hinton/coursera_lectures. html (visited on 2021-11-16). [23] C. A. R. Hoare. Quicksort . In: The Computer Journal 5.1 (1962-01-01), pp. 10 16. ISSN: 0010-4620. DOI: 10.1093/comjnl/5.1.10. [24] Brice Huang and Mark Sellke. Tight Lipschitz Hardness for Optimizing Mean Field Spin Glasses . In: 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). 2022-10, pp. 312 322. DOI: 10.1109/FOCS54457.2022.00037. [25] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift . In: Proceedings of the 32nd International Conference on Machine Learning. International Conference on Machine Learning. PMLR, 2015-06-01, pp. 448 456. ar Xiv: 1502.03167 [cs]. URL: https://proceedings.mlr. press/v37/ioffe15.html (visited on 2021-10-06). [26] E. T. Jaynes. Information Theory and Statistical Mechanics . In: Physical Review 106.4 (1957-05-15), pp. 620 630. DOI: 10.1103/Phys Rev.106.620. [27] Richard Arnold Johnson and Dean W. Wichern. Applied Multivariate Statistical Analysis. 6th ed. Upper Saddle River, N.J: Pearson College Div, 2007. 767 pp. ISBN: 978-0-13-1877153. [28] Steven M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. USA: Prentice-Hall, Inc., 1993-02. 595 pp. ISBN: 978-0-13-345711-7. [29] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization . In: Proceedings of the 3rd International Conference on Learning Representations. ICLR. San Diego, 2015. ar Xiv: 1412.6980. [30] Achim Klenke. Probability Theory: A Comprehensive Course. Universitext. London: Springer, 2014. ISBN: 978-1-4471-5360-3 978-1-4471-5361-0. DOI: 10.1007/978-1-4471-5361-0. [31] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009. URL: https: //www.cs.toronto.edu/%20kriz/learning-features-2009-TR.pdf (visited on 2024-05-21). [32] H. J. Kushner. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise . In: Journal of Basic Engineering 86.1 (1964-03-01), pp. 97 106. ISSN: 0021-9223. DOI: 10.1115/1.3653121. [33] Jonathan Lacotte and Mert Pilanci. Optimal Randomized First-Order Methods for Least Squares Problems . In: Proceedings of the 37th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, 2020-11-21, pp. 5587 5597. URL: https://proceedings.mlr.press/v119/lacotte20a.html (visited on 2023-11-09). [34] Yann Le Cun, Corinna Cortes, and Christopher J.C. 
Burges. THE MNIST DATABASE of Handwritten Digits. 2010. URL: http://yann.lecun.com/exdb/mnist/ (visited on 2024-01-24). [35] Daniel James Lizotte. Practical Bayesian Optimization . Ph D thesis. Alberta, Canada: University of Alberta, 2008. 168 pp. [36] G. Matheron. The Intrinsic Random Functions and Their Applications . In: Advances in Applied Probability 5.3 (1973-12), pp. 439 468. ISSN: 0001-8678, 1475-6064. DOI: 10.2307/ 1425829. [37] Andrea Montanari. Optimization of the Sherrington Kirkpatrick Hamiltonian . In: SIAM Journal on Computing (2021-01-07), FOCS19 1. ISSN: 0097-5397. DOI: 10.1137/20M132016X. [38] Yurii Evgen eviˇc Nesterov. Lectures on Convex Optimization. Second edition. Springer Optimization and Its Applications; Volume 137. Cham: Springer, 2018. ISBN: 978-3-319-91578-4. DOI: 10.1007/978-3-319-91578-4. [39] Misha Padidar et al. Scaling Gaussian Processes with Derivative Information Using Variational Inference . In: Advances in Neural Information Processing Systems. Vol. 34. Curran Associates, Inc., 2021, pp. 6442 6453. URL: https://proceedings.neurips.cc/paper/2021/ hash/32bbf7b2bc4ed14eb1e9c2580056a989-Abstract.html (visited on 2023-05-17). [40] Courtney Paquette et al. Halting Time Is Predictable for Large Models: A Universality Property and Average-Case Analysis . In: Foundations of Computational Mathematics 23.2 (2022-02), pp. 597 673. ISSN: 1615-3383. DOI: 10.1007/s10208-022-09554-y. ar Xiv: 2006.04299 [math, stat]. [41] Elliot Paquette and Thomas Trogdon. Universality for the Conjugate Gradient and MINRES Algorithms on Sample Covariance Matrices . In: Communications on Pure and Applied Mathematics 76.5 (2022-09-01), pp. 1085 1136. ISSN: 1097-0312. DOI: 10.1002/cpa. 22081. [42] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the Difficulty of Training Recurrent Neural Networks . In: Proceedings of the 30th International Conference on Machine Learning. International Conference on Machine Learning. Atlanta: PMLR, 2013-05-26, pp. 1310 1318. URL: https://proceedings.mlr.press/v28/pascanu13.html (visited on 2024-04-02). [43] Fabian Pedregosa and Damien Scieur. Acceleration through Spectral Density Estimation . In: Proceedings of the 37th International Conference on Machine Learning. International Conference on Machine Learning. Virtual Event (formerly Vienna): PMLR, 2020-11-21, pp. 7553 7562. URL: https://proceedings.mlr.press/v119/pedregosa20a.html (visited on 2023-11-09). [44] Carl Edward Rasmussen and Christopher K.I. Williams. Gaussian Processes for Machine Learning. 2nd ed. Adaptive Computation and Machine Learning 3. Cambridge, Massachusetts: MIT Press, 2006. 248 pp. ISBN: 0-262-18253-X. URL: http://gaussianprocess.org/ gpml/chapters/RW.pdf (visited on 2023-04-06). [45] Filip de Roos, Alexandra Gessner, and Philipp Hennig. High-Dimensional Gaussian Process Inference with Derivatives . In: Proceedings of the 38th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, 2021-07-01, pp. 2535 2545. URL: https://proceedings.mlr.press/v139/de-roos21a.html (visited on 2023-05-15). [46] Tim Salimans and Durk P Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks . In: Advances in Neural Information Processing Systems. Vol. 29. Barcelona, Spain: Curran Associates, Inc., 2016. URL: https : / / proceedings . neurips . cc / paper / 2016 / hash / ed265bc903a5a097f61d3ec064d96d2e-Abstract.html (visited on 2023-10-16). [47] Bobak Shahriari et al. 
"Taking the Human Out of the Loop: A Review of Bayesian Optimization". In: Proceedings of the IEEE 104.1 (2016-01), pp. 148-175. ISSN: 1558-2256. DOI: 10.1109/JPROC.2015.2494218.
[48] Leslie N. Smith. A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 - Learning Rate, Batch Size, Momentum, and Weight Decay. 2018-04-24. DOI: 10.48550/arXiv.1803.09820. arXiv: 1803.09820 [cs, stat]. Pre-published.
[49] Leslie N. Smith. "Cyclical Learning Rates for Training Neural Networks". In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). 2017-03, pp. 464-472. DOI: 10.1109/WACV.2017.58.
[50] Michael L. Stein. Interpolation of Spatial Data. Springer Series in Statistics. New York, NY: Springer, 1999. ISBN: 978-1-4612-7166-6, 978-1-4612-1494-6. DOI: 10.1007/978-1-4612-1494-6.
[51] Eliran Subag. "Following the Ground States of Full-RSB Spherical Spin Glasses". In: Communications on Pure and Applied Mathematics 74.5 (2021), pp. 1021-1044. ISSN: 1097-0312. DOI: 10.1002/cpa.21922.
[52] Ziyu Wang et al. "Bayesian Optimization in a Billion Dimensions via Random Embeddings". In: Journal of Artificial Intelligence Research 55 (2016-02-19), pp. 361-387. ISSN: 1076-9757. DOI: 10.1613/jair.4806.
[53] C.K.I. Williams and D. Barber. "Bayesian Classification with Gaussian Processes". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.12 (1998-12), pp. 1342-1351. ISSN: 1939-3539. DOI: 10.1109/34.735807.
[54] Christopher K. I. Williams. "Computation with Infinite Neural Networks". In: Neural Computation 10.5 (1998-07-01), pp. 1203-1216. ISSN: 0899-7667. DOI: 10.1162/089976698300017412.
[55] Jian Wu et al. "Bayesian Optimization with Gradients". In: Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/64a08e5f1e6c39faeb90108c430eb120-Abstract.html (visited on 2022-06-02).
[56] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. 2017-09-15. DOI: 10.48550/arXiv.1708.07747. arXiv: 1708.07747 [cs, stat]. Pre-published.
[57] Matthew D. Zeiler. "ADADELTA: An Adaptive Learning Rate Method". 2012-12-22. arXiv: 1212.5701 [cs].
[58] Guodong Zhang et al. "Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model". In: Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc., 2019. arXiv: 1907.04164 [cs, stat]. URL: https://proceedings.neurips.cc/paper/2019/hash/e0eacd983971634327ae1819ea8b6214-Abstract.html (visited on 2023-11-09).

Figure 4: Visualization of the variance estimation (Section 6.1) with 95%-confidence intervals based on the assumed distribution. Quantile-quantile (QQ) plots of the losses (against a normal distribution), squared losses (against a chi-squared(1) distribution) and squared gradient norms (against a chi-squared(d) distribution) are displayed on the right for a selection of batch sizes. [Panels: batch-wise mean losses (estimated mean 2.73), batch-wise mean squared losses and batch-wise mean squared gradient norms plotted against 1/b, each with QQ plots for batch sizes b = 10, 100, 1000.]
Appendix: Random Function Descent A Experiments A.1 Covariance estimation In Figure 4 we visualize weighted least squares (WLS) regression of the covariance estimation from Section 6.1. Note, that we sampled much more samples per batch size for these plots than RFD would typically require by itself in order to be able to plot batch-wise means and batch-wise QQ-plots. The batch size distribution we described in Section B.1 would avoid sampling the same batch size multiple times to ensure better stability of the regression and generally requires much fewer samples than were used for this visualization (cf. A.1.1) We can observe from the QQ-plots on the right, that the Gaussian assumption is essentially justified for the losses, resulting in a χ2(1) distribution for the squared losses and a χ2(d) distribution for the gradient norms squared. The confidence interval estimate for the squared norms appears to be much too small (it is plotted, but too small to be visible). Perhaps this signifies a violation of the isotropy 10.0 12.5 15.0 17.5 20.0 22.5 25.0 Asymptotic learning rate 0 100000 200000 300000 400000 500000 Sample cost Figure 5: 20 repeated covariance estimations of model M7 [3] applied to the MNIST dataset. On the left are the resulting asymptotic learning rates (assuming a final loss of zero) and on the right are the samples used until the stopping criterion interrupted sampling. assumption as the variance of i=1 ( i Lb(w))2 does not appear to be the variance of independent χ2(d) Gaussian random variables, and the independence only follows from the isotropy assumption. A.1.1 Sampling efficiency and stability To evaluate the sampling efficiency and stability of our variance estimation process, we repeated the covariance estimation of the model model M7 [3] applied to the MNIST dataset 20 times (Figure 5). We used a tolerance of tol = 0.3 as a stopping criterion for the estimated relative standard deviation (10). At this tolerance, the asymptotic learning rate already seems relatively stable (in the same order of magnitude) and the sample cost is quite cheap. The majority of runs (16/20 runs or 80%) required less than 60 000 samples (1 epoch). There was one large outlier which used 500 589 samples. A closer inspection revealed, that after the initial sample to estimate the optimal batch size distribution, it sampled almost exclusively at batch sizes 20 (which was the minimal cutoff to avoid instabilities caused by batch normalization) and batch sizes between 1700 and 1900. It therefore seems like the initial batch of samples caused a very unfavorable batch size distribution which then required a lot of samples to recover from. Our selection of an initial sample size of 6000 might therefore have been too small. A more extensive empirical study is needed to tune this estimation process, but the process promises to be very sample efficient. Classical step size tuning would train models for a short duration in order to evaluate the performance of a particular learning rate [e.g. 48], but a single epoch worth of samples is very hard to beat. Our implementation of this process on the other hand is very inefficient as of writing. Piping data of differing batch sizes into a model is not a standard use case. We implement this by repeatedly initializing data loaders, which is anything but performance friendly. A.2 Other models and datasets To estimate the effect of the batch size on RFD, we trained the same model (M7 [3]) on MNIST with batch size 128 (Figure 6). 
We can see that the asymptotic learning rate of S-RFD is reduced at a smaller batch size (cf. Equation 7) but the performance is barely different. Overall, RFD seems to be slightly too risk-affine, selecting larger step sizes than the tuned SGD models. We also trained a different model (M5 [3]) on the Fashion MNIST dataset [56] with batch size 128 (Figure 7). Since the validation loss increases after epoch 5, early stopping would have been 0 5 10 15 20 25 30 Epoch Validation loss SGD(lr=0.1) Adam(lr=1e-3) 10 4 10 3 10 2 10 1 100 101 Learning rate Final validation loss 0 20 40 60 80 100 Step Learning rate Learning rate 0 2000 4000 6000 8000 10000 12000 14000 Step SGD(lr=0.1) Adam(lr=1e-3) Figure 6: Training model M7 [3] with batch size 128 on MNIST [34]. appropriate. We therefore include Adam with learning rate 10 3, despite Adam with learning rate 10 4 technically performing better at the end of training. We can generally see, that RFD comes very close to tuned performance at the time early stopping would have been appropriate. Again, learning rates seem to be slightly too large (risk-affine) in comparison to tuned SGD. B Variance estimation in detail Recall that we are interested in the regression Zb(w) = (Lb(w) µ)2 β0 + 1 where the variance of Zb is given by σ2 b = 2(β0 + 1 under the Gaussian assumption on Lb. More specifically we for minibatch sizes (bk)k n and parameter vectors (wk)k n we want to sample mini batch losses L(k) := Lbk(wk) = J(wk) + 1 i=1 ϵk,i(wk) As the ϵk,i are all conditionally independent and therefore uncorrelated, we have Cov(L(k), L(l)) = Cov(J(wk), J(wl)) = C wk wl 2 Since the covariance kernel C is typically monotonously falling in the distance of parameters wk wl 2, we want to select them as spaced out as possible to minimize the covariance of L(k) (which is the next best thing to iid samples). Randomly selecting wi with Glorot initialization [18] will ensure a good spread. Note that Glorot initialization places all parameters approximately on the same sphere. This is because Glorot initialization initializes all parameters independently, therefore their norm is the 0 5 10 15 20 25 30 Epoch Validation loss SGD(lr=0.1) Adam(lr=1e-3) Adam(lr=1e-4) 10 4 10 3 10 2 10 1 100 101 Learning rate Final validation loss 0 20 40 60 80 100 Step Learning rate Learning rate 0 2000 4000 6000 8000 10000 12000 14000 Step SGD(lr=0.1) Adam(lr=1e-3) Adam(lr=1e-4) Figure 7: Model M5 [3] trained on Fashion MNIST [56] with batch size 128. sum of independent squares, which converges by the law of large numbers due to the normalization Glorot uses. Since stationary isotropy and non-stationary isotropy coincides on the sphere, this is an important effect to consider (cf. Section F). What is left, is the selection of the batch sizes bk. B.1 Batch size distribution Since we plan to use the data set ( 1 bk , L(k))k n for weighted least squares (WLS) regression and do not have a selection process for the batch sizes bk yet, it might be appropriate to select the batch sizes bk in such a way, that the variance of our estimator ˆβ0 of β0 is minimized. Here we choose Var(ˆβ0) and not Var(ˆβ1) as our optimization target, since β0 = C(0) is used to fit the covariance model, while β1 = Cϵ(0) is only required for S-RFD. Without deeper analysis β0 therefore seems to be more important. 
Optimization over n parameters bk is quite difficult, but we can simplify this optimization problem by considering the empirical batch size distribution Using a random variable B distributed according to νn, the total number of sample losses can then be expressed as k=1 bk = n E[B] = samples used. Under an (unrealistic) independence assumption, the variance Var(ˆβ0) also has a simple representation in terms of νn (Lemma B.2). We now want to minimize this variance subject to compute constraint α limiting the number of sample losses we can use resulting in the optimization problem Var( ˆβ0) = 1 σ2 B ] | {z } variance of ZB E[ 1 σ2 BB2 ] E h 1 σ2 B 1 B E[ 1 Bσ2 BE[1/σ2 B]] 2i | {z } inverse of the spread of 1 s.t. n E[B] α. (8) where we recall that σ2 B is the variance of ZB. So the inverse of the expectation of its inverse is roughly the average variance of ZB. The second half is the fraction of a weighted second moment divided by the weighted variance. Unless the mean is at zero, the former will be larger. In particular we want a spread of data otherwise the variance would be zero. This is in some conflict with the variance of ZB. But first, let us get rid of n. Note that we would always increase n until our compute budget is used up, since this always reduces variance. So we approximately have n E[B] = α. Thus Var( ˆβ0) = E[B] E[ 1 σ2 BB2 ] E h 1 σ2 B 1 B E[ 1 Bσ2 BE[1/σ2 B]] 2i Since α is now just resulting in a constant factor, it can be assumed to be 1 without loss of generality. Over batch size distributions ν we therefore want to solve the minimization problem min ν E[B] E[ 1 σ2 B ] | {z } moments E[ 1 σ2 BB2 ] E h 1 σ2 B 1 B E[ 1 Bσ2 BE[1/σ2 B]] 2i | {z } spread Example B.1 (If we did not require spread). If we were not concerned with the variance of batch sizes, we could select a constant B = b. Then it is straightforward to minimize the moments factor manually min b E[b] E[ 1 σ2 b ] = bσ2 b = 2b(β0 + 1 resulting in 1 β1 . In other words: If we did not have to be concerned with the spread of B there is one optimal selection to minimize the first factor. But in reality we have to trade-off this target with the spread of B. To ensure a good spread of data, we use the maximum entropy distribution for B, with the moment constraints n average sample usage σ2 B ] θ ZB variance which capture the first factor. Maximizing entropy under moment constraints is known [26] to result in the Boltzmann (a.k.a. Gibbs) distribution ν(b) = P(B = b) exp λ1 1 σ2 b λ2b , where λ1, λ2 depend on the momentum constraints. We can now forget the origin of this distribution and use λ1, λ2 as parameters for the distribution ν in Equation (9) to get close to its minimum. In practice we use a zero order black box optimizer (Nelder-Mead [17]). One could calculate the expectations of (9) under this distribution explicitly and take manual derivatives with respect to λi to investigate this further, but we wanted to avoid getting too distracted by this tangent. We also use the estimated relative standard deviation as a stopping criterion for sampling. Without extensive testing we found a tolerance of rel_std < tol = 0.3 to be reasonable, cf. Section A.1.1. Lemma B.2 (Variance of ˆβ0 in terms of the empirical batch size distribution). Assuming independence of the samples (( 1 bk ), Zbk)k n, the variance of ˆβ0 is given by Var( ˆβ0) = 1 E[ 1 σ2 BB2 ] E h 1 σ2 B 1 B E[ 1 Bσ2 BE[1/σ2 B]] 2i where B is distributed according to the empirical batch size distribution νn = 1 n Pn k=1 δbk. Proof. 
With the notation $\sigma_k^2 = \sigma_{b_k}^2$ for the variance of $Z_{b_k}$, it follows from [cf. 28, Thm. 4.2] that the covariance of the estimator $\hat\beta$ of $\beta$ using $n$ samples is given by
$$\operatorname{Var}(\hat\beta) = (H^T C^{-1} H)^{-1}
= \frac{1}{\sum_k \frac{1}{\sigma_k^2}\sum_k \frac{1}{\sigma_k^2 b_k^2} - \bigl(\sum_k \frac{1}{\sigma_k^2 b_k}\bigr)^2}
\begin{pmatrix} \sum_k \frac{1}{\sigma_k^2 b_k^2} & -\sum_k \frac{1}{\sigma_k^2 b_k} \\ -\sum_k \frac{1}{\sigma_k^2 b_k} & \sum_k \frac{1}{\sigma_k^2} \end{pmatrix},
\qquad
C = \begin{pmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2 \end{pmatrix},
\quad
H = \begin{pmatrix} 1 & \frac{1}{b_1} \\ \vdots & \vdots \\ 1 & \frac{1}{b_n} \end{pmatrix}.$$
In particular we have
$$\operatorname{Var}(\hat\beta_0) = \frac{\sum_k \frac{1}{\sigma_k^2 b_k^2}}{\sum_k \frac{1}{\sigma_k^2}\sum_k \frac{1}{\sigma_k^2 b_k^2} - \bigl(\sum_k \frac{1}{\sigma_k^2 b_k}\bigr)^2}.$$
With the help of $\theta := \sum_j \frac{1}{\sigma_j^2}$ and $\lambda_k := \frac{1}{\sigma_k^2\theta}$ we can reorder the divisor. For this note that, since the $\lambda_k$ sum to 1, we have
$$\sum_k \lambda_k \Bigl(\frac{1}{b_k} - \sum_j \lambda_j \frac{1}{b_j}\Bigr)^2 = \sum_k \lambda_k \frac{1}{b_k^2} - \Bigl(\sum_j \lambda_j \frac{1}{b_j}\Bigr)^2,$$
which is essentially the well known statement $\mathbb{E}[(Y - \mathbb{E}[Y])^2] = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2$ for an appropriate selection of $Y$. This implies that our divisor is given by a weighted variance,
$$\sum_k \frac{1}{\sigma_k^2}\sum_k \frac{1}{\sigma_k^2 b_k^2} - \Bigl(\sum_k \frac{1}{\sigma_k^2 b_k}\Bigr)^2 = \theta^2 \sum_k \lambda_k \Bigl(\frac{1}{b_k} - \sum_j \lambda_j \frac{1}{b_j}\Bigr)^2,$$
where it is only necessary to plug in the definition of $\theta$ to see that the right term is exactly our divisor. Expanding both the numerator and the divisor by $\frac{1}{n}$, we obtain
$$\operatorname{Var}(\hat\beta_0) = \frac{\frac{1}{n}\sum_k \frac{1}{\sigma_k^2 b_k^2}}{\frac{\theta^2}{n}\sum_k \lambda_k \bigl(\frac{1}{b_k} - \sum_j \lambda_j \frac{1}{b_j}\bigr)^2}.$$
Since $\theta = n\,\mathbb{E}[1/\sigma_B^2]$ for $B \sim \frac{1}{n}\sum_{k=1}^n \delta_{b_k}$ and $\lambda_k = \frac{1}{n\sigma_k^2\,\mathbb{E}[1/\sigma_B^2]}$, the above can thus be written as
$$\operatorname{Var}(\hat\beta_0) = \frac{1}{n}\;\frac{1}{\mathbb{E}[1/\sigma_B^2]}\;\frac{\mathbb{E}\bigl[\tfrac{1}{\sigma_B^2 B^2}\bigr]}{\mathbb{E}\Bigl[\tfrac{1}{\sigma_B^2}\bigl(\tfrac{1}{B} - \tfrac{\mathbb{E}[1/(B\sigma_B^2)]}{\mathbb{E}[1/\sigma_B^2]}\bigr)^2\Bigr]},$$
which proves our claim.

C Covariance models

In this section we calculate the step sizes of the covariance models listed in Table 1 and plotted in Figure 2. Additionally we calculate the asymptotic learning rate of A-RFD and prove an assumption of Corollary 5.3 for the squared exponential covariance (Prop. C.3).

C.1 Squared exponential

The squared exponential covariance function is given by
$$C\Bigl(\frac{\|x-y\|^2}{2}\Bigr) = \sigma^2 \exp\Bigl(-\frac{\|x-y\|^2}{2s^2}\Bigr). \tag{11}$$
Note that $\sigma^2$ will play no role in the step sizes of RFD due to its scale invariance (cf. Advantage 2.3).

Theorem C.1. Let $J \sim \mathcal{N}(\mu, C)$ where $C$ is the squared exponential covariance function (11). Then we have
$$\eta^* \frac{\nabla J(w)}{\|\nabla J(w)\|} = \operatorname*{argmin}_d\; \mathbb{E}[J(w - d) \mid J(w), \nabla J(w)]$$
with RFD step size
$$\eta^* = \frac{s^2\|\nabla J(w)\|}{\sqrt{\bigl(\tfrac{\mu - J(w)}{2}\bigr)^2 + s^2\|\nabla J(w)\|^2} + \tfrac{\mu - J(w)}{2}}.$$
Proof. The covariance function $C$ is of the form $C(h) = \sigma^2 e^{-h/s^2}$. By Equation (2),
$$\eta^* = \operatorname*{argmin}_\eta\; -\frac{C(\tfrac{\eta^2}{2})}{C(0)} - \eta\,\frac{C'(\tfrac{\eta^2}{2})}{C'(0)}\,\Theta, \qquad \text{where } \Theta = \frac{\|\nabla J(w)\|}{\mu - J(w)}.$$
We calculate
$$-\frac{C(\tfrac{\eta^2}{2})}{C(0)} - \eta\,\frac{C'(\tfrac{\eta^2}{2})}{C'(0)}\,\Theta = -e^{-\frac{\eta^2}{2s^2}}(1 + \eta\Theta).$$
This results in the first order condition
$$0 = \frac{d}{d\eta}\Bigl[-e^{-\frac{\eta^2}{2s^2}}(1 + \eta\Theta)\Bigr] = \frac{e^{-\frac{\eta^2}{2s^2}}}{s^2}\bigl(\eta^2\Theta + \eta - s^2\Theta\bigr).$$
Since the exponential can never be zero, we have to solve a quadratic equation. Its positive solution results in
$$\eta^*(\Theta) = \sqrt{\Bigl(\frac{1}{2\Theta}\Bigr)^2 + s^2} - \frac{1}{2\Theta}. \tag{12}$$
At this point we could stop, but the result is numerically unstable as it suffers from catastrophic cancellation. To solve this issue we set $x = \frac{1}{2\Theta}$ and reorder
$$\sqrt{x^2 + s^2} - x = \frac{(\sqrt{x^2 + s^2} - x)(\sqrt{x^2 + s^2} + x)}{\sqrt{x^2 + s^2} + x} = \frac{s^2}{\sqrt{x^2 + s^2} + x}.$$
Re-substituting $x = \frac{1}{2\Theta} = \frac{\mu - J(w)}{2\|\nabla J(w)\|}$, we finally get
$$\eta^* = \frac{s^2}{\sqrt{x^2 + s^2} + x} = \frac{s^2\|\nabla J(w)\|}{\sqrt{\bigl(\tfrac{\mu - J(w)}{2}\bigr)^2 + s^2\|\nabla J(w)\|^2} + \tfrac{\mu - J(w)}{2}}.$$
Proposition C.2 (A-RFD for the Squared Exponential Covariance). If $J$ is isotropic with squared exponential covariance (11), then the step size of A-RFD is given by
$$\hat\eta = \frac{s^2\|\nabla J(w)\|}{\mu - J(w)}.$$
Proof. By Definition 5.1 of A-RFD and $\Theta = \frac{\|\nabla J(w)\|}{\mu - J(w)}$ we have
$$\hat\eta(\Theta) = -\frac{C(0)}{C'(0)}\,\Theta = \frac{\sigma^2\exp(0)}{\tfrac{\sigma^2}{s^2}\exp(0)}\;\frac{\|\nabla J(w)\|}{\mu - J(w)} = \frac{s^2\|\nabla J(w)\|}{\mu - J(w)}.$$
Proposition C.3. If $J$ is isotropic with squared exponential covariance (11), then the RFD step sizes are strictly monotonically increasing in $\Theta$.
Proof. Since we know that $\Theta \to 0$ implies $\eta^* \sim \hat\eta \to 0$, strict monotonicity of $\eta^*$ in $\Theta$ is sufficient to show that $\eta^* \to 0$ also implies $\Theta \to 0$. So we take the derivative of (12), resulting in
$$\frac{d}{d\Theta}\eta^*(\Theta) = \frac{1}{2\Theta^2}\Bigl(1 - \frac{1}{\sqrt{1 + 4s^2\Theta^2}}\Bigr),$$
which is greater than zero for all $\Theta > 0$.

C.2 Rational quadratic

The rational quadratic covariance function is given by
$$C\Bigl(\frac{\|x-y\|^2}{2}\Bigr) = \sigma^2\Bigl(1 + \frac{\|x-y\|^2}{\beta s^2}\Bigr)^{-\beta/2}, \qquad \beta > 0.$$
(13) It can be viewed as a scale mixture of the squared exponential and converges to the squared exponential in the limit β [44, p. 87]. Theorem C.4 (Rational Quadratic). For J N(µ, C) where C is the rational quadratic covariance we have for Θ = J(w) µ J(w) 0 that the RFD step size is given by sΘ η + (1 + β)η2 + β The unique root of the polynomial in η can be found either directly with a formula for polynomials of third degree (e.g. using Cardano s method) or by bisection as it is contained in [0, 1/ 1 + β]. Proof. By Theorem 4.2 we have η = argmin η C η2 C(0) η C η2 for C(x) = σ2(1 + 2x βs2 ) β/2. We therefore need to minimize β/2 η 1 + η2 Substitute in η := η βs, then the first order condition is 0 != f ( η) = d d η 1 + η2 β/2 + p βs η 1 + η2 β/2 1 Θ Dividing both sides by βsΘ we get 0 = f ( η) βsΘ = β 2 (1 + η2) β 2 12 η 1 βsΘ + (1 + η2) β 2 2 h 1 + η2 ( β 2 + 1)2 η2i = (1 + η2) β 2 2 h β η 1 βsΘ(1 + η2) [1 η2(1 + β)] i sΘ η+(1+β) η2+ β Since Θ 0 and β > 0 all coefficients of the polynomial are positive except for the shift. The polynomial thus starts out at 1 in zero and only increases from there. Therefore there exists a unique positive critical point which is a minimum. At the point η = 1 + β the quadratic term is already larger than 1 so the polynomial is positive and we have passed the root. The minimum is therefore contained in the interval [0, 1 + β]. After finding the minimum in η we return to η by multiplication with βs. Proposition C.5 (A-RFD for the Rational Quadratic Covariance). If J is isotropic with rational quadratic covariance (13), then the step size of A-RFD is given by µ J(w) J(w) . Proof. C(x) = σ2(1 + 2x βs2 ) β/2 implies by Definition 5.1 of A-RFD and Θ = J(w) ˆη(Θ) = CJ(0) C J(0)Θ = σ2(1 + 0) β/2 s2 (1 + 0) β/2 1 J(x) µ J(x) = s2 J(x) Definition C.6. The Matérn model parametrized by s > 0, ν 0, σ2 0 is given by 2 = σ2 21 ν where Kν is the modified Bessel function. For ν = p + 1 2 with p N0, it can be simplified [cf. 44, sec. 4.2.1] to (2p k)! (p k)!k! 2 The Matérn model encompasses Rasmussen and Williams [44] the nugget effect for ν = 0 (independent randomness) the exponential model for ν = 1 2 (Ornstein-Uhlenbeck process) the squared exponential model for ν with the same scale s and variance σ2. The random functions induced by the Matérn model are a.s. ν -times differentiable Rasmussen and Williams [44], i.e. the smoothness of the model increases with increasing ν. While the exponential covariance model with ν = 1 2 results in a random function which is not yet differentiable, larger ν result in increasing differentiability. As differentiability starts with ν = 3 2 and we have a more explicit formula for ν = p + 1 2 the cases ν = 3 2 and ν = 5 2 are of particular interest. [F]or ν 7/2, in the absence of explicit prior knowledge about the existence of higher order derivatives, it is probably very hard from finite noisy training examples to distinguish between values of ν 7/2 (or even to distinguish between finite values of ν and ν , the smooth squared exponential, in this case) [44, p. 85]. Theorem C.7. Assuming J N(µ, C) is a random function where C is the Matérn covariance such that ν = p + 1 2 with p {1, 2}. Then the RFD step is given for Θ := J(w) µ J(w) 0 by 5 (1 ζ) + p 4 + (1 + ζ)2 2(1 + ζ) ζ := Proof. We define C(η) := C( η2 2 ), which implies C (η) = C η2 or conversely η C (η). (15) By Theorem 4.2, we need to calculate η = argmin η C( η2 2 ) C(0) η C ( η2 2 ) C (0) Θ. (16) Discarding σ w.l.o.g. 
due to scale invariance (Advantage 2.3), we have in the case p = 1 The derivative is then given by which implies using (15) C η2 3 s η (17) We therefore need to minimize (16) which is given by argmin η 1 + 3 s η η exp 3 s η Θ = argmin η 1 + ( 3 s + Θ)η exp The first order condition is 3 s + Θ)η ( which (divided by Θ and noting that the exponential can never be zero) is equivalent to 3 sΘ + 1)η 1 reordering for η implies It is also not difficult to see that this is the point where the derivative switches from negative to positive (i.e. a minimum). Let us now consider the case p = 2, i.e. 5 s η + 5 3s2 η2 exp which results in C (η) = 5 5 s η , i.e. by (15) C η2 5 s η . (18) We therefore need to minimize (16) which is given by 1 + 5 s η + 5 3s2 η2 η 1 + 5 s +Θ η+ 5 The first order condition results in 5 s + Θ η + 5 5 s + Θ + 2 5 Dividing everything by Θ and using ζ := 5 3sΘ we get 5 s η + ζ + 1 Taking a closer look at the sign changes of the derivative it becomes obvious, that the positive root is the minimum, i.e. 5 s η != (1 ζ) + p (1 ζ)2 + 4(1 + ζ) 2(1 + ζ) = (1 ζ) + p 4 + (1 + ζ)2 Proposition C.8 (A-RFD for the Matérn Covariance). If J is isotropic with Matérn covariance (14) such that ν = p + 1 2, then the step size of A-RFD for p {1, 2} is given by 3 J(x) µ J(x) 5 J(x) µ J(x) Proof. Noting Θ = J(x) µ J(x) , we have by Definition 5.1 of A-RFD for p = 1 ˆη = C(0) C (0)Θ (17) = s2 and in the case p = 2 ˆη = C(0) C (0)Θ (18) = 3s2 In this section we prove all the claims made in the main body. D.1 Section 2: Random function descent D.1.1 Formal RFD As we mentioned in a footnote at the definition of RFD, the fact that the parameters become random variables as they are selected by random gradients poses some mathematical challenges which would have been distracting to address in the main body. In following paragraphs leading up to Definition D.1 we introduce and discuss the probability theory required to provide a mathematically sound definition. For a fixed cost distribution PJ and any weight vectors w and w the conditional distribution E[J( w) | J(w), J(w)] is by its axiomatic definition a (J(w), J(w))-measurable random variable. By the factorization lemma [30, Cor. 1.9.7], there therefore exists a measurable function (j, g) 7 ϕw, w(j, g) such that the following equation holds almost surely ϕw,θ(J(w), J(w)) = E[J( w) | J(w), J(w)]. (19) Since it is possible to calculate ϕw, w explicitly in the Gaussian case (cf. G.1), the function ΦPJ : (Rd R Rd) R (w, j, g) 7 argmin w ϕw, w(j, g), which implements some tie-breaker rules for set valued argmin is measurable when J is Gaussian and its covariance function is sufficiently smooth. To prove measurability in the general case is a difficult problem of its own, which we do not attempt to solve here, since we would not utilize the conditional expectation outside of the Gaussian case anyway (cf. Section E.3). For deterministic w, we therefore have ΦPJ(w, J(w), J(w)) = argmin w ϕw, w(J(w), J(w)) (19) = argmin w E[J( w) | J(w), J(w)]. So if the parameter vectors wn were deterministic, our formal definition of RFD and our initial definition would coincide. But for random weights W (19) stops to hold in general3, i.e. ϕW, w(J(W), J(W)) = E[J( w) | J(W), J(W)]. If this equation does not need to hold, we similarly have in general ΦPJ(W, J(W), J(W)) = argmin w E[J( w) | J(W), J(W)]. So the following definition is not just a restatement of the original definition of RFD. Definition D.1 (Formal RFD). 
For a Gaussian random cost function J, we define the RFD algorithm with starting point W0 = w0 Rd by Wn+1 := ΦPJ(Wn, J(Wn), J(Wn)) This is what we effectively do in Theorem 4.2 under the additional isotropy assumption, where we calculate the argmin under the assumption that w is deterministic (i.e. we determine ΦPJ), before we plug-in the random variables Wn to obtain Wn+1. Similarly this is how the step size prescriptions of RFD actually work. We first assume deterministic weights and later plug the random variables into our formulas. For this reason, we avoided large letters indicating random variables for parameters w in the main body. D.1.2 Scale invariance Advantage 2.3 (Scale invariance). RFD is invariant to additive shifts and positive scaling of the cost J. RFD is also invariant with respect to transformations of the parameter input of J by differentiable bijections whose Jacobian is invertible everywhere (e.g. invertible linear maps). Before we get to the proof, let us quickly formulate the statement in mathematical terms. Let wn be the parameters selected optimizing J starting in w0 and wn the parameters selected by the same optimizer optimizing J starting in w0. If we apply affine linear scaling to cost J such that J(w) = a J(w) + b and start optimization in the same point, i.e. w0 = w0, then we expect a scale invariant optimizer to select If we scale inputs on the other hand (or more generally map them with a bijection φ), then we expect for J := J φ and starting point w0 = φ 1(w0), that this relationship is retained by a scale invariant optimizer, i.e. wn = φ 1(wn). Why do we use a different starting point? As an illustrating example, assume that φ maps miles into kilometers. Then J accepts miles, while J accepts kilometers. Then we have to map the initial starting point w0 of J measured in kilometers into miles w0. φ 1 is precisely this transformation from kilometers into miles. A scale invariant optimizer should retain this relation, i.e. no matter if the input is measured in miles or kilometers the same points are selected. Proof. The following proof will be split into three parts. The first two parts of the proof will address a more general audience and ignore the mathematical subtleties we discussed in Section D.1.1. In the third part we explain to the interested probabilists how to resolve these issues. 1. Invariance with regard to affine linear scaling Let J(w) := a J(w) + b where a > 0 and b R and assume w0 = w0. With the induction start given, we only require the induction step to prove wn = wn. 3E.g. consider the random variable W = argmin J 1J( w)>0 + argmax J 1J( w)<0. In this case, J(W) is much more informative of J( w) than J(w) at some deterministic w. For the induction step, we assume this equation holds up to n. Since φ(x) = ax + b is a measurable bijection, the sigma algebra4 generated by ( J(wn), J(wn)) = (φ J(wn), a J(wn)) is therefore equal to the sigma algebra generated by (J(wn), J(wn)). This implies wn+1 = argmin w E[ J(w) | J( wn), J( wn)] induction = argmin w E[ J(w) | J(wn), J(wn)] sigma alg. = argmin w E[ J(w) | J(wn), J(wn)] linearity = argmin w a E[J(w) | J(wn), J(wn)] + b monotonicity = argmin w E[J(w) | J(wn), J(wn)] def. = wn+1 Where we have used the linearity of the conditional expectation and the strict monotonicity of φ(x) = ax + b. 2. Invariance with regard to certain input bijections Let φ be a differentiable bijection whose jacobian is invertible everywhere and assume J := J φ. 
Since φ is a bijection, φ(M) is the domain of J whenever M is the domain of J. For a starting point w0 φ(M) we now assume w0 = φ 1(w0) M and are again going to prove the claim wn = φ 1(wn). by induction. Assume that we have this claim up to n. Then we have by induction J( wn) = J φ(φ 1(wn)) = J(wn) (21) and J( wn) = wn(J φ( wn)) = φ ( wn)( J)(φ( wn)) = φ ( wn) J(wn). Since φ ( wn) is invertible by assumption, the sigma algebras generated by ( J( wn), J( wn)) and J(wn), J(wn) are identical. But this results in the induction step wn+1 = argmin w M E[ J(w) | J( wn), J( wn)] sigma alg. = argmin w M E[ J(w) | J(wn), J(wn)] def. = argmin w M E[J φ(w) | J(wn), J(wn)] = φ 1 argmin θ φ(M) E[J(θ) | J(wn), J(wn)] where we simply optimize over θ = φ(w) instead of w and correct the argmin at the end. 3. Addressing the subtleties In equation (20) we have really proven for deterministic w ΦP J(w, J(w), J(w)) = ΦPJ(w, J(w), J(w)). But this implies with the induction assumption Wn = Wn Wn+1 = ΦP J( Wn, J( Wn), J( Wn)) ind. = ΦPJ(Wn, J(Wn), J(Wn)) = Wn+1. 4if you are unfamiliar with sigma algebras read them as information . Similarly we have proven in (22) that ΦP J φ 1(w), J(φ 1(w)), J(φ 1(w)) = φ 1 ΦPJ(w, J(w), J(w)) . By the induction assumption W = φ 1(Wn), this implies Wn+1 = ΦP J( Wn, J( Wn), J( Wn)) ind. = ΦP J φ 1(Wn), J(φ 1(Wn)), J(φ 1(Wn)) = φ 1 ΦPJ(Wn, J(Wn), J(Wn)) = φ 1(Wn+1). D.2 Section 4: Relation to gradient descent Lemma 4.1 (Explicit first order stochastic Taylor approximation). For J N(µ, C), the first order stochastic Taylor approximation is given by E[J(w d) | J(w), J(w)] = µ + C d 2 C(0) (J(w) µ) C d 2 C (0) d, J(w) . Proof. (J(w), J(w), J(w d)) is a Gaussian vector for which the conditional distribution is well known. It is only necessary to calculate the covariance matrix. The key ingredient here is to observe that J(w), 1J(w), . . . , d J(w) are all independent, trivializing matrix inversion. More formally, by Lemma G.2 we have Cov J(w) J(w) = C(0) C (0)Id d Cov J(w d), J(w) J(w) 2 ) C ( d 2 By Theorem G.1 we therefore know that E[J(w d) | J(w), J(w)] = µ + 2 ) C ( d 2 !T C(0) C (0)Id d 1 J(w) µ J(w) which immediately yields the claim. Theorem 4.2 (Explicit RFD). Let J N(µ, C), then RFD coincides with gradient descent wn+1 = wn η n J(wn) J(wn) , where the RFD step sizes are given by η n := argmin η R C(0) (J(wn) µ) η C η2 C (0) J(wn) . (1) Proof. The explicit version of RFD follows essentially by fixing the step size η = d and optimizing over the direction first. With Lemma 4.1 we have min d E[J(w d) | J(w), J(w)] = min η 0 min d: d =η µ + C η2 C(0) (J(w) µ) C η2 C (0) d, J(w) = min η 0 µ + C η2 C(0) (J(w) µ) C η2 max d: d =η d, J(w) C ( η2 2 ) C (0) 0 min d: d =η d, J(w) C ( η2 2 ) C (0) < 0. By Lemma G.3 and Corollary G.4 the maximizing or minimizing step direction is then given by d(η) = η J(w) Where it is typically to be expected, that we have a positive sign. Since that depends on the covariance though, we avoid this problem with the following argument: Since η only appears as η2 in the remaining equation, we can optimize over η R in the outer minimization instead of over η 0 to move the sign into the step size η and set without loss of generality d(η) = η J(w) Since d(η), J(w) = η J(w) the remaining outer minimization problem over the step size is then given by min η R C η2 C(0) (J(w) µ) η C η2 C (0) J(w) , Its minimizer is by definition the RFD step size as given in the Theorem. D.3 Section 5: RFD-step sizes Proposition D.2 (Tayloring the step size optimization problem). 
The second order Taylor approximation of the step size optimization problem qΘ(η) = C η2 C(0) η C η2 around zero is given by T2qΘ(η) = 1 ηΘ + η2 C (0) 2C(0) minimized by ˆη := argmin η T2qΘ(η) = C(0) C (0)Θ. Furthermore, the Taylor residual is bounded by q(η) T2q(η) η3c0 η with c0 = 1 2 max{supθ [0,1] |C (θ)|, |C (0)|}( 1 C(0) + 1 |C (0)|) < . Proof. Using the Taylor approximation with the mean value reminder for C, we get 2 = C(0) + C (0) η2 2 = C (0) + C (θ1) η2 for some θ1, θ2 [0, η2 2 ]. This implies q(η) 1 + C (0) | {z } =:T2qΘ(η) By the following error the optimistically defined T2qΘ(η) is really the second Taylor approximation (which can be confirmed manually, but we deduce it by arguing that its residual is in O(η3)). More specifically, q(η) T2q(η) η3 sup θ [0, η2 2 ] |C (θ)| 2C(0) η 4 + sup θ [0, η2 2 ] |C (θ)| 2|C (0)| Θ Lem. D.8 It is easy to see for J(w) < µ that T2q(η) is a convex parabola due to C (0) < 0. We thus have ˆη := argmin η T2qΘ(η) = C(0) C (0)Θ. Theorem D.3 (Details of Proposition 5.2). Let J N(µ, C) and assume there exists η0 > 0 such that the correlation for larger distances η η0 are bounded smaller than 1, i.e. C(η2/2) C(0) < ρ (0, 1). Then there exists K, Θ0 > 0 such that for all Θ < Θ0 ˆη(Θ) 1 + KΘ. In particular we have η (Θ) ˆη(Θ) as Θ 0 or equivalently as ˆη 0. Proof. This follows immediately from Lemma D.4, Lemma D.5 and Lemma D.6. Corollary 5.3. Assume η 0 implies Θ 0, the cost J is bounded, has continuous gradients and RFD converges to some point w . Then w is a critical point and the RFD step sizes η are asymptotically equal to ˆη. Proof. Assuming RFD converges, its step sizes η converge to zero. But this implies Θ 0 by assumption, i.e. Since J(w) is bounded, this implies J(w) 0 and by continuity the of the gradient, it is zero in its limit. Thus we converge to a stationary point. The asymptotic equality follows by Lemma D.4 and Lemma D.5, as we know η converges so we do not require the assumptions of Lemma D.6. D.3.1 Locating the Minimizer In the following we want to rule out locations for the RFD step size η by proving qΘ(η) > qΘ(ˆη) for a wide range of η. For this endeavour the relative position of the step size η relative to ˆη is a useful re-parametrization η := η(λ) = λˆη. Due to ˆη = C(0) C (0)Θ we obtain T2qΘ(η) = 1 ηΘ + η2 C(0) = 1 + λ( λ On the other hand we have for the bound |qΘ(η) T2qΘ(η)| λ3ˆη3c0 λ C(0) 4|C (0)| + 1 Θ Since ˆη = η(1) we thus obtain qΘ(η) qΘ(ˆη) ˆηΘ T2qΘ(η) |qΘ(η) T2qΘ(η)| T2qΘ(ˆη) |qΘ(ˆη) T2qΘ(ˆη)| ˆη2c0 h λ3 λ C(0) 4|C (0)| + 1 + C(0) 4|C (0)| + 1 i 2(1 λ)2 ˆη2c0 h λ3 λ C(0) 4|C (0)| + 1 + C(0) 4|C (0)| + 1 i . (23) This equation will be the basis of a number of lemmas ruling out various step sizes as minimizers. Lemma D.4 (Ruling out small step sizes). If the step size is (much) smaller than the asymptotic step size ˆη = ˆη(Θ), then it can not be a minimizer. More specifically ˆη [0, 1 c1Θ) = qΘ(η) > qΘ(ˆη) where c1 := 2 C(0) c0 C(0) 4|C (0)| + 1 < . Proof. Here we consider the case η ˆη, i.e. λ [0, 1]. By (23) we have qΘ(η) qΘ(ˆη) 2(1 λ)2 ˆη2c0 h λ3 λ C(0) 4|C (0)| + 1 + C(0) 4|C (0)| + 1 i 2(1 λ)2 2ˆη2c0 C(0) 4|C (0)| + 1 for which (1 λ)2 > 4ˆη2c0 C(0) 4|C (0)| + 1 is sufficient or equivalently c0 C(0) 4|C (0)| + 1 = 1 Θ 2 C(0) c0 C(0) 4|C (0)| + 1 | {z } =:c1 So for λ [0, 1 Θc1) we have qΘ(η) > qΘ(ˆη). Lemma D.5 (Ruling out medium sized step sizes as minimizer). For c2 = 2c1 and Θ Θ0 := 1 5c1 , we have η ˆη 1 + c2Θ, 1 c2Θ = qΘ(η) > qΘ(ˆη) Proof. Here we consider the case λ 1, i.e. η > ˆη. 
Again starting with (23) we get 2(1 λ)2 ˆη2c0 h λ3 λ C(0) 4|C (0)| + 1 + C(0) 4|C (0)| + 1 i 2(λ 1)2 2λ4ˆη2c0 C(0) 4|C (0)| + 1 λ 1 > 2λ2ˆη c0 C(0) 4|C (0)| + 1 = c1Θλ2 or equivalently λ 1 c1Θλ2 > 0 is sufficient. Note that this is a concave parabola in λ. So it is positive between its zeros which are characterized by c1Θλ2 λ + 1 = 0. They are thus given by λ1/2 = 1 1 4c1Θ So whenever λ (λ1, λ2) we have that qΘ(η) > qΘ(ˆη). In particular for 4c1Θ 1 or equivalently Θ 1 4c1 we have λ2 1 2c1Θ = 1 c2Θ To get a bound on λ1 note that the original equation was essentially λ 1 + c1Θλ2 with equality for λ = λ1, if Θ is reduced, the inequality remains, which implies that λ1 is decreasing with Θ. So assuming the inequality is satisfied for a particular λ e.g. λ = 2 which requires 2 1 + 2c1Θ Θ then we know that λ1 2 for all smaller Θ. This implies for Θ Θ0 = 1 5c1 λ1 = 1 + c1Θλ2 1 1 + 2c1 |{z} c2 Θ. Lemma D.6 (Ruling out large step sizes as minimizer). If there exists step size η0 > 0 such that the correlation is bounded by some ρ < 1, i.e. C(0) ρ (0, 1), for larger step sizes η η0, then there exist Θ0 > 0 such that for all Θ < Θ0 η ˆη 1 + c2Θ, = q(η) > q(ˆη), where c2 is the constant from Lemma D.5. Proof. The upper bound 1 c2Θ in Lemma D.5 is only due to the loss of precision of the Taylor approximation. To remove it, we take a closer look at the actual qΘ itself. We have the following bound for our asymptotic minimum Θ T2qΘ(ˆη) + |qΘ(ˆη) T2qΘ(ˆη)| 2 ˆη + ˆη3 c0 C(0) 4|C (0)| + 1 | {z } =:c3 Which means we have for qΘ(η) qΘ(ˆη) C (0) ˆη3c3 C (0) ˆη3c3 C (0) ˆη3c3 where we use the assumption that there exists ρ (0, 1) such that ρ C η2 C(0) for all η η0 and the fact that we only need to consider η 1 c2Θ (due to Lemma D.5) which allows a translation of η0 into some maximal Θ0. Note that ˆη Θ vanishes as Θ 0, so eventually the term (1 M) 1 Θ dominates. Selecting Θ0 small small enough is thus sufficient to cover everything that is not already covered by Lemma D.5. D.3.2 Technical bounds Lemma D.7 (Bound on the first derivative of the covariance). sup η 0 |C η2 Proof. Since we have Cov(Dv J(x), J(y)) = C x y 2 we have for a standardized vector v = 1 and x y = ηv by Cauchy-Schwarz 2 η| = | Cov(Dv J(x), J(y))| C.S. p Var(Dv J(x)) Var(J(y)) = p As the bound is independent of η this yields the claim. Lemma D.8 (Bound on the second derivative of the covariance). sup θ 0 |C (θ)| max n sup θ [0,1] |C (θ)|, |C (0)| o Proof. Note that Cov(Dv J(x), Dw J(y)) = C x y 2 2 x y, v x y, w C x y 2 Selecting v, w as orthonormal vectors (e.g. v = e1, w = e2) and x y := η(v + w) for some η > 0 results in x y 2 = 2η2 and thus by the Cauchy-Schwarz inequality C (η2)η2 = Cov(Dv J(x), Dw J(y)) C.S. p Var(Dv J(x)) Var(Dw J(y)) = p This implies the claim. D.4 Section 6: Stochastic loss Lemma D.9. The stochastic approximation errors ϵi(w) := ℓi(w) J(w) are identically distributed, centered random functions, which are independent conditional on f. In particular, E[ϵi(w)ϵj( w)] = E[ϵi(w)ϵj( w) | f] = 0 j = i. Proof. The ϵi are independent random functions conditional on f, since for any n N, any bounded measurable functions h and g E h h ϵi(w1), . . . , ϵi(wn) g ϵj(w1), . . . , ϵj(wn) | f i = h h ϵi(w1), . . . , ϵi(wn) E h g ϵj(w1), . . . , ϵj(wn) | f, Xi, ςi i ( ) = E g(ϵj(w1),...,ϵj(wn))|f = E h h ϵi(w1), . . . , ϵi(wn) | f i E h g ϵj(w1), . . . , ϵj(wn) | f i , where ( ) uses the fact that ϵj does not depend on the independent Xi, ςi. 
Since almost by definition E[ϵi | f] = E[ℓ( , Xi, Yi) | f] J( ) = 0, the stochastic approximation errors are thus uncorrelated E[ϵiϵj] = E h E[ϵiϵj | f] i = E h E[ϵi | f]E[ϵj | f] i = 0. Extension 6.2 (S-RFD). For loss J N(µ, C) and stochastic errors ϵi iid N(0, Cϵ) we have argmin d E[J(w d) | Lb(w), Lb(w)] = η (Θ) Lb(w) with the same step size function η as for RFD, but modified Θ Θ = C (0) C (0) + 1 b C ϵ(0) C(0) + 1 b Cϵ(0) C(0) Lb(w) µ Lb(w). Proof. Since ϵi are conditionally independent between each other and to J, as entire functions, the same holds true for ϵi. As all the mixed covariances disappear we have Cov Lb(w) Lb(w) = Cov J(w) J(w) i=1 Cov ϵi(w) ϵi(w) = C(0) C (0)Id d Cϵ(0) C ϵ(0)Id d b Cϵ(0) C (0) + 1 b C ϵ(0) Id d. by Lemma G.2. If you want to break up the first step we recommend considering individual entries of the covariance matrix to convince yourself that all the mixed covariances disappear. Together with the fact Cov J(w d), Lb(w) Lb(w) = Cov J(w d), J(w) J(w) i=1 Cov J(w d), ϵi(w) ϵi(w) 2 ) C ( d 2 The rest is analogous to Lemma 4.1 and Theorem 4.2, so we only sketch the remaining steps. Applying Theorem G.1 as in Lemma 4.1 we obtain a stochastic version of the stochastic Taylor approximation ( stochastic2 Taylor approximation perhaps?) E[J(w d) | Lb(w), Lb(w)] = µ + C d 2 b Cϵ(0)(Lb(w) µ) C d 2 b C ϵ(0) d, Lb(w) . Minimizing this subject to a constant step size as in Theorem 4.2 results in η = argmin η R b Cϵ(0)(Lb(w) µ) η C d 2 b C ϵ(0) Lb(w) = argmin η R C d 2 C(0) η C d 2 b C ϵ(0) C(0) C(0) + 1 b Cϵ(0) Lb(w) µ Lb(w), where we divided the term by C(0) C(0)+ 1 b Cϵ(0) 1 µ Lb(w) 0 to obtain the last equation. The claim follows by definition of η (Θ) and our redefinition of Θ. E Extensions In this section we present a few possible extensions to Theorem 4.2, which are all composable, i.e. it is possible to combine these extensions without any major problems (including S-RFD, i.e. Extension 6.2). E.1 Geometric anisotropy/Adaptive step sizes In this section, we discuss the generalization of isotropy to geometric anisotropies [50, p. 17], which provide good insights into the inner workings of adaptive learning rates (e.g. Ada Grad [14] and Adam [29]). Definition E.1 (Geometric Anisotropy). We say a random function J exhibits a geometric anisotropy , if there exists an invertible matrix A such that J(x) = g(Ax) for some isotropic random function g. This implies that the expectation of J is still constant (E[J(x)] = E[g(Ax)] = µ) and the covariance function of J is given by Cov(J(x), J(y)) = Cov(g(Ax), g(Ay)) = C A(x y) 2 = C x y 2 AT A 2 where Σ is the norm induced by x, y Σ := x, Σy for some strictly positive definite matrix Σ = AT A. Here (24) characterizes the set of random functions with a geometric anisotropy in the Gaussian case, because for an J with such a covariance we can always obtain an isotropic g by g(x) := J(A 1x). This is the whitening transformation we suggest looking for in order to ensure isotropy in the context of scale invariance (Section 2). An important observation is, that Theorem F.2 implies that J is still stationary, so the distribution of J is still invariant to translations. If stationarity is a problem, this is therefore not the solution. But geometric anisotropies are a beautiful model to explain preconditioning and adaptive step sizes. For this, we first determine the RFD steps. Extension E.2 (RFD steps under geometric anisotropy). 
Let J be a Gaussian random function which exhibits a geometric anisotropy A and is based on an isotropic random function g N(µ, C). Then the RFD steps are given by η Σ 1 J(w) Σ 1 J(w) Σ = argmin d E[J(w d) | J(w), J(w)] η = argmin η qΘ(η) where Θ = Σ 1 J(w) Σ Proofsketch. There are two ways to see this. Either we apply scale invariance (Advantage 2.3) directly to translate the steps on g into steps on J. Alternatively one can manually retrace the steps of the proof. Details in Subsection E.1.1 The step direction is therefore and Σ 1 acts as a preconditioner. So how would one obtain Σ? As it turns out the following holds true (by Lemma G.2) E[ J(w) J(w)T ] = AT E[ g(w) g(w)T ]A = AT ( C (0)I)A = C (0)Σ In their proposal of the first adaptive method, Ada Grad, Duchi et al. [14] suggest to use the matrix k=1 J(wk) J(wk)T , which is basically already looking like an estimation method of Σ. They then restrict themselves to diag(Gt) due to the computational costs of a full matrix inversion. This results in entry-wise ( adaptive ) learning rates. Later adaptive methods like RMSProp [22], Ada Delta [57] and in particular Adam [29] replace this sum with an exponential mean estimate, i.e. in the case of Adam the decay rate β2 is used to get an exponential moving average vt = β2vt 1 + (1 β2) diag( J(wt) J(wt)T ) = β2vt 1 + (1 β2)( J(wt))2. They then take the expectation E[vt] = E h (1 β2) k=1 βt k 2 J(wk)2i = E h (1 β2) k=1 βt k 2 J(wk)2i = (1 βt 2) E[ J(xt)2] | {z } diag(Σ) So ˆvt = vt/(1 βt 2) in the Adam optimizer is essentially an estimator for diag(Σ). It is noteworthy, that Kingma and Ba [29] already used the expectation symbol. This is despite the fact, that they did not yet model the optimization objective J as a random function. We can not yet explain why they then use the square root of their estimate diag(Σ) 1/2 instead of diag(Σ) 1 itself. This might have something to do with the fact that the estimation of Gt happens online and the J(wk) are therefore highly correlated. Another reason might be that the inverse of an estimator has different properties than the estimator itself. Finally, the fact that only the diagonal is used might also be the reason, if the preconditioner diag(Σ) 1/2 is simply better when we restrict ourselves to diagonal matrices. E.1.1 Proof of Extension E.2 Since the application of scale invariance provides no intuition, we provide a proof which retraces some of the steps of the original proof. Recall, that for an isotropic random function g we have the stochastic Taylor approximation E[g(w d) | g(x), g(x)] = µ + C d 2 C(0) (g(w) µ) + C d 2 C (0) d, g(w) This implies for a random function with geometric anisotropy J(w) = g(Aw) that E[J(w d) | J(w), J(w)] = E[g(A(w d)) | g(Aw), g(Aw)] = µ + C Ad 2 C(0) (g(Aw) µ) C Ad 2 C (0) Ad, g(Aw) = µ + C d 2 Σ 2 C(0) (J(w) µ) C d 2 Σ 2 C (0) d, AT g(Aw) | {z } = J(w) with Σ := AT A. As in the original proof, we now optimize over the direction first, while keeping the step size constant, although we now fix the step size with regard to the norm Σ (which basically means that we still do the optimization in the isotropic space). Note that max d d, J(x) s.t. d Σ = η is equivalent to max d d, Σ 1 J(x) Σ s.t. d Σ = η which is solved by η Σ 1 J(x) Σ 1 J(x) Σ The remainder of the proof is exactly the same as in the original. 
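To summarize the preceding discussion in code, the following minimal numpy sketch computes the anisotropic step direction of Extension E.2 and an Adam-style moving average estimate of diag(Σ). The helper names are illustrative, and a full matrix Σ⁻¹ is assumed to be available, which is precisely what the adaptive methods discussed above avoid by restricting themselves to the diagonal.

```python
import numpy as np

def anisotropic_rfd_direction(grad, Sigma_inv):
    """Step direction of Extension E.2: Sigma^{-1} grad, normalized in the Sigma^{-1} norm."""
    precond = Sigma_inv @ grad
    norm = np.sqrt(grad @ precond)   # ||grad||_{Sigma^{-1}} = sqrt(grad^T Sigma^{-1} grad)
    return precond / norm

def update_v(v, grad, beta2=0.999):
    """Adam-style exponential moving average of squared gradients.

    Up to the bias correction 1/(1 - beta2**t) and the constant factor -C'(0),
    this estimates diag(Sigma), cf. the discussion of adaptive methods above.
    """
    return beta2 * v + (1.0 - beta2) * grad ** 2

# One hypothetical update would then read
#   w_new = w - eta_star * anisotropic_rfd_direction(grad, Sigma_inv),
# where eta_star is the isotropic RFD step size evaluated at
#   Theta = ||grad||_{Sigma^{-1}} / (mu - J(w)).
```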
E.2 Conservative RFD In the first paragraph of Section 2 we motivated the relation between RFD and classical optimization with the observation, that gradient descent is the minimizer of a regularized first order Taylor approximation 1 L J(w) = argmin d T[J(w d) | J(w), J(w)] + L This regularized Taylor approximation is in fact an upper bound on our function under the Lsmoothness assumption [38], i.e. J(w d) T[J(w d) | J(w), J(w)] + L An improvement on of this upper bound compared to J(x) therefore guarantees an improvement of the loss. This guarantee was lost with the conditional expectation (on purpose, as we wanted to consider the average case). Losing this guarantee also makes convergence proofs more difficult as they typically make use of this improvement. In view of the confidence intervals of Figure 1, it is natural to ask for a similar upper bound in the random setting, where this can only be the top of an confidence interval. This is provided in the following theorem Lemma E.3 (An γ-upper bound). We have P J(w d) E[J(w d) | J(w), J(w)] + ργ( d 2) γ for ργ(η2) := Φ 1(γ)σ(η2) with σ2(η2) := C(0) C η2 where Φ is the cumulative distribution function (cdf) of the standard normal distribution. Proof. Note that the conditional variance is with the usual argument about the covariance matrices (cf. the proof of Thoerem 4.2) using Lemma G.2 and an application of Theorem G.1 given by σ2( w 2) := Var[J(w d) | J(w), J(w)] = C(0) C d 2 Since the conditional distribution is normal (by Theorem G.1), we have J(w d) E[J(w d) | J(w), J(w)] σ( w 2) N(0, 1). But this implies the claim P J(w d) E[J(w d) | J(w), J(w)] + ργ( d 2) = P J(w d) E[J(w d) | J(w), J(w)] σ( w 2) Φ 1(γ) = Φ(Φ 1(γ)) = γ. To avoid the Gaussian assumption, one could apply the Markov inequality instead, or another applicable concentration inequality. Using this upper bound, we obtain a natural conservative extension of RFD Extension E.4 (γ-conservative RFD). Let J N(µ, C) and ργ(η2) = Φ 1(γ)σ(η2), where σ is the conditional standard deviation as defined in Lemma E.3. Then the conservative RFD step direction is given by J(w) = argmin d E[J(w d) | J(w), J(w)] + ργ( d 2) and the γ-conservative RFD step size is given by η = argmin η C(0) (J(w) µ) η C η2 C (0) J(w) + ργ(η2). Proof. The proof is the same as in Theorem 4.2 with Lemma 4.1 replaced by Lemma E.3. Taking multiple steps should generally have an averaging effect, so we expect faster convergence for almost risk neutral minimization of the conditional expectation (i.e. γ 1 2). Here γ is a natural parameter to vary conservatism. In a software implementation it might be a good idea to call this parameter conservatism and rescale it to be in [0, 1] instead of [ 1 2, 1]. But formulas look cleaner with γ. In Bayesian optimization it is much more common to reverse this approach and minimize a lower confidence bound ( conservatism < 0 or γ < 1 2) in order to encourage exploration. But since RFD is forgetful, this is not a good idea for RFD. Remark E.5 (Conservative RFD coincides asymptotically with RFD in high dimension). While conservative RFD might seem like a good approach to fix the instability of RFD under the isotropy assumption on some optimization problems, the variance generally vanishes in high dimension [see 5] and conservative RFD coincides asymptotically with RFD. We therefore believe that the underlying issue is not an overly risk-affine algorithm but rather that distributional assumptions, in particular the stationarity assumption, are violated when instabilities occur (cf. 
Section F). Nevertheless, conservative RFD might be a good approach for lower dimensional, risk-sensitive applications. E.3 Beyond the Gaussian assumption In this section we sketch how the extension beyond the Gaussian case using the best linear unbiased estimator BLUE [e.g. 27, ch. 7] works. For this we recapitulate what a BLUE is. A linear estimator ˆY of Y using X1, . . . , Xn is of the form ˆY span{X1, . . . , Xn} + R. The set of unbiased linear estimators is defined as LUE = LUE [Y | X1, . . . , Xn] = { ˆY span{X1, . . . , Xn} + R : E[ ˆY ] = E[Y ]} (25) = { ˆY + E[Y ] : ˆY span{X1 E[X1], . . . , Xn E[Xn]}}. And the BLUE is the best linear unbiased estimator, i.e. BLUE[Y | X1, . . . , Xn] := argmin ˆY LUE E[ ˆY Y 2]. (26) Other risk functions to minimize are possible, but this is the usual one. Lemma E.6. If X, Y1, . . . , Yn are multivariate normal distributed, then we have BLUE[Y | X1, . . . , Xn] = E[Y | X1, . . . , Xn] = argmin ˆY {f(X1,...,Xn):f meas.} E[ Y ˆY 2] . Proof. We observe that the conditional expectation of Gaussian random variables is linear (Theorem G.1). So as a linear function its L2 risk must be larger or equal to that of the BLUE. And as an L2 projection [30, Cor. 8.17] the conditional expectation was already optimal. If we now replace the conditional expectation with the BLUE, then all our theory remains the same because the result in Theorem G.1 remains the BLUE for general distributions [27]. Instead of minimizing min d E[J(w d) | J(w), J(w)] we can therefore always minimize min d BLUE[J(w d) | J(w), J(w)] without the Gaussian assumption and all our results can be translated to this case. The reader only needs to replace all mentions of Theorem G.1 with the BLUE equivalent and replace all idependence claims with uncorrelated . F Input invariance In this section we generalize the notion of isotropy to non-stationary isotropy and discuss why we believe this generalization is necessary. Recall that we motivated isotropy as an invariant distribution with regard to isometric transformations of the input. In particular its distribution stays invariant with regard to translations (also known as stationarity), which we do not believe plausible for cost functions, because the cost at zero J(0) behaves fundamentally different from the cost of any other parameter vector. In the following we will therefore generalize this notion to general input invariant distributions. And we will discuss their applicability to machine learning after we characterize the named categories. Definition F.1 (Input Invariance). A random function f is Φ-input invariant, if 5 Pf = Pf φ φ Φ. For certain sets of Φ we give these Φ-input invariant distributions names 5The input to a random function is somewhat ambiguous since it is a random variable, i.e. function from the probability space Ωinto function space, so its first input should be ω Ω. Formally, the definition should therefore be: For all measurable sets of functions A Pf(φ 1 (A)) = Pf(A) φ Φ where φ : f 7 f φ denotes the pullback. But this is less helpful for an intuitive understanding. If Φ is the set of isometries, we call f (stationary) isotropic. If Φ is the set of translations, we call f stationary. If Φ is the set of linear isometries, we call f non-stationary isotropic. We further say a random function f is n-weakly Φ-input invariant, if for all φ Φ, all k n and all xi E[f(φ(x1)) f(φ(xk))] = E[f(x1) f(xk)]. 
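The following minimal sketch illustrates the distinction numerically: a kernel of the non-stationary isotropic form is invariant under a random linear isometry but not under translations. The particular kernel is chosen purely for illustration and anticipates the linear regression example of Section F.1 below.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def k(x, y):
    # non-stationary isotropic kernel of the form kappa(||x||^2/2, ||y||^2/2, <x, y>);
    # this particular choice (const + 4 <x, y>) anticipates Section F.1
    return 1.0 + 4.0 * (x @ y)

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))      # random linear isometry (orthogonal Q)
x, y, c = (rng.standard_normal(d) for _ in range(3))  # two inputs and a translation vector

print(np.isclose(k(Q @ x, Q @ y), k(x, y)))  # True: invariant under rotations/reflections
print(np.isclose(k(x + c, y + c), k(x, y)))  # generally False: not translation invariant
```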
Since second moments fully determine Gaussian distributions, 2-weakly input invariance is special, because it is equivalent to full input invariance in the Gaussian case. So an omitted n equals 2. Weakly isometry invariant naturally becomes weakly isotropic , etc. While stationary and stationary isotropic random functions are well known [e.g. 44, 1], we are not aware of research on non-stationary isotropy although we doubt the concept is new. It turns out that the different notions of input isometry have simple characterizations in terms of the covariance functions. We present these in Theorem F.2 of which the stationary isotropic and stationary case are already well known. Theorem F.2 (Characterization of Weak Input Invariances). Let f : Rd R be a random function, then f is 1. weakly stationary, if and only if there exists µ R and function C : Rd R such that for all x, y µf(x) = µ, Cf(x, y) = C(x y). 2. weakly non-stationary isotropic, if and only if there exist functions µ : R 0 R and κ : D R with D = {λ R2 0 R : |λ3| 2 λ1λ2} R3. such that for all x, y µf(x) = µ x 2 Cf(x, y) = κ x 2 3. weakly stationary isotropic, if and only if there exists µ R and a function C : R 0 R such that for all x, y µf(x) = µ, Cf(x, y) = C x y 2 Proof. The proof essentially follow as a corollary from a characterization of isometries (Proposition F.4). For details see Subsec F.2. Non-stationary isotropy is therefore a generalization of stationary isotropy. It allows the zero parameter vector to have special meaning because the distribution is only invariant to linear isometries (i.e. rotations and reflections) which keep the zero in place. It is important to highlight, that a geometric anisotropy (Section E.1) retains stationarity, while breaking non-stationary isotropy. A similar geometric generalization could also be applied to nonstationary isotropy. Another important observation is the fact, that non-stationary isotropy coincides with stationary isotropy on the sphere. I.e. when x and y are constant, the function only depends on x y and the mean is also constant. In other words, we have stationary isotropy on the sphere. Isotropy might therefore get by as an assumption in machine learning, as parameters are typically initialized on the sphere. This is because Glorot intialization [18] samples parameter entries independently, so their squared norm i=1 (w(i))2 is a sum of independent random variables which are normalized such that a law of large numbers applies. Up to small variance their lengths are therefore all the same, and are placed on a sphere at this radius. If we leave this sphere, this equivalence stops being true. Weight normalization [46], batch normalization [25], weight decay [e.g. 19] or equivalently L2 regularization, etc. might all contribute to keep this assumption intact. But in the following section we will see, that even simple linear regressions considered by researches investigating the average case behavior on quadratic functions [e.g. 58, 43, 33, 12, 9, 40, 41], require non-stationary isotropy. Moreover the covariance kernels suggested by investigations into random neuronal networks [e.g. 54, 8] are also non-stationary isotropic but not stationary isotropic. F.1 Random linear regression In this section, we determine the distribution of the cost function induced by a simple linear regression. For this we define the mean squared sample loss ℓi(w) = (Y fw(X))2, where the random data X is mapped by the true relationship f to labels Y = f(X) and fw(x) = x, w is a linear model. 
If the true relationship f is also a random linear function f(x) = θ, x with random signal θ N(0, I) independent of input X N(0, I), then the cost function is given by J(w) = E[ℓi(w) | f] = E[ θ w, X 2 | θ] = (θ w)T E[XXT ](θ w) Lemma F.3. The expectation and covariance of J are given by E[J(w)] = const. + w 2 Cov(J(w), J( w)) = const. + 4 w, w In particular, the cost J is non-stationary isotropic, but not stationary isotropic. Proof. Its expectation is given by E[J(w)] = E[ θ w 2] = E[ θ 2] 2 E[θ] |{z} =0 = const. + w 2 In particular it is not constant, but dependent on w 2, which means that we do not have stationary isotropy. But there is still hope for non-stationary isotropy, and this is essentially true as can be seen by calculating Cov(J(w), J( w)) = E h (J(w) E[J(w)])(J( w) E[J( w)]) i = E h ( θ 2 E θ 2 2 θ, w )( θ 2 E θ 2 2 θ, w ) i = Var( θ 2 E θ 2) 2E[( θ 2 E θ 2) θ, w ] 2E[( θ 2 E θ 2) θ, w ] + 4w T E[θθT ] w = Var( θ 2 E θ 2) + 4 w, w = const. + 4 w, w because the terms in the middle are zero, e.g. E[( θ 2 E θ 2) θ, w ] = E[ θ 2θ] | {z } =0 , w E θ 2 E[θ] |{z} =0 where the entries of E[ θ 2θ] are zero, because of independence and first moments being zero and third moments being zero. F.2 Proof of Theorem F.2 Proposition F.4 (Characterizing isometries). Let X be a vectorspace and xi, yi X for i = 1, . . . , n, then the following pairs of statements are equivalent 1. (a) xi xj = yi yj for all i, j (b) there exists a translation φ with φ(xi) = yi for all i. In the remainder we further assume X to be a Hilbertspace, 2. (a) xi = yi and xi xj = yi yj for all i, j (b) there exists a linear isometry φ with φ(xi) = yi for all i. 3. (a) xi xj = yi yj for all i, j (b) there exists an (affine) isometry φ with φ(xi) = yi for all i. Proof. (1a) (1b): we define φ(x) := x + (y0 x0), which implies φ(xi) = xi x0 + y0 = (yi y0) + y0 = yi. (1b) (1a): Let φ(x) = x + c for some c. Then we immediately have yi yj = φ(xi) φ(xj) = xi + c (xj + c) = xi xj. (2a) (2b): By the polarization formula, for all i, j xi, xj = xi 2 + yi 2 xi xj 2 2 = yi, yj . We apply the Gram-Schmidt orthonormalization procedure to both xi and yi such that Ukn = span(u1, . . . , ukn) = span(x1, . . . , xn) for orthonormal ui where we skip xm if it is already in Ukm 1 (resulting in km = km 1), and similarly Vkn = span(v1, . . . , vkn) = span(y1, . . . , yn). Since this procedure only uses scalar products, we inductively get xk, uj = yk, vj k, j We now extend ui and vi to orthonormal basis of X and define the linear mapping by its behavior on the basis elements φ : ui 7 vi. Mapping an orthonormal basis to an orthonormal basis is an isometry and we have φ(xk) = φ k X j=1 xk, uj uj j=1 xk, uj φ(uj) = j=1 yk, vj vj (2b) (2a): Isometries preserve distances by definition. This implies xi xj = yi yj . And linear functions map 0 to 0, so we have xi = xi 0 = φ(xi) φ(0) = yi . (3a) (3b): We define xi = xi x0 and similarly for y. In particular, x0 = y0 = 0. Since xi and yi satisfy the requirements of 2, there exists a linear isometry φ with φ( xi) = yi. Then the isometry φ : x 7 φ(x x0) + y0 does the job. (3b) (3a): This is precisely the distance preserving property of Isometries. Theorem F.2 (Characterization of Weak Input Invariances). Let f : Rd R be a random function, then f is 1. weakly stationary, if and only if there exists µ R and function C : Rd R such that for all x, y µf(x) = µ, Cf(x, y) = C(x y). 2. weakly non-stationary isotropic, if and only if there exist functions µ : R 0 R and κ : D R with D = {λ R2 0 R : |λ3| 2 λ1λ2} R3. 
such that for all x, y µf(x) = µ x 2 Cf(x, y) = κ x 2 3. weakly stationary isotropic, if and only if there exists µ R and a function C : R 0 R such that for all x, y µf(x) = µ, Cf(x, y) = C x y 2 Proof. Starting from the mean and covariance function it is easy to check 2-weak non-stationary isotropy. So we only need to check the other direction. The proof is essentially an application of Prop. F.4. For brevity (and since the other two results are well known), we will only prove the weakly non-stationary isotropic case (the other two cases can be proven with minor adjustments to the proof). Without loss of generality, we will find the slightly different representation E[fd(x)] = µ( x ) and Cfd(x, y) = κ( x , y , x, y ), where the domain of κ is given by D = {λ R2 0 R : |λ3| λ1λ2}. The representation of the theorem is then equivalent by a change to µ(λ) := µ λ2 2 and κ(λ1, λ2, λ2) := κ λ2 1 2 , λ2 2 2 , λ3 . First we want to find µ. Let v be some vector (w.l.o.g. v = 1). Then we define µ(r) := E[fd(rv)] Now we need to show that this definition of µ is an appropriate mean function. For this choose any x X. Then for r = x there exists by Prop. F.4 (2.) a non-stationary isometry φ such that φ(x) = rv (we use n = 1). With 1-weak non-stationary isotropy of fd this implies E[fd(x)] = E[fd(rv)] = µ(r) = µ( x ). Next we need to define κ(rx, ry, rxy). For this, choose two orthonormal vectors v, w. For every r = (rx, ry, rxy) D we define x (r) = rxv y (r) = rxy r2y r2 xy r2x w. Where r D ensures |rxy| rxry and thus r2 y r2 xy r2x 0. Then we have x (r) = rx, y (r) = ry, and x (r), x (y) = rxy, (27) and define κ(rx, ry, rxy) := Cfd(x (r), y (r)). Again, we need to show that this kernel does the job. For this, choose any x, y X. For r := ( x , y , x, y ), which is in D by the Cauchy-Schwarz inequality, the induced x (r) and y (r) satisfy by (27) x (r) = x , y (r) = y and x (r) y (r) = x y . By Prop. F.4 (2.) there therefore exists an isometry φ such that φ(x) = x (r) and φ(y) = y (r). By 2-weak input isotropy of fd we conclude Cfd(x, y) isotrop. = Cfd(x (r), y (r)) def. = κ x , y , x, y . G Technical G.1 Conditional Gaussian distribution For the following well known result we found a tidy proof giving insight into the reason it is true, so we wrote it down for your convenience but do not even expect this particular proof to be new. Theorem G.1 (Conditional Gaussian distribution). Let X N(µ, Σ) be a multivariate Gaussian vector where the covariance matrix is a block matrix of the form and Σ = Σ11 Σ12 Σ21 Σ22 then assuming Σ11 is invertible, the conditional distribution of X2 given X1 is X2 | X1 N(µ2|1, Σ2|1), with conditional mean and variance µ2|1 := µ2 + Σ21Σ 1 11 (X1 µ1) Σ2|1 := Σ22 Σ21Σ 1 11 Σ12. Proof. Let X := X µ be the centered version of X. There exists some lower triangular matrix L (even if Σ is only positive semidefinite only not uniquely) such that Σ = LLT (i.e. the Cholesky Decomposition). We can then write without loss of generality X µ =: X1 X2 = L11 0 L21 L22 with independent standard normal Yi, i.e. Y N(0, I). Since Σ11 is invertible, so is L11 and therefore the map from Y1 to X1. Conditioning on X1 is therefore equivalent to conditioning on Y1. But we have X2 = µ2 + X2 = µ2 + L21Y1 | {z } conditional expectation + L22Y2 | {z } conditional distribution So it follows that X2 | X1 N(µ2|1, Σ2|1) with µ2|1 := µ2 + L21Y1 Σ2|1 := L22LT 22. What is left to do, is find a representation for the Lij using the block matrices of Σ. 
For this note Σ = LLT = L11LT 11 L11LT 21 L21LT 11 L22LT 22 + L21LT 21 This implies L21Y1 = (L21LT 11L T 11 )(L 1 11 X1) = Σ21Σ 1 11 (X1 µ1) so we have the desired conditional expectation, and finally L22LT 22 = Σ22 L21LT 21 = Σ22 L21(LT 11 | {z } =Σ21 L T 11 )(L 1 11 | {z } =Σ 1 11 L11)LT 21 | {z } =Σ12 G.2 Covariance of derivatives By Swapping integration and differentiation we have for a centered random function f Cov( xif(x), f(y)) = E[ xif(x)f(y)] = xi E[f(x)f(y)] = xi Cf(x, y) So the covariance of a derivative of f with f is equal to a partial derivative of the covariance function [more details in 1]. Similarly other covariances can be calculated, e.g. Cov( xif(x), yif(y)) = xi yi Cf(x, y). For this reason the derivatives of the covariance function are interesting as they represent the covariance of derivatives. Applying this observation to isotropic covariance functions Cov(f(x), f(y)) = C x y 2 Lemma G.2 (Covariance of derivatives). Let f N(µ, C) and d = x y, then Cov f(y) jf(y) f(x) C( d 2 2 ) C ( d 2 if(x) C ( d 2 2 ) d, ei h C ( d 2 2 ) d, ej d, ei + C ( d 2 2 ) ej, ei i G.3 Constrained linear optimization Let U be a vectorspace. We define the projection of a vector w onto U by PU(w) := argmin v U v w 2 Lemma G.3 (Constrained maximiziation of scalar products). For a linear subspace U Rd, we have max v U v =λ v, w = λ PU(w) (28) argmax v U v =λ v, w = λ PU(w) Before we get to the proof let us note that this immediately results in the following corollary about minimization. Corollary G.4 (Constrained minimization of scalar products). min v U v =λ v, w = λ PU(w) (30) argmin v U v =λ v, w = λ PU(w) Proof of Corollary G.4. The trick is to move one outside from w = ( w) min v U v =λ v, w = max v U v =λ v, w = λ PU(w) where we have used in the last equation that the projection is linear (we can move the minus sign out) and the norm removes the inner minus sign. The argmin argument is similar. Proof of Lemma G.3. Step 1: We claim that v = λ PU(w) results in the value v , w = λ PU(w) . For this we consider PU(w) = argmin v U v w 2 | {z } = v 2 2 v,w + w 2 (32) = argmin v U v 2 2 v, w | {z } =:f(v) we know that t 7 f(t w U) is minimized at t = 1 by the definition of w U. The first order condition implies dt = 2t PU(w) 2 2 PU(w), w 1 = t = PU(w), w Multiplying both sides by λ PU(w) finishes this step λ PU(w) = D λ PU(w) PU(w) | {z } =v , w E . (33) Step 2: By (33), we know that we can achieve the value we claim to be the maximum (and know the location v to do so). So if we prove that we can not exceed this value, then it is a maximum and v is the argmax. This would finish the proof. What remains to be shown is therefore v, w λ PU(w) v U : v = λ. Let v U with v = λ. Then for any µ R we can plug µv into f from (32) to get µ2λ2 2µ v, w = f(µv) (32) f(PUw) = PU(w) 2 2 PUw, w = PUw, w where the last equation follows from (33) with λ = PUw . Reordering we get for all µ PUw, w + µ2λ2 2µ v, w We now select µ = PUw λ > 0 and divide both sides by µ to get 2 v, w DPU(w) µ | {z } =v (33) = λ PUw +λ PU(w) = 2λ PUw Dividing both sides by 2 yields the claim. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. 
A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 8 recapitulates all the assumptions made and highlights possible generalizations. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Section D provides all proofs of statements made in the main body and follows an identical structure for easier cross reference. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. 
Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: While we do not have the necessary space to discuss all implementation details, we believe that we have discussed all relevant insights necessary to reproduce our results. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code of our algorithm is fairly well commented and passes pylint and flake8 linting. The code to perform the benchmarks is less polished and we have not seeded the covariance estimation sampling process, but, since we obtained similar results over multiple runs (Section A.1.1), we are confident that our results are reproducible. Guidelines: The answer NA means that paper does not include experiments requiring code. 
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: Training on the MNIST data set is fairly standard, so we feel that our brief outline is sufficient.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: Quantiles are plotted in Figure 3 and the figures of Section A, and we provide a histogram of asymptotic learning rates resulting from multiple covariance estimation runs (Section A.1.1).

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [No]

Justification: We have not kept careful track of resources used, as MNIST is a fairly small dataset by machine learning standards. We believe the compute was comparatively negligible, although the use of multiple GPUs was helpful for repeating experiments in parallel.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]

Justification: See broader impacts.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: Since we focus on optimization theory, our work has no societal impact beyond the advancement of the field of Machine Learning, which may have many societal consequences, but none we feel necessary to address.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We cite the datasets and models used.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.