Random Function Descent

Felix Benning (University of Mannheim, felix.benning@uni-mannheim.de)
Leif Döring (University of Mannheim, leif.doering@uni-mannheim.de)

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

Classical worst-case optimization theory neither explains the success of optimization in machine learning, nor does it help with step size selection. In this paper we demonstrate the viability and advantages of replacing the classical convex function framework with a random function framework. With complexity $O(n^3 d^3)$, where $n$ is the number of steps and $d$ the number of dimensions, Bayesian optimization with gradients has not been viable in large dimension so far. By bridging the gap between Bayesian optimization (i.e. random function optimization theory) and classical optimization we establish viability. Specifically, we use a stochastic Taylor approximation to rediscover gradient descent, which is scalable in high dimension due to $O(nd)$ complexity. This rediscovery yields a specific step size schedule we call Random Function Descent (RFD). The advantage of this random function framework is that RFD is scale invariant and that it provides a theoretical foundation for common step size heuristics such as gradient clipping and gradual learning rate warmup.

1 Introduction

Cost function minimization is one of the most fundamental mathematical problems in machine learning. Gradient-based methods, popular for this task, require a step size, typically chosen using established heuristics. This article aims to deepen the theoretical understanding of these heuristics and proposes a new algorithm based on this insight.

Classical optimization theory uses L-smoothness, which limits the rate of change of the gradient by $L$, to provide convergence guarantees for learning rates smaller than $1/L$ [e.g. 38]. As this theory is based on an upper bound (the worst case), the learning rate $1/L$ is naturally much more conservative than necessary on average. Even if $L$ were known, this learning rate would therefore be impractical. Since line search algorithms typically require access to full cost function evaluations, the field of machine learning (ML) relies heavily on step size heuristics [e.g. 48, 49, 42, 20]. To investigate these heuristics, we introduce new ideas based on a random function perspective. While automatic step size selection in the convex function framework is possible [11], convexity is generally only satisfied asymptotically and locally. So the understanding of the initial stages of optimization, which includes the warmup heuristic [20], greatly benefits from a framework which also admits non-convex functions. This objective is achieved by the random function framework we investigate.

Many successful algorithms in computer science are significantly slower in the worst case than in the average case based on a probabilistic framework (e.g. Quicksort [23] or the simplex algorithm [e.g. 6]). On random quadratic functions the average case behavior of first order optimizers is already being investigated by the ML community [e.g. 58, 43, 33, 12, 9, 40, 41]. Interested in the landscape of high dimensional random functions as a model for spin glasses, the physics community independently started studying the average case of optimization as well [e.g. 4, 15, 37, 51, 24], albeit not geared towards ML algorithms.

Average case analysis fundamentally requires a prior distribution over possible cost functions.
The evaluations seen so far then result in a posterior over the cost at other parameter inputs. Using this posterior for optimization is called Bayesian optimization (BO) [e.g. 32, 47, 16, 2], which is best known in the ML community in the context of low dimensional optimization (e.g. hyperparameter tuning). BO is treated like a zero order method for low dimensional problems due to the $O(n^3)$ complexity of the covariance matrix inversion over the $n$ evaluations seen so far, which increases to $O(n^3 d^3)$ when gradient information is included [e.g. 35, 55], where $d$ is the input dimension of our cost function. This limits classic BO to relatively small dimensions even under sparsity considerations [e.g. 45, 39]. While the BO algorithms developed in the random function framework might not have been viable in high dimension so far, due to their computational complexity, this framework is already used to explain the high relative frequency of saddle points in high dimension [10] and to explain the highly predictable progress optimizers make on high dimensional cost functions [5].

In this work we bridge the gap between BO and (computationally viable) gradient based methods, derived from the first Taylor approximation, with the introduction of a stochastic Taylor approximation based on a forgetful BO posterior. The optimization method Random Function Descent (RFD), resulting from the minimization of this stochastic Taylor approximation, coincides with a specific form of gradient descent, which establishes its viability in high dimension. The advantages of its BO heritage are scale invariance and an explicit step size schedule, which illuminates the inner workings of step size heuristics such as gradient clipping [42] and gradual learning rate warmup [20].

Our contributions and outline. The main goal of this paper is to demonstrate the viability and advantages of replacing the classical convex function framework with a random function framework. Theorem 4.2 is the main theoretical result establishing viability (computability and scalable complexity) for a given covariance model. Section 6 is concerned with practical estimation of the covariance model, and viability is demonstrated with a practical example in the MNIST case study (Section 7). The advantages of this approach are scale invariance (Advantage 2.3) and an explicit step size schedule, which does not require expensive tuning and explains existing ML heuristics such as warmup (cf. Section 5.2). This explanation of the initial stage of optimization could never be delivered by the convex framework, because the convexity assumption is not fulfilled initially, so it can at best explain asymptotic behavior.

Sec. 2 We motivate a stochastic Taylor approximation and RFD and prove its scale invariance.
Sec. 3 We briefly motivate and discuss the common distributional assumptions in BO.
Sec. 4 We establish the connection between RFD and gradient descent.
Sec. 5 We investigate the step size schedule suggested by RFD. In particular we
  0. calculate explicit formulas for the step size schedules resulting from common covariance models (Table 1, Sec. C),
  1. analyze the general asymptotic behavior (Sec. 5.1),
  2. discuss how RFD explains gradient clipping and learning rate warmup (Sec. 5.2).
Sec. 6 We develop a non-parametric variance estimation method, which is robust with respect to the choice of covariance kernel. Finally, we present an extension of RFD to mini-batch losses.
Sec. 7 We conduct a case study on the MNIST dataset.
Sec. 8 We discuss extensions (see also Sec. E) and limitations.
2 The random function descent algorithm

The classic derivation of gradient descent [e.g. 38, p. 29] adds an L-smoothness based trust bound to the first Taylor approximation $T[\mathbf{J}(\theta) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)]$ of the cost function $\mathbf{J}$ around $w$, resulting in the gradient step
$$ w - \tfrac{1}{L}\nabla\mathbf{J}(w) = \operatorname*{argmin}_\theta \Big\{ T[\mathbf{J}(\theta) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)] + \tfrac{L}{2}\|\theta - w\|^2 \Big\}. $$
Our unusual notation for the Taylor approximation $T[\mathbf{J}(\theta) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)]$ is meant to highlight the connection to the stochastic Taylor approximation we define below.

Definition 2.1 (Stochastic Taylor approximation). We define the first order stochastic Taylor approximation of a random (cost) function¹ $\mathbf{J}$ around $w$ by the conditional expectation $\mathbb{E}[\mathbf{J}(\theta) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)]$. This is the best $L^2$ approximation [30, Cor. 8.17] of $\mathbf{J}(\theta)$ provided first order knowledge of $\mathbf{J}$ at $w$.

¹ Remark on terminology: stochastic process [e.g. 53], random field [e.g. 1] and random function [e.g. 36] are all synonyms. However, the latter seems most descriptive of random variables in the set of functions. Gaussian processes are naturally Gaussian stochastic processes, i.e. Gaussian random functions. To better distinguish random functions from deterministic functions, we use bold letters to denote random functions (as the usual convention of capitalizing random variables often clashes with other conventions for functions).

Figure 1: The stochastic Taylor approximation naturally contains a trust bound, in contrast to the classical one. Here $\mathbf{J}$ is a Gaussian random function (with covariance as in Equation (11), with length scale $s = 2$ and variance $\sigma^2 = 1$). The ribbon represents two conditional standard deviations around the conditional expectation.

We call this the stochastic Taylor approximation because this approximation only makes use of derivatives in a single point. While the standard Taylor approximation is a polynomial approximation, the stochastic Taylor approximation is the best approximation in an $L^2$ sense and already mean-reverting by itself, i.e. it naturally incorporates covariance-based trust (cf. Figure 1). While L-smoothness-based trust guarantees that the gradient still points in the direction we are going (for learning rates smaller than $1/L$), covariance-based trust tells us whether the derivative is still negative on average. Minimizing the stochastic Taylor approximation is therefore optimized for the average case. Since convergence proofs for gradient descent typically rely on an improvement guarantee, proving convergence is significantly harder in the average case and we answer this question only partially in Corollary 5.3.

Definition 2.2 (Random Function Descent, RFD). Select $w_{n+1}$ as the minimizer² of the first order stochastic Taylor approximation
$$ w_{n+1} := \operatorname*{argmin}_w \mathbb{E}[\mathbf{J}(w) \mid \mathbf{J}(w_n), \nabla\mathbf{J}(w_n)]. $$

² We ignore throughout the main body that the argmin could be set-valued and that the $w_n$ would be random variables (cf. Section D.1.1 for a formal approach).

Properties of RFD. Before we make RFD more explicit in Section 4, we discuss some properties which are easier to see in the abstract form. First, observe that RFD is greedy and forgetful in the same way gradient descent is greedy and forgetful when derived as the minimizer of the regularized first Taylor approximation, or the Newton method as the minimizer of the second Taylor approximation. This is because the Taylor approximation only uses derivatives from the last point $w_n$ (forgetful), and we minimize this approximation (greedy). Since momentum methods retain some information about past gradients, they are not as forgetful. We therefore expect a similar improvement could be made for RFD in the future.

Second, it is well known that classical gradient descent with exogenous step sizes (and most other first order methods) lacks the scale invariance property of the Newton method [e.g. 21, 13]. Scale invariance means that scaling the input parameters $w$ or the cost itself (e.g. by switching from the mean squared error to the sum squared error) does not change the points selected by the optimization method.

Advantage 2.3 (Scale invariance). RFD is invariant to additive shifts and positive scaling of the cost $\mathbf{J}$. RFD is also invariant with respect to transformations of the parameter input of $\mathbf{J}$ by differentiable bijections whose Jacobian is invertible everywhere (e.g. invertible linear maps).

While invariance to bijections of inputs is much stronger than the affine invariance offered by the Newton method, non-linear bijections will typically break the isotropy assumption of the following section, which makes RFD explicit. This invariance should therefore be viewed as an opportunity to look for the bijection of inputs which ensures isotropy (e.g. a whitening transformation). The discussion of geometric anisotropy in Section E.1 is conducive to building an understanding of this.
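To make Definition 2.1 concrete, here is a minimal numerical sketch (not the paper's implementation) of the first order stochastic Taylor approximation for a one-dimensional Gaussian random function with squared exponential covariance, as in Figure 1; the parameter values, the evaluation grid and the observed values `j0`, `g0` are illustrative assumptions.

```python
import numpy as np

# Illustrative parameters (not from the paper's code): prior mean mu,
# squared exponential covariance with variance sigma2 and length scale s.
mu, sigma2, s = 0.0, 1.0, 2.0
w, j0, g0 = 0.0, -0.3, 0.8          # observed cost and derivative at w (1d example)

theta = np.linspace(-6, 6, 401)      # where we evaluate the stochastic Taylor approximation

# Covariances implied by Cov(J(x), J(y)) = C(|x-y|^2/2) with C(h) = sigma2*exp(-h/s^2):
k_val  = sigma2 * np.exp(-(theta - w) ** 2 / (2 * s**2))   # Cov(J(theta), J(w))
k_grad = k_val * (theta - w) / s**2                        # Cov(J(theta), J'(w))
var_val, var_grad = sigma2, sigma2 / s**2                  # Var(J(w)), Var(J'(w))

# J(w) and J'(w) are uncorrelated, so the conditional expectation splits into two terms.
cond_mean = mu + k_val / var_val * (j0 - mu) + k_grad / var_grad * g0
cond_var  = sigma2 - k_val**2 / var_val - k_grad**2 / var_grad

print(cond_mean[:3], cond_mean[-3:])   # far from w, the approximation reverts to mu
print(theta[np.argmin(cond_mean)])     # the (one-dimensional) RFD step of Definition 2.2
```

The mean reversion towards $\mu$ away from $w$ is exactly the covariance-based trust discussed above, and the argmin of the conditional expectation over the grid is the one-dimensional analogue of the RFD step of Definition 2.2.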
3 A distribution over cost functions

It is impossible to make average case analysis explicit without a distribution over functions, so we use the canonical distributional assumption of Bayesian optimization [e.g. 16, 55, 44]: isotropic Gaussian random functions. This assumption was also used in the high dimensional setting by Dauphin et al. [10] to argue that saddle points are much more common than minima in high dimension, which is often cited to explain why second order methods are uncommon in machine learning.

To motivate isotropy, we note that in average case analysis the uniform distribution is popular, since it weighs all problem instances equally (e.g. all possible permutations in sorting). Isotropy is such a uniformity assumption, which essentially requires
$$ \text{"}P(\mathbf{J} = J) = P(\mathbf{J} = J \circ \varphi)\text{"} \qquad \text{for all isometries } \varphi. $$
In other words, the probability that our cost function is equal to $J$ is equal to the probability that it is equal to a shifted and turned version of $J$, given by $J \circ \varphi$. Since the probability of any single realization of a cost function $J$ is zero, the equation we put in quotes is mathematically unsound. The formal definition follows below.

Definition 3.1 (Isotropy). A random function $\mathbf{J}$ is called isotropic if its distribution stays the same under isometric transformations of its input, i.e. for any isometry $\varphi$ we have $\mathbf{J} \circ \varphi \overset{d}{=} \mathbf{J}$.

If $\mathbf{J}$ is Gaussian, isotropy is well known [e.g. 44, 1] to be equivalent to the condition that there exist $\mu \in \mathbb{R}$ and a function $C : [0, \infty) \to \mathbb{R}$ such that for all $w, \tilde w \in \mathbb{R}^d$ the expectation and covariance are
$$ \mathbb{E}[\mathbf{J}(w)] = \mu, \qquad \operatorname{Cov}\big(\mathbf{J}(w), \mathbf{J}(\tilde w)\big) = C\Big(\tfrac{\|w - \tilde w\|^2}{2}\Big). $$
For these isotropic Gaussian random functions we use the notation $\mathbf{J} \sim \mathcal{N}(\mu, C)$. We discuss generalizations of isotropy in Sections F and E.1, but for ease of exposition we retain the (stationary) isotropy assumption throughout the main body. Note that the Gaussian assumption can be statistically tested in practice (cf. Figure 4), but it is also straightforward to reproduce our results with the best linear unbiased estimator (BLUE) (Section E.3) in place of the conditional expectation to remove the Gaussian assumption.

We finally want to highlight that, in contrast to the uniformity assumption on finite sets, isotropic Gaussian random functions leave us with a family of plausible distributions. It is therefore necessary to estimate $\mu$ and $C$, which is the topic of Section 6.
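As a small illustration of Definition 3.1, the following sketch (with arbitrary toy parameters, not the paper's code) draws one realization of an isotropic Gaussian random function $\mathbf{J} \sim \mathcal{N}(\mu, C)$ on a grid directly from the covariance kernel $C(\|w - \tilde w\|^2/2)$.

```python
import numpy as np

# Minimal sketch: sample one cost landscape J ~ N(mu, C) on a 2d grid, where
# Cov(J(w), J(w')) = C(||w - w'||^2 / 2) with the squared exponential C(h) = sigma2*exp(-h/s^2).
rng = np.random.default_rng(0)
mu, sigma2, s = 1.0, 1.0, 0.5

xs = np.linspace(-2, 2, 30)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)      # 900 input points w

sq_dist = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(-1)    # ||w - w'||^2
K = sigma2 * np.exp(-sq_dist / (2 * s**2))                        # C(||w - w'||^2 / 2)

L = np.linalg.cholesky(K + 1e-9 * np.eye(len(grid)))              # jitter for stability
sample = mu + L @ rng.standard_normal(len(grid))                  # one realization of J
print(sample.mean(), sample.std())                                # roughly mu and sqrt(sigma2)
```

Because the covariance matrix only depends on pairwise distances, any rotation or shift of the grid leaves the distribution of the sample unchanged, which is exactly the isometry invariance required by Definition 3.1.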
4 Relation to gradient descent

While we were able to define RFD abstractly without any assumptions on the distribution $P_{\mathbf{J}}$ of the random cost $\mathbf{J}$, an explicit calculation requires distributional assumptions, and we have motivated isotropic Gaussian random functions in Section 3 for this purpose. The assumption of isotropy allows for an explicit version of the stochastic Taylor approximation, which then immediately leads to an explicit version of RFD.

Lemma 4.1 (Explicit first order stochastic Taylor approximation). For $\mathbf{J} \sim \mathcal{N}(\mu, C)$, the first order stochastic Taylor approximation is given by
$$ \mathbb{E}[\mathbf{J}(w - d) \mid \mathbf{J}(w), \nabla\mathbf{J}(w)] = \mu + \frac{C(\|d\|^2/2)}{C(0)}\,\big(\mathbf{J}(w) - \mu\big) - \frac{C'(\|d\|^2/2)}{C'(0)}\,\langle d, \nabla\mathbf{J}(w)\rangle. $$

The explicit version of RFD follows by fixing the step size $\eta = \|d\|$ and optimizing over the direction first.

Theorem 4.2 (Explicit RFD). Let $\mathbf{J} \sim \mathcal{N}(\mu, C)$; then RFD coincides with gradient descent
$$ w_{n+1} = w_n - \eta^*_n \frac{\nabla\mathbf{J}(w_n)}{\|\nabla\mathbf{J}(w_n)\|}, $$
where the RFD step sizes are given by
$$ \eta^*_n := \operatorname*{argmin}_{\eta \in \mathbb{R}} \; \frac{C(\eta^2/2)}{C(0)}\,\big(\mathbf{J}(w_n) - \mu\big) - \eta\,\frac{C'(\eta^2/2)}{C'(0)}\,\|\nabla\mathbf{J}(w_n)\|. \tag{1} $$

Table 1: RFD step size (cf. Figure 2 and Eq. (11), (13), (14) for the formal definitions of the models). In particular, $s$ is the length scale in all covariance models.

| Model | RFD step size $\eta^*$ (general case, $\Theta := \frac{\|\nabla\mathbf{J}(w)\|}{\mu - \mathbf{J}(w)}$) | $\mathbf{J}(w) = \mu$ ($\Theta = \infty$) | A-RFD ($\Theta \to 0$) |
|---|---|---|---|
| Matérn ($\nu = 3/2$) | $\frac{s^2\Theta}{3 + \sqrt{3}\, s\Theta}$ | $\frac{s}{\sqrt{3}} \approx 0.58\, s$ | $\frac{1}{3} s^2\Theta$ |
| Matérn ($\nu = 5/2$) | $\frac{s}{\sqrt{5}} \cdot \frac{(1-\zeta) + \sqrt{4 + (1+\zeta)^2}}{2(1+\zeta)}$ with $\zeta := \frac{\sqrt{5}}{3 s\Theta}$ | $\approx 0.72\, s$ | $\frac{3}{5} s^2\Theta$ |
| Squared exponential | $\frac{s^2 \|\nabla\mathbf{J}(w)\|}{\sqrt{\big(\frac{\mu - \mathbf{J}(w)}{2}\big)^2 + s^2\|\nabla\mathbf{J}(w)\|^2} \;+\; \frac{\mu - \mathbf{J}(w)}{2}}$ | $s$ | $s^2\Theta$ |
| Rational quadratic | $\sqrt{\beta}\, s\, \tilde\eta^*$, with $\tilde\eta^*$ the unique positive root of $\frac{\sqrt{\beta}}{s\Theta}\tilde\eta(1+\tilde\eta^2) + (1+\beta)\tilde\eta^2 - 1$ (cf. Thm. C.4) | $s\sqrt{\tfrac{\beta}{1+\beta}}$ | $s^2\Theta$ |

While the descent direction is a universal property for all isotropic Gaussian random functions, it follows from (1) that the step sizes depend much more on the specific covariance structure. In particular they depend on the decay rate of the covariance acting as the trust bound.

Remark 4.3 (Scalable complexity). While Bayesian optimization typically has computational complexity $O(n^3 d^3)$ in the number of steps $n$ and dimensions $d$ [55, 45], RFD under the isotropy assumption has the same computational complexity as gradient descent (i.e. $O(nd)$).

Remark 4.4 (Step until the given information is no longer informative). While L-smoothness-based trust prescribes step sizes that guarantee the slope to point downwards over the entire step, RFD prescribes steps which are exactly large enough that the gradient is no longer correlated to the previously observed evaluations. This is because the first order condition at the minimizer $w = w_{n+1}$ demands
$$ 0 \overset{!}{=} \nabla_w\, \mathbb{E}[\mathbf{J}(w) \mid \mathbf{J}(w_n), \nabla\mathbf{J}(w_n)] = \mathbb{E}[\nabla\mathbf{J}(w) \mid \mathbf{J}(w_n), \nabla\mathbf{J}(w_n)]. $$
And for measurable functions $\phi : \mathbb{R}^{d+1} \to \mathbb{R}$ such that $\Phi = \phi(\mathbf{J}(w_n), \nabla\mathbf{J}(w_n))$ is sufficiently integrable, $\Phi$ is then uncorrelated from $\partial_i \mathbf{J}(w_{n+1})$ by the first order condition
$$ \operatorname{Cov}\big(\partial_i \mathbf{J}(w_{n+1}), \Phi\big) = \mathbb{E}\Big[ \underbrace{\mathbb{E}[\partial_i \mathbf{J}(w_{n+1}) \mid \mathbf{J}(w_n), \nabla\mathbf{J}(w_n)]}_{=0}\,\big(\Phi - \mathbb{E}[\Phi]\big) \Big] = 0. $$

5 The RFD step size schedule

While classical theory leads to learning rates, RFD suggests step sizes applied to normalized gradients, representing the actual length of the step. In the following we thus make the distinction
$$ w_{n+1} = w_n - \underbrace{h_n}_{\text{learning rate}} \nabla\mathbf{J}(w_n) = w_n - \underbrace{\eta_n}_{\text{step size}} \frac{\nabla\mathbf{J}(w_n)}{\|\nabla\mathbf{J}(w_n)\|}. $$

To get a better feel for the step sizes suggested by RFD, it is enlightening to divide (1) by $\mu - \mathbf{J}(w_n)$, which results in a minimization problem
$$ \eta^* := \eta^*(\Theta) := \operatorname*{argmin}_\eta q_\Theta(\eta) \quad \text{for} \quad q_\Theta(\eta) := -\frac{C(\eta^2/2)}{C(0)} - \eta\,\frac{C'(\eta^2/2)}{C'(0)}\,\Theta, \tag{2} $$
which is parametrized only by the gradient cost quotient $\Theta_n := \frac{\|\nabla\mathbf{J}(w_n)\|}{\mu - \mathbf{J}(w_n)}$, i.e. $\eta^*_n = \eta^*(\Theta_n)$. This minimization problem can be solved explicitly for the most common [44, ch. 4] differentiable isotropic covariance models, see Table 1, Figure 2 and Appendix C for details.
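The following minimal sketch applies the explicit RFD update of Theorem 4.2 with the squared exponential step size from Table 1 to a toy quadratic cost; the cost function and the values of `mu` and `s` are illustrative assumptions (in practice $\mu$, $C(0)$ and $C'(0)$ are estimated as in Section 6).

```python
import numpy as np

# Minimal sketch of the explicit RFD update (Theorem 4.2) with the squared
# exponential step size of Table 1. The toy cost and the values of mu and s are
# illustrative assumptions, not the paper's implementation.
def rfd_step_size_se(cost, grad_norm, mu, s):
    gap = mu - cost                              # mu - J(w); roughly 0 at initialization
    return s**2 * grad_norm / (np.sqrt(gap**2 / 4 + s**2 * grad_norm**2) + gap / 2)

def toy_cost_and_grad(w):                        # stand-in cost, not from the paper
    return 0.5 * np.dot(w, w), w

mu, s = 21.0, 1.0                                # assumed mean cost (above J(w_0)) and scale
w = np.full(10, 2.0)
for _ in range(50):
    cost, grad = toy_cost_and_grad(w)
    gnorm = np.linalg.norm(grad)
    eta = rfd_step_size_se(cost, gnorm, mu, s)   # step length from Table 1
    w = w - eta * grad / gnorm                   # normalized gradient step (Theorem 4.2)
print(toy_cost_and_grad(w)[0])                   # cost after 50 RFD steps
```

The update only needs the current cost, the gradient and its norm, which is where the $O(nd)$ complexity of Remark 4.3 comes from.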
Figure 2: RFD step sizes as a function of $\Theta = \frac{\|\nabla\mathbf{J}(w)\|}{\mu - \mathbf{J}(w)}$, assuming scale $s = 1$ (cf. Table 1). A-RFD (Definition 5.1) is plotted as dashed lines. A-RFD of the rational quadratic coincides with A-RFD of the squared exponential covariance.

Figure 2 can be interpreted as follows: At the start of optimization, the cost $\mathbf{J}(w)$ should be roughly equal to the average cost $\mu$, so the gradient cost quotient $\Theta$ is infinite and the step sizes are therefore given by $\eta^*(\infty)$ (also listed in its own column in Table 1). As we start minimizing, the difference $\mu - \mathbf{J}(w)$ becomes positive. Towards the end of minimization this difference no longer changes as the cost no longer decreases, i.e. towards the end the gradient cost quotient $\Theta$ is roughly linear in the gradient norm $\|\nabla\mathbf{J}(w)\|$. The derivative $\frac{d}{d\Theta}\eta^*(0)$ of $\eta^*(\Theta)$ at zero then effectively results in a constant asymptotic learning rate.

5.1 Asymptotic learning rate

To explain the claim above, note that the gradient cost quotient $\Theta$ converges to zero towards the end of optimization, because the gradient norm converges to zero. A first order Taylor expansion of $\eta^*$ would therefore imply
$$ \eta^*(\Theta) \approx \eta^*(0) + \tfrac{d}{d\Theta}\eta^*(0)\,\Theta = \underbrace{\frac{\tfrac{d}{d\Theta}\eta^*(0)}{\mu - \mathbf{J}(w)}}_{\text{asymptotic learning rate}}\,\|\nabla\mathbf{J}(w)\|, $$
assuming $\eta^*(0) = 0$ and differentiability of $\eta^*$, which is a reasonable educated guess based on the examples in Figure 2. But since the RFD step sizes $\eta^*$ are abstractly defined as an argmin, it is necessary to formalize this intuition for general covariance models. First, we define asymptotic step sizes as an object towards which we can prove convergence. Then we prove convergence, showing they are well defined. In addition, we obtain a more explicit formula for the asymptotic learning rate.

Definition 5.1 (A-RFD). We define the step sizes of asymptotic RFD (A-RFD) to be the minimizer of the second order Taylor approximation $T_2 q_\Theta$ of $q_\Theta$ around zero
$$ \hat\eta(\Theta) := \operatorname*{argmin}_\eta T_2 q_\Theta(\eta) = -\frac{C(0)}{C'(0)}\,\Theta = \underbrace{\frac{C(0)}{C'(0)\,(\mathbf{J}(w) - \mu)}}_{\text{asymptotic learning rate}}\,\|\nabla\mathbf{J}(w)\|. $$

In the following we prove that these are truly asymptotically equal to the step sizes $\eta^*$ of RFD.

Proposition 5.2 (A-RFD is well defined). Let $\mathbf{J} \sim \mathcal{N}(\mu, C)$ and assume there exists $\eta_0 > 0$ such that the correlation for larger distances $\eta \ge \eta_0$ is bounded away from 1, i.e. $\frac{C(\eta^2/2)}{C(0)} \le \rho$ for some $\rho \in (0, 1)$. Then the step sizes of RFD are asymptotically equal to the step sizes of A-RFD, i.e. $\hat\eta(\Theta) \sim \eta^*(\Theta)$ as $\Theta \to 0$.

Note that the assumption is essentially always satisfied, since the Cauchy-Schwarz inequality implies
$$ C\Big(\tfrac{\|w - \tilde w\|^2}{2}\Big) = \operatorname{Cov}\big(\mathbf{J}(w), \mathbf{J}(\tilde w)\big) \le \sqrt{\operatorname{Var}(\mathbf{J}(w))\,\operatorname{Var}(\mathbf{J}(\tilde w))} = C(0), $$
where equality requires the random variables to be almost surely equal [30]. If the random function is not periodic or constant, this inequality will generally be strict. In the proof, this requirement is only needed to ensure that $\eta^*$ is not very large. The smallest local minimum of $q_\Theta$ is always close to $\hat\eta$ even without this assumption (which ensures it is a global minimum). Figure 2 illustrates that $\eta^* \to 0$ should imply $\Theta \to 0$, resulting in a weak convergence guarantee.

Corollary 5.3. Assume $\eta^* \to 0$ implies $\Theta \to 0$, the cost $\mathbf{J}$ is bounded, has continuous gradients and RFD converges to some point $w_\infty$. Then $w_\infty$ is a critical point and the RFD step sizes $\eta^*$ are asymptotically equal to $\hat\eta$.

For the squared exponential covariance model we formally prove that $\eta^*$ is strictly monotonically increasing in $\Theta$ and thus $\eta^* \to 0$ implies $\Theta \to 0$ (Prop. C.3). The bounded and continuous gradients assumptions are almost surely satisfied for all sufficiently smooth covariance functions [cf. 1], where three times differentiable is more than enough smoothness.
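A quick numerical check of Proposition 5.2 for the squared exponential model (the values of $s$ and $\Theta$ are arbitrary illustrative choices):

```python
import numpy as np

# The RFD step size of Table 1 (written in terms of Theta) approaches the
# A-RFD step size s^2 * Theta as the gradient cost quotient Theta -> 0.
s = 2.0
for theta in [10.0, 1.0, 0.1, 0.01, 0.001]:
    eta_rfd = s**2 * theta / (np.sqrt(0.25 + (s * theta) ** 2) + 0.5)  # eta*(Theta)
    eta_arfd = s**2 * theta                                            # A-RFD (Definition 5.1)
    print(theta, eta_rfd, eta_arfd, eta_rfd / eta_arfd)                # ratio -> 1
```

The ratio approaches 1 as $\Theta \to 0$, while for large $\Theta$ the RFD step size plateaus near $\eta^*(\infty) = s$ instead of growing linearly like A-RFD.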
5.2 RFD step sizes explain common step size heuristics

Asymptotically, RFD suggests constant learning rates, similar to the classical L-smooth setting. We thus define these asymptotic learning rates (as the limit of the learning rates $h_n$ of iteration $n$) to be
$$ h_\infty := \frac{C(0)}{C'(0)\,\big(\mathbf{J}(w_\infty) - \mu\big)}, \tag{3} $$
where $\mathbf{J}(w_\infty)$ is the cost we reach in the limit. If we used these asymptotic learning rates from the start, step sizes would become too large for large gradients, as RFD step sizes exhibit a plateau (cf. Figure 2). To emulate the behavior of RFD with a piecewise linear function, we could introduce a cutoff whenever our step size exceeds the initial step size $\eta^*(\infty)$, i.e.
$$ w_{n+1} = w_n - \min\Big\{ h_\infty,\ \frac{\eta^*(\infty)}{\|\nabla\mathbf{J}(w_n)\|} \Big\}\,\nabla\mathbf{J}(w_n). \qquad \text{(gradient clipping)} $$
At this point we have rediscovered gradient clipping [42]. Since the rational quadratic covariance has the same asymptotic learning rate $h_\infty$ for every $\beta$, its parameter $\beta$ controls the step size bound $\eta^*(\infty)$ of gradient clipping (cf. Table 1, Figure 2).

Pascanu et al. [42] motivated gradient clipping with the geometric interpretation of movement towards a wall placed behind the minimum. This suggests that clipping should happen towards the end of training. This stands in contrast to a more recent step size heuristic, (linear) warmup [20], which suggests smaller learning rates at the start (i.e. $h_0 = \eta^*(\infty)/\|\nabla\mathbf{J}(w_0)\|$) and a gradual ramp-up to the asymptotic learning rate $h_\infty$. In other words, gradients are not clipped due to some wall next to the minimum, but because the step sizes would otherwise be too large at the start. Goyal et al. [20] further observe that constant warmup (i.e. a step change of learning rates akin to gradient clipping) performs worse than gradual warmup. Since RFD step sizes suggest this gradual increase, we argue that they may have discovered RFD step sizes empirically (also cf. Figure 3).
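The following sketch contrasts the piecewise linear gradient clipping emulation above with the exact squared exponential RFD step size from Table 1; all numbers are illustrative assumptions rather than measured values.

```python
import numpy as np

# Compare the clipped step length min(h_inf, eta*(inf)/|grad|)*|grad| with the
# exact RFD step length for the squared exponential model (Section 5.2, Table 1).
mu, s, cost_limit = 1.0, 1.0, 0.0                  # prior mean, length scale, J(w_inf)
h_inf = s**2 / (mu - cost_limit)                   # asymptotic learning rate, eq. (3) for SE
eta_init = s                                       # eta*(infinity) for SE, cf. Table 1

for grad_norm in [10.0, 3.0, 1.0, 0.3, 0.1]:
    clipped = min(h_inf, eta_init / grad_norm) * grad_norm           # clipped step length
    theta = grad_norm / (mu - cost_limit)                            # gradient cost quotient
    rfd = s**2 * theta / (np.sqrt(0.25 + (s * theta) ** 2) + 0.5)    # exact RFD step length
    print(f"|grad|={grad_norm:5.1f}  clipped step={clipped:.3f}  RFD step={rfd:.3f}")
```

For large gradients both rules take steps of roughly the initial length $\eta^*(\infty)$, while for small gradients both reduce to the asymptotic learning rate $h_\infty$ applied to the gradient; RFD simply interpolates smoothly between the two regimes, which is the gradual warmup behavior discussed above.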
6 Mini-batch loss and covariance estimation

Since we do not have access to evaluations of the cost $\mathbf{J}$ in practice, we need to prove some results about stochastic losses $\boldsymbol\ell_i$ before we can apply RFD in practice. For this, assume that we have independent identically distributed (iid) data $X_i$, independent of the true relationship $\mathbf{f}$ drawn from $P_{\mathbf{f}}$, resulting in labels $Y_i = \mathbf{f}(X_i) + \varsigma_i$, where we have added independent iid noise $\varsigma_i$, resulting in loss and cost
$$ \boldsymbol\ell_i(w) := \ell\big(w, (X_i, Y_i)\big) \quad \text{and} \quad \mathbf{J}(w) := \mathbb{E}[\boldsymbol\ell_i(w) \mid \mathbf{f}]. $$
In this setting we confirm (cf. Lemma D.9) that the stochastic approximation errors $\boldsymbol\epsilon_i(w) := \boldsymbol\ell_i(w) - \mathbf{J}(w)$ are independent conditional on the true relationship $\mathbf{f}$. In particular they (and all their derivatives) are uncorrelated and also uncorrelated from $\mathbf{J}$. It follows that mini-batch losses
$$ \mathbf{L}_b(w) := \frac{1}{b}\sum_{i=1}^b \boldsymbol\ell_i(w) = \mathbf{J}(w) + \frac{1}{b}\sum_{i=1}^b \boldsymbol\epsilon_i(w) \tag{4} $$
have variance
$$ \operatorname{Var}(\mathbf{L}_b(w)) = \operatorname{Var}(\mathbf{J}(w)) + \tfrac{1}{b}\operatorname{Var}(\boldsymbol\epsilon_1(w)) \overset{\text{isotropy}}{=} C(0) + \tfrac{1}{b} C_\epsilon(0), \tag{5} $$
where we assume $\mathbf{J} \sim \mathcal{N}(\mu, C)$ and $\boldsymbol\epsilon_i \sim \mathcal{N}(0, C_\epsilon)$ in the last equation for simplicity. But this step does not yet require the distributional Gaussian assumption beyond the mean and variance.

6.1 Variance estimation

Recall that the asymptotic learning rate $h_\infty$ in Equation (3) only depends on $C(0)$ and $C'(0)$. So if we estimate these values, we are certain to get the right RFD step sizes asymptotically without knowing the entire covariance kernel $C$. Equation (5) reveals that for $Z_b := (\mathbf{L}_b(w) - \mu)^2$ we have
$$ \mathbb{E}[Z_b] = \beta_0 + \tfrac{1}{b}\beta_1, \quad \text{i.e.} \quad Z_b = \beta_0 + \tfrac{1}{b}\beta_1 + \text{noise}, $$
with bias $\beta_0 = C(0)$ and slope $\beta_1 = C_\epsilon(0)$. So a linear regression on samples $(\tfrac{1}{b_k}, Z_{b_k})_{k \le n}$ allows for the estimation of $\beta_0$ and $\beta_1$.

Using the Gaussian assumption from (5), the variance of $Z_b$ is the centered fourth moment of $\mathbf{L}_b$, which is given by
$$ \sigma_b^2 := \operatorname{Var}(Z_b) = \mathbb{E}\big[(\mathbf{L}_b(w) - \mu)^4\big] - \mathbb{E}\big[(\mathbf{L}_b(w) - \mu)^2\big]^2 = 2\operatorname{Var}(\mathbf{L}_b(w))^2 = 2\big(\beta_0 + \tfrac{1}{b}\beta_1\big)^2. $$
In particular the variance of $Z_b$ depends on the batch size $b$: the linear regression is heteroskedastic. Weighted least squares (WLS) [e.g. 28, Theorem 4.2] is designed to handle this case, but its application requires the variance of $Z_b$. Since $\beta_0, \beta_1$ are the parameters we wish to estimate, we find ourselves in the paradoxical situation that we need $\beta$ to obtain $\beta$. Our solution to this problem is to start with a guess of $\beta_0, \beta_1$, apply WLS to obtain a better estimate and repeat this bootstrapping procedure until convergence. Since all $Z_b$ have the same underlying cost $\mathbf{J}$, we sample the parameters $w$ randomly to reduce their covariance (details in Sec. B).

The same procedure can be applied to obtain $C'(0)$, where the counterpart of Equation (5) is given by
$$ \operatorname{Var}(\partial_i \mathbf{L}_b(w)) = \operatorname{Var}(\partial_i \mathbf{J}(w)) + \tfrac{1}{b}\operatorname{Var}(\partial_i \boldsymbol\epsilon_1(w)) \overset{\text{isotropy}}{=} -\Big(C'(0) + \tfrac{1}{b} C_\epsilon'(0)\Big). $$

Remark 6.1. Under the isotropy assumption the partial derivatives are iid, so the expectation of $\|\nabla\mathbf{L}_b(w)\|^2 = \sum_{i=1}^d (\partial_i \mathbf{L}_b(w))^2$ is this variance scaled by $d$. In particular the variance needs to scale with $\tfrac{1}{d}$ to keep the gradient norms (and thus the Lipschitz constant of $\mathbf{J}$) stable. This observation is closely related to isoperimetry [e.g. 7]; for details see [5]. Removing the isotropy assumption and estimating the variance component-wise is most likely how adaptive step sizes [e.g. 14, 29], like the ones used by Adam, work (cf. Sec. E.1).

Batch size distribution. Before we can apply linear regression to the samples $(\tfrac{1}{b_k}, Z_{b_k})_{k \le n}$, it is necessary to choose the batch sizes $b_k$. As this choice is left to us, we calculate the variance of our estimator $\hat\beta_0$ of $\beta_0$ explicitly (Lemma B.2), in order to minimize this variance over the selection of batch sizes subject to a sample budget $\alpha$:
$$ \min_{n, b_1, \dots, b_n} \operatorname{Var}(\hat\beta_0) \quad \text{s.t.} \quad \underbrace{\sum_{k=1}^n b_k}_{\text{samples used}} \le \alpha. \tag{6} $$
Since this optimization problem is very difficult to solve, we rephrase it in terms of the empirical distribution of batch sizes $\nu_n = \tfrac{1}{n}\sum_{i=1}^n \delta_{b_i}$. Optimizing over distributions is still difficult, but we explain in Section B.1 how to heuristically arrive at the parametrization
$$ \nu(b) \propto \exp\Big( -\lambda_1 \tfrac{1}{\sigma_b^2} - \lambda_2 b \Big), \qquad b \in \mathbb{N}, $$
where the parameters $\lambda_1, \lambda_2 \ge 0$ can then be used to optimize (6). Due to our usage of $\sigma_b^2$ this has to be bootstrapped.

Covariance estimation. While the variance estimates above ensure correct asymptotic learning rates, we motivated in Section 5.2 that asymptotic learning rates alone would result in too large step sizes at the beginning. We therefore use the estimates of $C(0)$ and $C'(0)$ to fit a covariance model, effectively acting as a gradient clipper while retaining the asymptotic guarantees. Note that covariance models with no more than two parameters are generally fully determined by these values.

6.2 Stochastic RFD (S-RFD)

It is reasonable to ask whether there is a "stochastic gradient descent"-like counterpart to the "gradient descent"-like RFD. The answer is yes, and we already have all the required machinery.

Extension 6.2 (S-RFD). For cost $\mathbf{J} \sim \mathcal{N}(\mu, C)$ and stochastic errors $\boldsymbol\epsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, C_\epsilon)$ we have
$$ \operatorname*{argmin}_d\, \mathbb{E}[\mathbf{J}(w - d) \mid \mathbf{L}_b(w), \nabla\mathbf{L}_b(w)] = \eta^*(\tilde\Theta)\, \frac{\nabla\mathbf{L}_b(w)}{\|\nabla\mathbf{L}_b(w)\|}, $$
with the same step size function $\eta^*$ as for RFD, but a modified gradient cost quotient
$$ \tilde\Theta = \frac{C'(0)}{C'(0) + \tfrac{1}{b} C_\epsilon'(0)} \cdot \frac{C(0) + \tfrac{1}{b} C_\epsilon(0)}{C(0)} \cdot \frac{\|\nabla\mathbf{L}_b(w)\|}{\mu - \mathbf{L}_b(w)}. $$
Note that our non-parametric covariance estimation already provides us with estimates of $C_\epsilon(0)$ and $C_\epsilon'(0)$, so no further adaptations are needed. The resulting asymptotic learning rate is given by
$$ h_\infty = \frac{C(0) + \tfrac{1}{b} C_\epsilon(0)}{\big(C'(0) + \tfrac{1}{b} C_\epsilon'(0)\big)\big(\mathbf{L}_b(w_\infty) - \mu\big)}. \tag{7} $$
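To make the bootstrapped WLS procedure of Section 6.1 concrete, here is a minimal sketch on synthetic data; the synthetic loss samples, the batch size range and the stopping rule are illustrative assumptions and only stand in for real mini-batch losses $\mathbf{L}_b(w)$ evaluated at randomly drawn parameters $w$.

```python
import numpy as np

# Sketch of the bootstrapped weighted least squares regression of Section 6.1:
# regress Z_b = (L_b(w) - mu)^2 on 1/b to estimate beta0 = C(0) and beta1 = C_eps(0),
# iterating the weights 1/sigma_b^2 with sigma_b^2 = 2*(beta0 + beta1/b)^2.
rng = np.random.default_rng(0)
true_b0, true_b1, mu = 0.5, 3.0, 2.7                  # synthetic ground truth, illustrative
batch_sizes = rng.integers(16, 2048, size=400)
var_b = true_b0 + true_b1 / batch_sizes
z = (rng.normal(mu, np.sqrt(var_b)) - mu) ** 2        # one loss sample Z_b per batch size

X = np.column_stack([np.ones_like(var_b), 1.0 / batch_sizes])   # design matrix [1, 1/b]
beta = np.array([np.mean(z), 0.0])                               # crude initial guess
for _ in range(20):                                               # bootstrap WLS until convergence
    fitted = np.clip(X @ beta, 1e-12, None)                       # current beta0 + beta1/b
    weights = 1.0 / (2.0 * fitted**2)                             # 1/sigma_b^2 from current beta
    W = np.diag(weights)
    beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
    if np.allclose(beta_new, beta, rtol=1e-6):
        break
    beta = beta_new
print(beta)   # approximately (C(0), C_eps(0)) = (0.5, 3.0) up to sampling noise
```

Each pass recomputes the weights $1/\sigma_b^2 = 1/(2(\beta_0 + \beta_1/b)^2)$ from the current estimate, which resolves the "need $\beta$ to obtain $\beta$" circularity described above.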
7 MNIST case study

For our case study we use the negative log likelihood loss to train a neural network [3, M7] on the MNIST dataset [34]. We chose this model as one of the simplest state-of-the-art models at the time of selection, consisting only of convolutional layers with ReLU activation interspersed with batch normalization layers and a single dense layer at the end with softmax activation. Assuming isotropy, we estimate $\mu$, $C(0)$ and $C'(0)$ as described in Section 6.1 and deduce the parameters $\sigma^2$ and $s$ of the respective covariance model (more details in Section B). We then use the step sizes listed in Table 1 for the squared exponential and rational quadratic covariance in our RFD algorithm.

In Figure 3, RFD is benchmarked against step size tuned Adam [29] and stochastic gradient descent (SGD). Even with early stopping, their tuning would typically require more than one epoch worth of samples, in contrast to RFD (Section A.1.1). We highlight that A-RFD performs significantly worse than either of the RFD versions, which effectively implement some form of learning rate warmup. This is despite the RFD learning rates converging to the asymptotic one within one epoch (ca. 30 out of 60 steps per epoch). The step sizes, on the other hand, are (up to noise) monotonically decreasing. This stands in contrast to the "wall next to the minimum" motivation of gradient clipping.

Code availability: Our implementation of RFD can be found at https://github.com/FelixBenning/pyrfd and the package can also be installed from PyPI via pip install pyrfd.

8 Limitations and extensions

To cover the vast amount of ground that lies between the formulation of a general average case optimization problem and the prototype of a working optimizer with theoretical backing,
1. we used the common [16, 52, 55, 10] isotropic and Gaussian distributional assumption for $\mathbf{J}$,
2. we used very simple covariance models for the actual implementation,
3. we used WLS in our variance estimation procedure despite the violation of independence.

Since RFD is defined as the minimizer of an average instead of an upper bound (making it more risk-affine), it naturally loses the improvement guarantee driving classical convergence proofs. It is therefore impossible to extend classical optimization proofs and new mathematical theory must be developed. This risk-affinity can also be observed in its comparatively large step sizes (cf. Fig. 3 and Sec. A). On CIFAR-100 [31], the step sizes were too large and it is an open question whether assumptions were violated or whether RFD is simply too risk-affine. But since the variance of random functions vanishes asymptotically in high dimension [5] we highly suspect the former (cf. Remark E.5). Future work will therefore have to target these assumptions. Some of the assumptions were already simplifications for the sake of exposition, and we deferred their relaxation to the appendix. The Gaussian assumption can be relaxed with a generalization to the BLUE (Sec. E.3), isotropy can be generalized to geometric anisotropies (Sec. E.1), and the risk-affinity of RFD can be reduced with confidence intervals (Sec. E.2). Since simple random linear models already violate stationary isotropy (Sec. F.1), we believe that stationarity is the most important assumption to attack in future work.
9 Conclusion

In this paper we have demonstrated the viability (computability and scalable complexity) and advantages (scale invariance, an explainable step size schedule which does not require expensive tuning) of replacing the classical convex function framework with the random function framework. Along the way we bridged the gap between Bayesian optimization (not scalable so far) and classical optimization methods (scalable). This theoretical framework not only sheds light on existing step size heuristics, but can also be used to develop future heuristics.

Figure 3: Training on the MNIST dataset (batch size 1024). Ribbons describe the range between the 10% and 90% quantile of 20 repeated experiments while lines represent their mean. SE stands for the squared exponential (11) and RQ for the rational quadratic (13) covariance. The validation loss uses the test data set, which provides a small advantage to Adam and SGD, as we also use it for tuning. [Panels: validation loss per epoch, final validation loss over learning rates, and learning rates/step sizes per step, for RFD(RQ(beta=1)), A-RFD, SGD(lr=1) and Adam(lr=1e-2).]

We envision the following improvements to RFD in the future:
1. The reliability of RFD can be improved by generalizing the distributional assumptions to cover more real world scenarios. In particular we are interested in the generalization to non-stationary isotropy, because we suspect that regularization such as weight and batch normalization [46, 25] is used to patch violations of stationarity (cf. Section F).
2. The performance of RFD can also be improved. Since RFD is forgetful while momentum methods retain some information, it is likely fruitful to relax the full forgetfulness. Furthermore, we suspect that adaptive learning rates [e.g. 14, 29], such as those used by Adam, can be incorporated with geometric anisotropies (cf. Sec. E.1). Performance could also be further improved by estimating the covariance (locally) online instead of globally at the start. Finally, the implementation itself can be made more performant.

Acknowledgements. We extend our sincere gratitude to our colleagues at the University of Mannheim, with special thanks to Rainer Gemulla and Julie Naegelen for insightful discussions and invaluable feedback. The experiments in this work were partially carried out on the compute cluster of the state of Baden-Württemberg (bwHPC).

References

[1] Robert J. Adler and Jonathan E. Taylor. Random Fields and Geometry. Springer Monographs in Mathematics. New York, NY: Springer New York, 2007. ISBN: 978-0-387-48112-8. DOI: 10.1007/978-0-387-48116-6.
[2] Apoorv Agnihotri and Nipun Batra. "Exploring Bayesian Optimization". In: Distill 5.5 (2020-05-05), e26. ISSN: 2476-0757. DOI: 10.23915/distill.00026.
[3] Sanghyeon An et al. An Ensemble of Simple Convolutional Neural Network Models for MNIST Digit Recognition. 2020-10-04. DOI: 10.48550/arXiv.2008.10400. Pre-published.
[4] Antonio Auffinger and Qiang Zeng. "Complexity of Gaussian Random Fields with Isotropic Increments". In: Communications in Mathematical Physics 402.1 (2023-08-01), pp. 951-993. ISSN: 1432-0916. DOI: 10.1007/s00220-023-04739-0.
[5] Felix Benning and Leif Döring. Gradient Span Algorithms Make Predictable Progress in High Dimension. 2024-10-13. DOI: 10.48550/arXiv.2410.09973. Pre-published.
[6] Karl Heinz Borgwardt. The Simplex Method: A Probabilistic Analysis.
Softcover reprint of the original 1st ed. 1987 edition. Berlin Heidelberg: Springer, 1986-11-01. 282 pp. ISBN: 978-3-540-17096-9. [7] Sebastien Bubeck and Mark Sellke. A Universal Law of Robustness via Isoperimetry . In: Advances in Neural Information Processing Systems. Vol. 34. Virtual Event: Curran Associates, Inc., 2021, pp. 28811 28822. ar Xiv: 2105 . 12806 [cs, stat]. URL: https : / / proceedings . neurips . cc / paper / 2021 / hash / f197002b9a0853eca5e046d9ca4663d5-Abstract.html (visited on 2023-09-22). [8] Youngmin Cho and Lawrence Saul. Kernel Methods for Deep Learning . In: Advances in Neural Information Processing Systems. Vol. 22. Curran Associates, Inc., 2009. URL: https : / / proceedings . neurips . cc / paper / 2009 / hash / 5751ec3e9a4feab575962e78e006250d-Abstract.html (visited on 2023-04-03). [9] Leonardo Cunha et al. Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime . In: Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, 2022-06-28, pp. 4474 4491. URL: https://proceedings.mlr.press/v162/cunha22a.html (visited on 2023-11-09). [10] Yann N Dauphin et al. Identifying and Attacking the Saddle Point Problem in High Dimensional Non-Convex Optimization . In: Advances in Neural Information Processing Systems. Vol. 27. Montréal, Canada: Curran Associates, Inc., 2014. URL: https://proceedings. neurips.cc/paper/2014/hash/17e23e50bedc63b4095e3d8204ce063b-Abstract. html (visited on 2022-06-10). [11] Aaron Defazio and Konstantin Mishchenko. Learning-Rate-Free Learning by D-Adaptation . In: Proceedings of the 40th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, 2023-07-03, pp. 7449 7479. ar Xiv: 2301.07733 [cs.LG]. URL: https://proceedings.mlr.press/v202/defazio23a.html (visited on 2024-03-31). [12] Percy Deift and Thomas Trogdon. The Conjugate Gradient Algorithm on Well-Conditioned Wishart Matrices Is Almost Deterministic . In: Quarterly of Applied Mathematics 79.1 (202103), pp. 125 161. ISSN: 0033-569X, 1552-4485. DOI: 10.1090/qam/1574. ar Xiv: 1901. 09007 [cs, math]. [13] P. Deuflhard and G. Heindl. Affine Invariant Convergence Theorems for Newton s Method and Extensions to Related Methods . In: SIAM Journal on Numerical Analysis 16.1 (1979-02), pp. 1 10. ISSN: 0036-1429. DOI: 10.1137/0716001. [14] John Duchi, Elad Hazan, and Yoram Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization . In: The Journal of Machine Learning Research 12 (2011-07-01), pp. 2121 2159. ISSN: 1532-4435. [15] Ahmed El Alaoui, Andrea Montanari, and Mark Sellke. Optimization of Mean-Field Spin Glasses . In: The Annals of Probability 49.6 (2021-11), pp. 2922 2960. DOI: 10.1214/21AOP1519. [16] Peter I. Frazier. Bayesian Optimization . In: Recent Advances in Optimization and Modeling of Contemporary Problems. INFORMS Tut ORials in Operations Research. Phoenix, Arizona, USA: INFORMS, 2018-10, pp. 255 278. ISBN: 978-0-9906153-2-3. DOI: 10.1287/educ. 2018.0188. ar Xiv: 1807.02811 [cs, math, stat]. [17] Fuchang Gao and Lixing Han. Implementing the Nelder-Mead Simplex Algorithm with Adaptive Parameters . In: Computational Optimization and Applications 51.1 (2012-01-01), pp. 259 277. ISSN: 1573-2894. DOI: 10.1007/s10589-010-9329-3. [18] Xavier Glorot and Yoshua Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks . 
In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Sardinia, Italy: JMLR Workshop and Conference Proceedings, 2010-03-31, pp. 249 256. URL: https://proceedings.mlr.press/v9/ glorot10a.html (visited on 2023-04-11). [19] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016-11-10. 801 pp. ISBN: 978-0-262-33737-3. Google Books: omiv DQAAQBAJ. [20] Priya Goyal et al. Accurate, Large Minibatch SGD: Training Image Net in 1 Hour. ar Xiv, 201804-30. ar Xiv: 1706.02677 [cs]. URL: http://arxiv.org/abs/1706.02677 (visited on 2024-04-02). [21] Anders Hansson. Optimization for Learning and Control. Hoboken, New Jersey: John Wiley & Sons, Inc., 2023. ISBN: 978-1-119-80914-2. [22] Geoffrey Hinton. Neural Networks for Machine Learning . Massive Open Online Course (Coursera). 2012. URL: https://www.cs.toronto.edu/~hinton/coursera_lectures. html (visited on 2021-11-16). [23] C. A. R. Hoare. Quicksort . In: The Computer Journal 5.1 (1962-01-01), pp. 10 16. ISSN: 0010-4620. DOI: 10.1093/comjnl/5.1.10. [24] Brice Huang and Mark Sellke. Tight Lipschitz Hardness for Optimizing Mean Field Spin Glasses . In: 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). 2022-10, pp. 312 322. DOI: 10.1109/FOCS54457.2022.00037. [25] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift . In: Proceedings of the 32nd International Conference on Machine Learning. International Conference on Machine Learning. PMLR, 2015-06-01, pp. 448 456. ar Xiv: 1502.03167 [cs]. URL: https://proceedings.mlr. press/v37/ioffe15.html (visited on 2021-10-06). [26] E. T. Jaynes. Information Theory and Statistical Mechanics . In: Physical Review 106.4 (1957-05-15), pp. 620 630. DOI: 10.1103/Phys Rev.106.620. [27] Richard Arnold Johnson and Dean W. Wichern. Applied Multivariate Statistical Analysis. 6th ed. Upper Saddle River, N.J: Pearson College Div, 2007. 767 pp. ISBN: 978-0-13-1877153. [28] Steven M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. USA: Prentice-Hall, Inc., 1993-02. 595 pp. ISBN: 978-0-13-345711-7. [29] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization . In: Proceedings of the 3rd International Conference on Learning Representations. ICLR. San Diego, 2015. ar Xiv: 1412.6980. [30] Achim Klenke. Probability Theory: A Comprehensive Course. Universitext. London: Springer, 2014. ISBN: 978-1-4471-5360-3 978-1-4471-5361-0. DOI: 10.1007/978-1-4471-5361-0. [31] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009. URL: https: //www.cs.toronto.edu/%20kriz/learning-features-2009-TR.pdf (visited on 2024-05-21). [32] H. J. Kushner. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise . In: Journal of Basic Engineering 86.1 (1964-03-01), pp. 97 106. ISSN: 0021-9223. DOI: 10.1115/1.3653121. [33] Jonathan Lacotte and Mert Pilanci. Optimal Randomized First-Order Methods for Least Squares Problems . In: Proceedings of the 37th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, 2020-11-21, pp. 5587 5597. URL: https://proceedings.mlr.press/v119/lacotte20a.html (visited on 2023-11-09). [34] Yann Le Cun, Corinna Cortes, and Christopher J.C. 
Burges. THE MNIST DATABASE of Handwritten Digits. 2010. URL: http://yann.lecun.com/exdb/mnist/ (visited on 2024-01-24). [35] Daniel James Lizotte. Practical Bayesian Optimization . Ph D thesis. Alberta, Canada: University of Alberta, 2008. 168 pp. [36] G. Matheron. The Intrinsic Random Functions and Their Applications . In: Advances in Applied Probability 5.3 (1973-12), pp. 439 468. ISSN: 0001-8678, 1475-6064. DOI: 10.2307/ 1425829. [37] Andrea Montanari. Optimization of the Sherrington Kirkpatrick Hamiltonian . In: SIAM Journal on Computing (2021-01-07), FOCS19 1. ISSN: 0097-5397. DOI: 10.1137/20M132016X. [38] Yurii Evgen eviˇc Nesterov. Lectures on Convex Optimization. Second edition. Springer Optimization and Its Applications; Volume 137. Cham: Springer, 2018. ISBN: 978-3-319-91578-4. DOI: 10.1007/978-3-319-91578-4. [39] Misha Padidar et al. Scaling Gaussian Processes with Derivative Information Using Variational Inference . In: Advances in Neural Information Processing Systems. Vol. 34. Curran Associates, Inc., 2021, pp. 6442 6453. URL: https://proceedings.neurips.cc/paper/2021/ hash/32bbf7b2bc4ed14eb1e9c2580056a989-Abstract.html (visited on 2023-05-17). [40] Courtney Paquette et al. Halting Time Is Predictable for Large Models: A Universality Property and Average-Case Analysis . In: Foundations of Computational Mathematics 23.2 (2022-02), pp. 597 673. ISSN: 1615-3383. DOI: 10.1007/s10208-022-09554-y. ar Xiv: 2006.04299 [math, stat]. [41] Elliot Paquette and Thomas Trogdon. Universality for the Conjugate Gradient and MINRES Algorithms on Sample Covariance Matrices . In: Communications on Pure and Applied Mathematics 76.5 (2022-09-01), pp. 1085 1136. ISSN: 1097-0312. DOI: 10.1002/cpa. 22081. [42] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the Difficulty of Training Recurrent Neural Networks . In: Proceedings of the 30th International Conference on Machine Learning. International Conference on Machine Learning. Atlanta: PMLR, 2013-05-26, pp. 1310 1318. URL: https://proceedings.mlr.press/v28/pascanu13.html (visited on 2024-04-02). [43] Fabian Pedregosa and Damien Scieur. Acceleration through Spectral Density Estimation . In: Proceedings of the 37th International Conference on Machine Learning. International Conference on Machine Learning. Virtual Event (formerly Vienna): PMLR, 2020-11-21, pp. 7553 7562. URL: https://proceedings.mlr.press/v119/pedregosa20a.html (visited on 2023-11-09). [44] Carl Edward Rasmussen and Christopher K.I. Williams. Gaussian Processes for Machine Learning. 2nd ed. Adaptive Computation and Machine Learning 3. Cambridge, Massachusetts: MIT Press, 2006. 248 pp. ISBN: 0-262-18253-X. URL: http://gaussianprocess.org/ gpml/chapters/RW.pdf (visited on 2023-04-06). [45] Filip de Roos, Alexandra Gessner, and Philipp Hennig. High-Dimensional Gaussian Process Inference with Derivatives . In: Proceedings of the 38th International Conference on Machine Learning. International Conference on Machine Learning. PMLR, 2021-07-01, pp. 2535 2545. URL: https://proceedings.mlr.press/v139/de-roos21a.html (visited on 2023-05-15). [46] Tim Salimans and Durk P Kingma. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks . In: Advances in Neural Information Processing Systems. Vol. 29. Barcelona, Spain: Curran Associates, Inc., 2016. URL: https : / / proceedings . neurips . cc / paper / 2016 / hash / ed265bc903a5a097f61d3ec064d96d2e-Abstract.html (visited on 2023-10-16). [47] Bobak Shahriari et al. 
"Taking the Human Out of the Loop: A Review of Bayesian Optimization". In: Proceedings of the IEEE 104.1 (2016-01), pp. 148-175. ISSN: 1558-2256. DOI: 10.1109/JPROC.2015.2494218.
[48] Leslie N. Smith. A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 - Learning Rate, Batch Size, Momentum, and Weight Decay. 2018-04-24. DOI: 10.48550/arXiv.1803.09820. arXiv: 1803.09820 [cs, stat]. Pre-published.
[49] Leslie N. Smith. "Cyclical Learning Rates for Training Neural Networks". In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). 2017-03, pp. 464-472. DOI: 10.1109/WACV.2017.58.
[50] Michael L. Stein. Interpolation of Spatial Data. Springer Series in Statistics. New York, NY: Springer, 1999. ISBN: 978-1-4612-7166-6, 978-1-4612-1494-6. DOI: 10.1007/978-1-4612-1494-6.
[51] Eliran Subag. "Following the Ground States of Full-RSB Spherical Spin Glasses". In: Communications on Pure and Applied Mathematics 74.5 (2021), pp. 1021-1044. ISSN: 1097-0312. DOI: 10.1002/cpa.21922.
[52] Ziyu Wang et al. "Bayesian Optimization in a Billion Dimensions via Random Embeddings". In: Journal of Artificial Intelligence Research 55 (2016-02-19), pp. 361-387. ISSN: 1076-9757. DOI: 10.1613/jair.4806.
[53] C.K.I. Williams and D. Barber. "Bayesian Classification with Gaussian Processes". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.12 (1998-12), pp. 1342-1351. ISSN: 1939-3539. DOI: 10.1109/34.735807.
[54] Christopher K. I. Williams. "Computation with Infinite Neural Networks". In: Neural Computation 10.5 (1998-07-01), pp. 1203-1216. ISSN: 0899-7667. DOI: 10.1162/089976698300017412.
[55] Jian Wu et al. "Bayesian Optimization with Gradients". In: Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/64a08e5f1e6c39faeb90108c430eb120-Abstract.html (visited on 2022-06-02).
[56] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. 2017-09-15. DOI: 10.48550/arXiv.1708.07747. arXiv: 1708.07747 [cs, stat]. Pre-published.
[57] Matthew D. Zeiler. "ADADELTA: An Adaptive Learning Rate Method". 2012-12-22. arXiv: 1212.5701 [cs].
[58] Guodong Zhang et al. "Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model". In: Advances in Neural Information Processing Systems. Vol. 32. Curran Associates, Inc., 2019. arXiv: 1907.04164 [cs, stat]. URL: https://proceedings.neurips.cc/paper/2019/hash/e0eacd983971634327ae1819ea8b6214-Abstract.html (visited on 2023-11-09).

Figure 4: Visualization of the variance estimation (Section 6.1) with 95%-confidence intervals based on the assumed distribution. Quantile-quantile (QQ) plots of the losses (against a normal distribution), squared losses (against a chi-squared(1) distribution) and squared gradient norms (against a chi-squared(d) distribution) are displayed on the right for a selection of batch sizes. [Panels: batch-wise mean losses (estimated mean 2.73), batch-wise mean squared losses and batch-wise mean squared gradient norms plotted against 1/b, each with QQ plots for batch sizes b = 10, 100, 1000.]
Appendix: Random Function Descent A Experiments A.1 Covariance estimation In Figure 4 we visualize weighted least squares (WLS) regression of the covariance estimation from Section 6.1. Note, that we sampled much more samples per batch size for these plots than RFD would typically require by itself in order to be able to plot batch-wise means and batch-wise QQ-plots. The batch size distribution we described in Section B.1 would avoid sampling the same batch size multiple times to ensure better stability of the regression and generally requires much fewer samples than were used for this visualization (cf. A.1.1) We can observe from the QQ-plots on the right, that the Gaussian assumption is essentially justified for the losses, resulting in a χ2(1) distribution for the squared losses and a χ2(d) distribution for the gradient norms squared. The confidence interval estimate for the squared norms appears to be much too small (it is plotted, but too small to be visible). Perhaps this signifies a violation of the isotropy 10.0 12.5 15.0 17.5 20.0 22.5 25.0 Asymptotic learning rate 0 100000 200000 300000 400000 500000 Sample cost Figure 5: 20 repeated covariance estimations of model M7 [3] applied to the MNIST dataset. On the left are the resulting asymptotic learning rates (assuming a final loss of zero) and on the right are the samples used until the stopping criterion interrupted sampling. assumption as the variance of i=1 ( i Lb(w))2 does not appear to be the variance of independent χ2(d) Gaussian random variables, and the independence only follows from the isotropy assumption. A.1.1 Sampling efficiency and stability To evaluate the sampling efficiency and stability of our variance estimation process, we repeated the covariance estimation of the model model M7 [3] applied to the MNIST dataset 20 times (Figure 5). We used a tolerance of tol = 0.3 as a stopping criterion for the estimated relative standard deviation (10). At this tolerance, the asymptotic learning rate already seems relatively stable (in the same order of magnitude) and the sample cost is quite cheap. The majority of runs (16/20 runs or 80%) required less than 60 000 samples (1 epoch). There was one large outlier which used 500 589 samples. A closer inspection revealed, that after the initial sample to estimate the optimal batch size distribution, it sampled almost exclusively at batch sizes 20 (which was the minimal cutoff to avoid instabilities caused by batch normalization) and batch sizes between 1700 and 1900. It therefore seems like the initial batch of samples caused a very unfavorable batch size distribution which then required a lot of samples to recover from. Our selection of an initial sample size of 6000 might therefore have been too small. A more extensive empirical study is needed to tune this estimation process, but the process promises to be very sample efficient. Classical step size tuning would train models for a short duration in order to evaluate the performance of a particular learning rate [e.g. 48], but a single epoch worth of samples is very hard to beat. Our implementation of this process on the other hand is very inefficient as of writing. Piping data of differing batch sizes into a model is not a standard use case. We implement this by repeatedly initializing data loaders, which is anything but performance friendly. A.2 Other models and datasets To estimate the effect of the batch size on RFD, we trained the same model (M7 [3]) on MNIST with batch size 128 (Figure 6). 
We can see that the asymptotic learning rate of S-RFD is reduced at a smaller batch size (cf. Equation 7) but the performance is barely different. Overall, RFD seems to be slightly too risk-affine, selecting larger step sizes than the tuned SGD models. We also trained a different model (M5 [3]) on the Fashion MNIST dataset [56] with batch size 128 (Figure 7). Since the validation loss increases after epoch 5, early stopping would have been 0 5 10 15 20 25 30 Epoch Validation loss SGD(lr=0.1) Adam(lr=1e-3) 10 4 10 3 10 2 10 1 100 101 Learning rate Final validation loss 0 20 40 60 80 100 Step Learning rate Learning rate 0 2000 4000 6000 8000 10000 12000 14000 Step SGD(lr=0.1) Adam(lr=1e-3) Figure 6: Training model M7 [3] with batch size 128 on MNIST [34]. appropriate. We therefore include Adam with learning rate 10 3, despite Adam with learning rate 10 4 technically performing better at the end of training. We can generally see, that RFD comes very close to tuned performance at the time early stopping would have been appropriate. Again, learning rates seem to be slightly too large (risk-affine) in comparison to tuned SGD. B Variance estimation in detail Recall that we are interested in the regression Zb(w) = (Lb(w) µ)2 β0 + 1 where the variance of Zb is given by σ2 b = 2(β0 + 1 under the Gaussian assumption on Lb. More specifically we for minibatch sizes (bk)k n and parameter vectors (wk)k n we want to sample mini batch losses L(k) := Lbk(wk) = J(wk) + 1 i=1 ϵk,i(wk) As the ϵk,i are all conditionally independent and therefore uncorrelated, we have Cov(L(k), L(l)) = Cov(J(wk), J(wl)) = C wk wl 2 Since the covariance kernel C is typically monotonously falling in the distance of parameters wk wl 2, we want to select them as spaced out as possible to minimize the covariance of L(k) (which is the next best thing to iid samples). Randomly selecting wi with Glorot initialization [18] will ensure a good spread. Note that Glorot initialization places all parameters approximately on the same sphere. This is because Glorot initialization initializes all parameters independently, therefore their norm is the 0 5 10 15 20 25 30 Epoch Validation loss SGD(lr=0.1) Adam(lr=1e-3) Adam(lr=1e-4) 10 4 10 3 10 2 10 1 100 101 Learning rate Final validation loss 0 20 40 60 80 100 Step Learning rate Learning rate 0 2000 4000 6000 8000 10000 12000 14000 Step SGD(lr=0.1) Adam(lr=1e-3) Adam(lr=1e-4) Figure 7: Model M5 [3] trained on Fashion MNIST [56] with batch size 128. sum of independent squares, which converges by the law of large numbers due to the normalization Glorot uses. Since stationary isotropy and non-stationary isotropy coincides on the sphere, this is an important effect to consider (cf. Section F). What is left, is the selection of the batch sizes bk. B.1 Batch size distribution Since we plan to use the data set ( 1 bk , L(k))k n for weighted least squares (WLS) regression and do not have a selection process for the batch sizes bk yet, it might be appropriate to select the batch sizes bk in such a way, that the variance of our estimator ˆβ0 of β0 is minimized. Here we choose Var(ˆβ0) and not Var(ˆβ1) as our optimization target, since β0 = C(0) is used to fit the covariance model, while β1 = Cϵ(0) is only required for S-RFD. Without deeper analysis β0 therefore seems to be more important. 
Optimization over n parameters bk is quite difficult, but we can simplify this optimization problem by considering the empirical batch size distribution Using a random variable B distributed according to νn, the total number of sample losses can then be expressed as k=1 bk = n E[B] = samples used. Under an (unrealistic) independence assumption, the variance Var(ˆβ0) also has a simple representation in terms of νn (Lemma B.2). We now want to minimize this variance subject to compute constraint α limiting the number of sample losses we can use resulting in the optimization problem Var( ˆβ0) = 1 σ2 B ] | {z } variance of ZB E[ 1 σ2 BB2 ] E h 1 σ2 B 1 B E[ 1 Bσ2 BE[1/σ2 B]] 2i | {z } inverse of the spread of 1 s.t. n E[B] α. (8) where we recall that σ2 B is the variance of ZB. So the inverse of the expectation of its inverse is roughly the average variance of ZB. The second half is the fraction of a weighted second moment divided by the weighted variance. Unless the mean is at zero, the former will be larger. In particular we want a spread of data otherwise the variance would be zero. This is in some conflict with the variance of ZB. But first, let us get rid of n. Note that we would always increase n until our compute budget is used up, since this always reduces variance. So we approximately have n E[B] = α. Thus Var( ˆβ0) = E[B] E[ 1 σ2 BB2 ] E h 1 σ2 B 1 B E[ 1 Bσ2 BE[1/σ2 B]] 2i Since α is now just resulting in a constant factor, it can be assumed to be 1 without loss of generality. Over batch size distributions ν we therefore want to solve the minimization problem min ν E[B] E[ 1 σ2 B ] | {z } moments E[ 1 σ2 BB2 ] E h 1 σ2 B 1 B E[ 1 Bσ2 BE[1/σ2 B]] 2i | {z } spread Example B.1 (If we did not require spread). If we were not concerned with the variance of batch sizes, we could select a constant B = b. Then it is straightforward to minimize the moments factor manually min b E[b] E[ 1 σ2 b ] = bσ2 b = 2b(β0 + 1 resulting in 1 β1 . In other words: If we did not have to be concerned with the spread of B there is one optimal selection to minimize the first factor. But in reality we have to trade-off this target with the spread of B. To ensure a good spread of data, we use the maximum entropy distribution for B, with the moment constraints n average sample usage σ2 B ] θ ZB variance which capture the first factor. Maximizing entropy under moment constraints is known [26] to result in the Boltzmann (a.k.a. Gibbs) distribution ν(b) = P(B = b) exp λ1 1 σ2 b λ2b , where λ1, λ2 depend on the momentum constraints. We can now forget the origin of this distribution and use λ1, λ2 as parameters for the distribution ν in Equation (9) to get close to its minimum. In practice we use a zero order black box optimizer (Nelder-Mead [17]). One could calculate the expectations of (9) under this distribution explicitly and take manual derivatives with respect to λi to investigate this further, but we wanted to avoid getting too distracted by this tangent. We also use the estimated relative standard deviation as a stopping criterion for sampling. Without extensive testing we found a tolerance of rel_std < tol = 0.3 to be reasonable, cf. Section A.1.1. Lemma B.2 (Variance of ˆβ0 in terms of the empirical batch size distribution). Assuming independence of the samples (( 1 bk ), Zbk)k n, the variance of ˆβ0 is given by Var( ˆβ0) = 1 E[ 1 σ2 BB2 ] E h 1 σ2 B 1 B E[ 1 Bσ2 BE[1/σ2 B]] 2i where B is distributed according to the empirical batch size distribution νn = 1 n Pn k=1 δbk. Proof. 
With the notation $\sigma_k^2 = \sigma_{b_k}^2$ for the variance of $Z_{b_k}$, it follows from [cf. 28, Thm. 4.2] that the covariance of the estimator $\hat\beta$ of $\beta$ using $n$ samples is given by
$$\operatorname{Var}(\hat\beta) = (H^T C^{-1} H)^{-1}
= \frac{1}{\sum_k \frac{1}{\sigma_k^2}\sum_k \frac{1}{\sigma_k^2 b_k^2} - \bigl(\sum_k \frac{1}{\sigma_k^2 b_k}\bigr)^2}
\begin{pmatrix} \sum_k \frac{1}{\sigma_k^2 b_k^2} & -\sum_k \frac{1}{\sigma_k^2 b_k} \\ -\sum_k \frac{1}{\sigma_k^2 b_k} & \sum_k \frac{1}{\sigma_k^2} \end{pmatrix},
\qquad
C = \begin{pmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2 \end{pmatrix},
\quad
H = \begin{pmatrix} 1 & \frac{1}{b_1} \\ \vdots & \vdots \\ 1 & \frac{1}{b_n} \end{pmatrix}.$$
In particular we have
$$\operatorname{Var}(\hat\beta_0) = \frac{\sum_k \frac{1}{\sigma_k^2 b_k^2}}{\sum_k \frac{1}{\sigma_k^2}\sum_k \frac{1}{\sigma_k^2 b_k^2} - \bigl(\sum_k \frac{1}{\sigma_k^2 b_k}\bigr)^2}.$$
With the help of $\theta := \sum_j \frac{1}{\sigma_j^2}$ and $\lambda_k := \frac{1}{\sigma_k^2\theta}$ we can reorder the divisor. For this note that, since the $\lambda_k$ sum to 1, we have
$$\sum_k \lambda_k \Bigl(\frac{1}{b_k} - \sum_j \lambda_j \frac{1}{b_j}\Bigr)^2 = \sum_k \lambda_k \frac{1}{b_k^2} - \Bigl(\sum_j \lambda_j \frac{1}{b_j}\Bigr)^2,$$
which is essentially the well known statement $\mathbb{E}[(Y - \mathbb{E}[Y])^2] = \mathbb{E}[Y^2] - \mathbb{E}[Y]^2$ for an appropriate selection of $Y$. This implies that our divisor is given by a weighted variance,
$$\sum_k \frac{1}{\sigma_k^2}\sum_k \frac{1}{\sigma_k^2 b_k^2} - \Bigl(\sum_k \frac{1}{\sigma_k^2 b_k}\Bigr)^2 = \theta^2 \sum_k \lambda_k \Bigl(\frac{1}{b_k} - \sum_j \lambda_j \frac{1}{b_j}\Bigr)^2,$$
where it is only necessary to plug in the definition of $\theta$ to see that the right term is exactly our divisor. Expanding both the numerator and the divisor by $\frac{1}{n}$, we obtain
$$\operatorname{Var}(\hat\beta_0) = \frac{\frac{1}{n}\sum_k \frac{1}{\sigma_k^2 b_k^2}}{\frac{\theta^2}{n}\sum_k \lambda_k \bigl(\frac{1}{b_k} - \sum_j \lambda_j \frac{1}{b_j}\bigr)^2}.$$
Since $\theta = n\,\mathbb{E}[1/\sigma_B^2]$ for $B \sim \frac{1}{n}\sum_{k=1}^n \delta_{b_k}$ and $\lambda_k = \frac{1}{n\sigma_k^2\,\mathbb{E}[1/\sigma_B^2]}$, the above can thus be written as
$$\operatorname{Var}(\hat\beta_0) = \frac{1}{n}\;\frac{1}{\mathbb{E}[1/\sigma_B^2]}\;\frac{\mathbb{E}\bigl[\tfrac{1}{\sigma_B^2 B^2}\bigr]}{\mathbb{E}\Bigl[\tfrac{1}{\sigma_B^2}\bigl(\tfrac{1}{B} - \tfrac{\mathbb{E}[1/(B\sigma_B^2)]}{\mathbb{E}[1/\sigma_B^2]}\bigr)^2\Bigr]},$$
which proves our claim.

C Covariance models

In this section we calculate the step sizes of the covariance models listed in Table 1 and plotted in Figure 2. Additionally we calculate the asymptotic learning rate of A-RFD and prove an assumption of Corollary 5.3 for the squared exponential covariance (Prop. C.3).

C.1 Squared exponential

The squared exponential covariance function is given by
$$C\Bigl(\frac{\|x-y\|^2}{2}\Bigr) = \sigma^2 \exp\Bigl(-\frac{\|x-y\|^2}{2s^2}\Bigr). \tag{11}$$
Note that $\sigma^2$ will play no role in the step sizes of RFD due to its scale invariance (cf. Advantage 2.3).

Theorem C.1. Let $J \sim \mathcal{N}(\mu, C)$ where $C$ is the squared exponential covariance function (11). Then we have
$$\eta^* \frac{\nabla J(w)}{\|\nabla J(w)\|} = \operatorname*{argmin}_d\; \mathbb{E}[J(w - d) \mid J(w), \nabla J(w)]$$
with RFD step size
$$\eta^* = \frac{s^2\|\nabla J(w)\|}{\sqrt{\bigl(\tfrac{\mu - J(w)}{2}\bigr)^2 + s^2\|\nabla J(w)\|^2} + \tfrac{\mu - J(w)}{2}}.$$
Proof. The covariance function $C$ is of the form $C(h) = \sigma^2 e^{-h/s^2}$. By Equation (2),
$$\eta^* = \operatorname*{argmin}_\eta\; -\frac{C(\tfrac{\eta^2}{2})}{C(0)} - \eta\,\frac{C'(\tfrac{\eta^2}{2})}{C'(0)}\,\Theta, \qquad \text{where } \Theta = \frac{\|\nabla J(w)\|}{\mu - J(w)}.$$
We calculate
$$-\frac{C(\tfrac{\eta^2}{2})}{C(0)} - \eta\,\frac{C'(\tfrac{\eta^2}{2})}{C'(0)}\,\Theta = -e^{-\frac{\eta^2}{2s^2}}(1 + \eta\Theta).$$
This results in the first order condition
$$0 = \frac{d}{d\eta}\Bigl[-e^{-\frac{\eta^2}{2s^2}}(1 + \eta\Theta)\Bigr] = \frac{e^{-\frac{\eta^2}{2s^2}}}{s^2}\bigl(\eta^2\Theta + \eta - s^2\Theta\bigr).$$
Since the exponential can never be zero, we have to solve a quadratic equation. Its positive solution results in
$$\eta^*(\Theta) = \sqrt{\Bigl(\frac{1}{2\Theta}\Bigr)^2 + s^2} - \frac{1}{2\Theta}. \tag{12}$$
At this point we could stop, but the result is numerically unstable as it suffers from catastrophic cancellation. To solve this issue we set $x = \frac{1}{2\Theta}$ and reorder
$$\sqrt{x^2 + s^2} - x = \frac{(\sqrt{x^2 + s^2} - x)(\sqrt{x^2 + s^2} + x)}{\sqrt{x^2 + s^2} + x} = \frac{s^2}{\sqrt{x^2 + s^2} + x}.$$
Re-substituting $x = \frac{1}{2\Theta} = \frac{\mu - J(w)}{2\|\nabla J(w)\|}$, we finally get
$$\eta^* = \frac{s^2}{\sqrt{x^2 + s^2} + x} = \frac{s^2\|\nabla J(w)\|}{\sqrt{\bigl(\tfrac{\mu - J(w)}{2}\bigr)^2 + s^2\|\nabla J(w)\|^2} + \tfrac{\mu - J(w)}{2}}.$$
Proposition C.2 (A-RFD for the Squared Exponential Covariance). If $J$ is isotropic with squared exponential covariance (11), then the step size of A-RFD is given by
$$\hat\eta = \frac{s^2\|\nabla J(w)\|}{\mu - J(w)}.$$
Proof. By Definition 5.1 of A-RFD and $\Theta = \frac{\|\nabla J(w)\|}{\mu - J(w)}$ we have
$$\hat\eta(\Theta) = -\frac{C(0)}{C'(0)}\,\Theta = \frac{\sigma^2\exp(0)}{\tfrac{\sigma^2}{s^2}\exp(0)}\;\frac{\|\nabla J(w)\|}{\mu - J(w)} = \frac{s^2\|\nabla J(w)\|}{\mu - J(w)}.$$
Proposition C.3. If $J$ is isotropic with squared exponential covariance (11), then the RFD step sizes are strictly monotonically increasing in $\Theta$.
Proof. Since we know that $\Theta \to 0$ implies $\eta^* \sim \hat\eta \to 0$, strict monotonicity of $\eta^*$ in $\Theta$ is sufficient to show that $\eta^* \to 0$ also implies $\Theta \to 0$. So we take the derivative of (12), resulting in
$$\frac{d}{d\Theta}\eta^*(\Theta) = \frac{1}{2\Theta^2}\Bigl(1 - \frac{1}{\sqrt{1 + 4s^2\Theta^2}}\Bigr),$$
which is greater than zero for all $\Theta > 0$.

C.2 Rational quadratic

The rational quadratic covariance function is given by
$$C\Bigl(\frac{\|x-y\|^2}{2}\Bigr) = \sigma^2\Bigl(1 + \frac{\|x-y\|^2}{\beta s^2}\Bigr)^{-\beta/2}, \qquad \beta > 0.$$
(13) It can be viewed as a scale mixture of the squared exponential and converges to the squared exponential in the limit β [44, p. 87]. Theorem C.4 (Rational Quadratic). For J N(µ, C) where C is the rational quadratic covariance we have for Θ = J(w) µ J(w) 0 that the RFD step size is given by sΘ η + (1 + β)η2 + β The unique root of the polynomial in η can be found either directly with a formula for polynomials of third degree (e.g. using Cardano s method) or by bisection as it is contained in [0, 1/ 1 + β]. Proof. By Theorem 4.2 we have η = argmin η C η2 C(0) η C η2 for C(x) = σ2(1 + 2x βs2 ) β/2. We therefore need to minimize β/2 η 1 + η2 Substitute in η := η βs, then the first order condition is 0 != f ( η) = d d η 1 + η2 β/2 + p βs η 1 + η2 β/2 1 Θ Dividing both sides by βsΘ we get 0 = f ( η) βsΘ = β 2 (1 + η2) β 2 12 η 1 βsΘ + (1 + η2) β 2 2 h 1 + η2 ( β 2 + 1)2 η2i = (1 + η2) β 2 2 h β η 1 βsΘ(1 + η2) [1 η2(1 + β)] i sΘ η+(1+β) η2+ β Since Θ 0 and β > 0 all coefficients of the polynomial are positive except for the shift. The polynomial thus starts out at 1 in zero and only increases from there. Therefore there exists a unique positive critical point which is a minimum. At the point η = 1 + β the quadratic term is already larger than 1 so the polynomial is positive and we have passed the root. The minimum is therefore contained in the interval [0, 1 + β]. After finding the minimum in η we return to η by multiplication with βs. Proposition C.5 (A-RFD for the Rational Quadratic Covariance). If J is isotropic with rational quadratic covariance (13), then the step size of A-RFD is given by µ J(w) J(w) . Proof. C(x) = σ2(1 + 2x βs2 ) β/2 implies by Definition 5.1 of A-RFD and Θ = J(w) ˆη(Θ) = CJ(0) C J(0)Θ = σ2(1 + 0) β/2 s2 (1 + 0) β/2 1 J(x) µ J(x) = s2 J(x) Definition C.6. The Matérn model parametrized by s > 0, ν 0, σ2 0 is given by 2 = σ2 21 ν where Kν is the modified Bessel function. For ν = p + 1 2 with p N0, it can be simplified [cf. 44, sec. 4.2.1] to (2p k)! (p k)!k! 2 The Matérn model encompasses Rasmussen and Williams [44] the nugget effect for ν = 0 (independent randomness) the exponential model for ν = 1 2 (Ornstein-Uhlenbeck process) the squared exponential model for ν with the same scale s and variance σ2. The random functions induced by the Matérn model are a.s. ν -times differentiable Rasmussen and Williams [44], i.e. the smoothness of the model increases with increasing ν. While the exponential covariance model with ν = 1 2 results in a random function which is not yet differentiable, larger ν result in increasing differentiability. As differentiability starts with ν = 3 2 and we have a more explicit formula for ν = p + 1 2 the cases ν = 3 2 and ν = 5 2 are of particular interest. [F]or ν 7/2, in the absence of explicit prior knowledge about the existence of higher order derivatives, it is probably very hard from finite noisy training examples to distinguish between values of ν 7/2 (or even to distinguish between finite values of ν and ν , the smooth squared exponential, in this case) [44, p. 85]. Theorem C.7. Assuming J N(µ, C) is a random function where C is the Matérn covariance such that ν = p + 1 2 with p {1, 2}. Then the RFD step is given for Θ := J(w) µ J(w) 0 by 5 (1 ζ) + p 4 + (1 + ζ)2 2(1 + ζ) ζ := Proof. We define C(η) := C( η2 2 ), which implies C (η) = C η2 or conversely η C (η). (15) By Theorem 4.2, we need to calculate η = argmin η C( η2 2 ) C(0) η C ( η2 2 ) C (0) Θ. (16) Discarding σ w.l.o.g. 
due to scale invariance (Advantage 2.3), we have in the case p = 1 The derivative is then given by which implies using (15) C η2 3 s η (17) We therefore need to minimize (16) which is given by argmin η 1 + 3 s η η exp 3 s η Θ = argmin η 1 + ( 3 s + Θ)η exp The first order condition is 3 s + Θ)η ( which (divided by Θ and noting that the exponential can never be zero) is equivalent to 3 sΘ + 1)η 1 reordering for η implies It is also not difficult to see that this is the point where the derivative switches from negative to positive (i.e. a minimum). Let us now consider the case p = 2, i.e. 5 s η + 5 3s2 η2 exp which results in C (η) = 5 5 s η , i.e. by (15) C η2 5 s η . (18) We therefore need to minimize (16) which is given by 1 + 5 s η + 5 3s2 η2 η 1 + 5 s +Θ η+ 5 The first order condition results in 5 s + Θ η + 5 5 s + Θ + 2 5 Dividing everything by Θ and using ζ := 5 3sΘ we get 5 s η + ζ + 1 Taking a closer look at the sign changes of the derivative it becomes obvious, that the positive root is the minimum, i.e. 5 s η != (1 ζ) + p (1 ζ)2 + 4(1 + ζ) 2(1 + ζ) = (1 ζ) + p 4 + (1 + ζ)2 Proposition C.8 (A-RFD for the Matérn Covariance). If J is isotropic with Matérn covariance (14) such that ν = p + 1 2, then the step size of A-RFD for p {1, 2} is given by 3 J(x) µ J(x) 5 J(x) µ J(x) Proof. Noting Θ = J(x) µ J(x) , we have by Definition 5.1 of A-RFD for p = 1 ˆη = C(0) C (0)Θ (17) = s2 and in the case p = 2 ˆη = C(0) C (0)Θ (18) = 3s2 In this section we prove all the claims made in the main body. D.1 Section 2: Random function descent D.1.1 Formal RFD As we mentioned in a footnote at the definition of RFD, the fact that the parameters become random variables as they are selected by random gradients poses some mathematical challenges which would have been distracting to address in the main body. In following paragraphs leading up to Definition D.1 we introduce and discuss the probability theory required to provide a mathematically sound definition. For a fixed cost distribution PJ and any weight vectors w and w the conditional distribution E[J( w) | J(w), J(w)] is by its axiomatic definition a (J(w), J(w))-measurable random variable. By the factorization lemma [30, Cor. 1.9.7], there therefore exists a measurable function (j, g) 7 ϕw, w(j, g) such that the following equation holds almost surely ϕw,θ(J(w), J(w)) = E[J( w) | J(w), J(w)]. (19) Since it is possible to calculate ϕw, w explicitly in the Gaussian case (cf. G.1), the function ΦPJ : (Rd R Rd) R (w, j, g) 7 argmin w ϕw, w(j, g), which implements some tie-breaker rules for set valued argmin is measurable when J is Gaussian and its covariance function is sufficiently smooth. To prove measurability in the general case is a difficult problem of its own, which we do not attempt to solve here, since we would not utilize the conditional expectation outside of the Gaussian case anyway (cf. Section E.3). For deterministic w, we therefore have ΦPJ(w, J(w), J(w)) = argmin w ϕw, w(J(w), J(w)) (19) = argmin w E[J( w) | J(w), J(w)]. So if the parameter vectors wn were deterministic, our formal definition of RFD and our initial definition would coincide. But for random weights W (19) stops to hold in general3, i.e. ϕW, w(J(W), J(W)) = E[J( w) | J(W), J(W)]. If this equation does not need to hold, we similarly have in general ΦPJ(W, J(W), J(W)) = argmin w E[J( w) | J(W), J(W)]. So the following definition is not just a restatement of the original definition of RFD. Definition D.1 (Formal RFD). 
For a Gaussian random cost function J, we define the RFD algorithm with starting point W0 = w0 Rd by Wn+1 := ΦPJ(Wn, J(Wn), J(Wn)) This is what we effectively do in Theorem 4.2 under the additional isotropy assumption, where we calculate the argmin under the assumption that w is deterministic (i.e. we determine ΦPJ), before we plug-in the random variables Wn to obtain Wn+1. Similarly this is how the step size prescriptions of RFD actually work. We first assume deterministic weights and later plug the random variables into our formulas. For this reason, we avoided large letters indicating random variables for parameters w in the main body. D.1.2 Scale invariance Advantage 2.3 (Scale invariance). RFD is invariant to additive shifts and positive scaling of the cost J. RFD is also invariant with respect to transformations of the parameter input of J by differentiable bijections whose Jacobian is invertible everywhere (e.g. invertible linear maps). Before we get to the proof, let us quickly formulate the statement in mathematical terms. Let wn be the parameters selected optimizing J starting in w0 and wn the parameters selected by the same optimizer optimizing J starting in w0. If we apply affine linear scaling to cost J such that J(w) = a J(w) + b and start optimization in the same point, i.e. w0 = w0, then we expect a scale invariant optimizer to select If we scale inputs on the other hand (or more generally map them with a bijection φ), then we expect for J := J φ and starting point w0 = φ 1(w0), that this relationship is retained by a scale invariant optimizer, i.e. wn = φ 1(wn). Why do we use a different starting point? As an illustrating example, assume that φ maps miles into kilometers. Then J accepts miles, while J accepts kilometers. Then we have to map the initial starting point w0 of J measured in kilometers into miles w0. φ 1 is precisely this transformation from kilometers into miles. A scale invariant optimizer should retain this relation, i.e. no matter if the input is measured in miles or kilometers the same points are selected. Proof. The following proof will be split into three parts. The first two parts of the proof will address a more general audience and ignore the mathematical subtleties we discussed in Section D.1.1. In the third part we explain to the interested probabilists how to resolve these issues. 1. Invariance with regard to affine linear scaling Let J(w) := a J(w) + b where a > 0 and b R and assume w0 = w0. With the induction start given, we only require the induction step to prove wn = wn. 3E.g. consider the random variable W = argmin J 1J( w)>0 + argmax J 1J( w)<0. In this case, J(W) is much more informative of J( w) than J(w) at some deterministic w. For the induction step, we assume this equation holds up to n. Since φ(x) = ax + b is a measurable bijection, the sigma algebra4 generated by ( J(wn), J(wn)) = (φ J(wn), a J(wn)) is therefore equal to the sigma algebra generated by (J(wn), J(wn)). This implies wn+1 = argmin w E[ J(w) | J( wn), J( wn)] induction = argmin w E[ J(w) | J(wn), J(wn)] sigma alg. = argmin w E[ J(w) | J(wn), J(wn)] linearity = argmin w a E[J(w) | J(wn), J(wn)] + b monotonicity = argmin w E[J(w) | J(wn), J(wn)] def. = wn+1 Where we have used the linearity of the conditional expectation and the strict monotonicity of φ(x) = ax + b. 2. Invariance with regard to certain input bijections Let φ be a differentiable bijection whose jacobian is invertible everywhere and assume J := J φ. 
Since φ is a bijection, φ(M) is the domain of J whenever M is the domain of J. For a starting point w0 φ(M) we now assume w0 = φ 1(w0) M and are again going to prove the claim wn = φ 1(wn). by induction. Assume that we have this claim up to n. Then we have by induction J( wn) = J φ(φ 1(wn)) = J(wn) (21) and J( wn) = wn(J φ( wn)) = φ ( wn)( J)(φ( wn)) = φ ( wn) J(wn). Since φ ( wn) is invertible by assumption, the sigma algebras generated by ( J( wn), J( wn)) and J(wn), J(wn) are identical. But this results in the induction step wn+1 = argmin w M E[ J(w) | J( wn), J( wn)] sigma alg. = argmin w M E[ J(w) | J(wn), J(wn)] def. = argmin w M E[J φ(w) | J(wn), J(wn)] = φ 1 argmin θ φ(M) E[J(θ) | J(wn), J(wn)] where we simply optimize over θ = φ(w) instead of w and correct the argmin at the end. 3. Addressing the subtleties In equation (20) we have really proven for deterministic w ΦP J(w, J(w), J(w)) = ΦPJ(w, J(w), J(w)). But this implies with the induction assumption Wn = Wn Wn+1 = ΦP J( Wn, J( Wn), J( Wn)) ind. = ΦPJ(Wn, J(Wn), J(Wn)) = Wn+1. 4if you are unfamiliar with sigma algebras read them as information . Similarly we have proven in (22) that ΦP J φ 1(w), J(φ 1(w)), J(φ 1(w)) = φ 1 ΦPJ(w, J(w), J(w)) . By the induction assumption W = φ 1(Wn), this implies Wn+1 = ΦP J( Wn, J( Wn), J( Wn)) ind. = ΦP J φ 1(Wn), J(φ 1(Wn)), J(φ 1(Wn)) = φ 1 ΦPJ(Wn, J(Wn), J(Wn)) = φ 1(Wn+1). D.2 Section 4: Relation to gradient descent Lemma 4.1 (Explicit first order stochastic Taylor approximation). For J N(µ, C), the first order stochastic Taylor approximation is given by E[J(w d) | J(w), J(w)] = µ + C d 2 C(0) (J(w) µ) C d 2 C (0) d, J(w) . Proof. (J(w), J(w), J(w d)) is a Gaussian vector for which the conditional distribution is well known. It is only necessary to calculate the covariance matrix. The key ingredient here is to observe that J(w), 1J(w), . . . , d J(w) are all independent, trivializing matrix inversion. More formally, by Lemma G.2 we have Cov J(w) J(w) = C(0) C (0)Id d Cov J(w d), J(w) J(w) 2 ) C ( d 2 By Theorem G.1 we therefore know that E[J(w d) | J(w), J(w)] = µ + 2 ) C ( d 2 !T C(0) C (0)Id d 1 J(w) µ J(w) which immediately yields the claim. Theorem 4.2 (Explicit RFD). Let J N(µ, C), then RFD coincides with gradient descent wn+1 = wn η n J(wn) J(wn) , where the RFD step sizes are given by η n := argmin η R C(0) (J(wn) µ) η C η2 C (0) J(wn) . (1) Proof. The explicit version of RFD follows essentially by fixing the step size η = d and optimizing over the direction first. With Lemma 4.1 we have min d E[J(w d) | J(w), J(w)] = min η 0 min d: d =η µ + C η2 C(0) (J(w) µ) C η2 C (0) d, J(w) = min η 0 µ + C η2 C(0) (J(w) µ) C η2 max d: d =η d, J(w) C ( η2 2 ) C (0) 0 min d: d =η d, J(w) C ( η2 2 ) C (0) < 0. By Lemma G.3 and Corollary G.4 the maximizing or minimizing step direction is then given by d(η) = η J(w) Where it is typically to be expected, that we have a positive sign. Since that depends on the covariance though, we avoid this problem with the following argument: Since η only appears as η2 in the remaining equation, we can optimize over η R in the outer minimization instead of over η 0 to move the sign into the step size η and set without loss of generality d(η) = η J(w) Since d(η), J(w) = η J(w) the remaining outer minimization problem over the step size is then given by min η R C η2 C(0) (J(w) µ) η C η2 C (0) J(w) , Its minimizer is by definition the RFD step size as given in the Theorem. D.3 Section 5: RFD-step sizes Proposition D.2 (Tayloring the step size optimization problem). 
The second order Taylor approximation of the step size optimization problem qΘ(η) = C η2 C(0) η C η2 around zero is given by T2qΘ(η) = 1 ηΘ + η2 C (0) 2C(0) minimized by ˆη := argmin η T2qΘ(η) = C(0) C (0)Θ. Furthermore, the Taylor residual is bounded by q(η) T2q(η) η3c0 η with c0 = 1 2 max{supθ [0,1] |C (θ)|, |C (0)|}( 1 C(0) + 1 |C (0)|) < . Proof. Using the Taylor approximation with the mean value reminder for C, we get 2 = C(0) + C (0) η2 2 = C (0) + C (θ1) η2 for some θ1, θ2 [0, η2 2 ]. This implies q(η) 1 + C (0) | {z } =:T2qΘ(η) By the following error the optimistically defined T2qΘ(η) is really the second Taylor approximation (which can be confirmed manually, but we deduce it by arguing that its residual is in O(η3)). More specifically, q(η) T2q(η) η3 sup θ [0, η2 2 ] |C (θ)| 2C(0) η 4 + sup θ [0, η2 2 ] |C (θ)| 2|C (0)| Θ Lem. D.8 It is easy to see for J(w) < µ that T2q(η) is a convex parabola due to C (0) < 0. We thus have ˆη := argmin η T2qΘ(η) = C(0) C (0)Θ. Theorem D.3 (Details of Proposition 5.2). Let J N(µ, C) and assume there exists η0 > 0 such that the correlation for larger distances η η0 are bounded smaller than 1, i.e. C(η2/2) C(0) < ρ (0, 1). Then there exists K, Θ0 > 0 such that for all Θ < Θ0 ˆη(Θ) 1 + KΘ. In particular we have η (Θ) ˆη(Θ) as Θ 0 or equivalently as ˆη 0. Proof. This follows immediately from Lemma D.4, Lemma D.5 and Lemma D.6. Corollary 5.3. Assume η 0 implies Θ 0, the cost J is bounded, has continuous gradients and RFD converges to some point w . Then w is a critical point and the RFD step sizes η are asymptotically equal to ˆη. Proof. Assuming RFD converges, its step sizes η converge to zero. But this implies Θ 0 by assumption, i.e. Since J(w) is bounded, this implies J(w) 0 and by continuity the of the gradient, it is zero in its limit. Thus we converge to a stationary point. The asymptotic equality follows by Lemma D.4 and Lemma D.5, as we know η converges so we do not require the assumptions of Lemma D.6. D.3.1 Locating the Minimizer In the following we want to rule out locations for the RFD step size η by proving qΘ(η) > qΘ(ˆη) for a wide range of η. For this endeavour the relative position of the step size η relative to ˆη is a useful re-parametrization η := η(λ) = λˆη. Due to ˆη = C(0) C (0)Θ we obtain T2qΘ(η) = 1 ηΘ + η2 C(0) = 1 + λ( λ On the other hand we have for the bound |qΘ(η) T2qΘ(η)| λ3ˆη3c0 λ C(0) 4|C (0)| + 1 Θ Since ˆη = η(1) we thus obtain qΘ(η) qΘ(ˆη) ˆηΘ T2qΘ(η) |qΘ(η) T2qΘ(η)| T2qΘ(ˆη) |qΘ(ˆη) T2qΘ(ˆη)| ˆη2c0 h λ3 λ C(0) 4|C (0)| + 1 + C(0) 4|C (0)| + 1 i 2(1 λ)2 ˆη2c0 h λ3 λ C(0) 4|C (0)| + 1 + C(0) 4|C (0)| + 1 i . (23) This equation will be the basis of a number of lemmas ruling out various step sizes as minimizers. Lemma D.4 (Ruling out small step sizes). If the step size is (much) smaller than the asymptotic step size ˆη = ˆη(Θ), then it can not be a minimizer. More specifically ˆη [0, 1 c1Θ) = qΘ(η) > qΘ(ˆη) where c1 := 2 C(0) c0 C(0) 4|C (0)| + 1 < . Proof. Here we consider the case η ˆη, i.e. λ [0, 1]. By (23) we have qΘ(η) qΘ(ˆη) 2(1 λ)2 ˆη2c0 h λ3 λ C(0) 4|C (0)| + 1 + C(0) 4|C (0)| + 1 i 2(1 λ)2 2ˆη2c0 C(0) 4|C (0)| + 1 for which (1 λ)2 > 4ˆη2c0 C(0) 4|C (0)| + 1 is sufficient or equivalently c0 C(0) 4|C (0)| + 1 = 1 Θ 2 C(0) c0 C(0) 4|C (0)| + 1 | {z } =:c1 So for λ [0, 1 Θc1) we have qΘ(η) > qΘ(ˆη). Lemma D.5 (Ruling out medium sized step sizes as minimizer). For c2 = 2c1 and Θ Θ0 := 1 5c1 , we have η ˆη 1 + c2Θ, 1 c2Θ = qΘ(η) > qΘ(ˆη) Proof. Here we consider the case λ 1, i.e. η > ˆη. 
Again starting with (23) we get 2(1 λ)2 ˆη2c0 h λ3 λ C(0) 4|C (0)| + 1 + C(0) 4|C (0)| + 1 i 2(λ 1)2 2λ4ˆη2c0 C(0) 4|C (0)| + 1 λ 1 > 2λ2ˆη c0 C(0) 4|C (0)| + 1 = c1Θλ2 or equivalently λ 1 c1Θλ2 > 0 is sufficient. Note that this is a concave parabola in λ. So it is positive between its zeros which are characterized by c1Θλ2 λ + 1 = 0. They are thus given by λ1/2 = 1 1 4c1Θ So whenever λ (λ1, λ2) we have that qΘ(η) > qΘ(ˆη). In particular for 4c1Θ 1 or equivalently Θ 1 4c1 we have λ2 1 2c1Θ = 1 c2Θ To get a bound on λ1 note that the original equation was essentially λ 1 + c1Θλ2 with equality for λ = λ1, if Θ is reduced, the inequality remains, which implies that λ1 is decreasing with Θ. So assuming the inequality is satisfied for a particular λ e.g. λ = 2 which requires 2 1 + 2c1Θ Θ then we know that λ1 2 for all smaller Θ. This implies for Θ Θ0 = 1 5c1 λ1 = 1 + c1Θλ2 1 1 + 2c1 |{z} c2 Θ. Lemma D.6 (Ruling out large step sizes as minimizer). If there exists step size η0 > 0 such that the correlation is bounded by some ρ < 1, i.e. C(0) ρ (0, 1), for larger step sizes η η0, then there exist Θ0 > 0 such that for all Θ < Θ0 η ˆη 1 + c2Θ, = q(η) > q(ˆη), where c2 is the constant from Lemma D.5. Proof. The upper bound 1 c2Θ in Lemma D.5 is only due to the loss of precision of the Taylor approximation. To remove it, we take a closer look at the actual qΘ itself. We have the following bound for our asymptotic minimum Θ T2qΘ(ˆη) + |qΘ(ˆη) T2qΘ(ˆη)| 2 ˆη + ˆη3 c0 C(0) 4|C (0)| + 1 | {z } =:c3 Which means we have for qΘ(η) qΘ(ˆη) C (0) ˆη3c3 C (0) ˆη3c3 C (0) ˆη3c3 where we use the assumption that there exists ρ (0, 1) such that ρ C η2 C(0) for all η η0 and the fact that we only need to consider η 1 c2Θ (due to Lemma D.5) which allows a translation of η0 into some maximal Θ0. Note that ˆη Θ vanishes as Θ 0, so eventually the term (1 M) 1 Θ dominates. Selecting Θ0 small small enough is thus sufficient to cover everything that is not already covered by Lemma D.5. D.3.2 Technical bounds Lemma D.7 (Bound on the first derivative of the covariance). sup η 0 |C η2 Proof. Since we have Cov(Dv J(x), J(y)) = C x y 2 we have for a standardized vector v = 1 and x y = ηv by Cauchy-Schwarz 2 η| = | Cov(Dv J(x), J(y))| C.S. p Var(Dv J(x)) Var(J(y)) = p As the bound is independent of η this yields the claim. Lemma D.8 (Bound on the second derivative of the covariance). sup θ 0 |C (θ)| max n sup θ [0,1] |C (θ)|, |C (0)| o Proof. Note that Cov(Dv J(x), Dw J(y)) = C x y 2 2 x y, v x y, w C x y 2 Selecting v, w as orthonormal vectors (e.g. v = e1, w = e2) and x y := η(v + w) for some η > 0 results in x y 2 = 2η2 and thus by the Cauchy-Schwarz inequality C (η2)η2 = Cov(Dv J(x), Dw J(y)) C.S. p Var(Dv J(x)) Var(Dw J(y)) = p This implies the claim. D.4 Section 6: Stochastic loss Lemma D.9. The stochastic approximation errors ϵi(w) := ℓi(w) J(w) are identically distributed, centered random functions, which are independent conditional on f. In particular, E[ϵi(w)ϵj( w)] = E[ϵi(w)ϵj( w) | f] = 0 j = i. Proof. The ϵi are independent random functions conditional on f, since for any n N, any bounded measurable functions h and g E h h ϵi(w1), . . . , ϵi(wn) g ϵj(w1), . . . , ϵj(wn) | f i = h h ϵi(w1), . . . , ϵi(wn) E h g ϵj(w1), . . . , ϵj(wn) | f, Xi, ςi i ( ) = E g(ϵj(w1),...,ϵj(wn))|f = E h h ϵi(w1), . . . , ϵi(wn) | f i E h g ϵj(w1), . . . , ϵj(wn) | f i , where ( ) uses the fact that ϵj does not depend on the independent Xi, ςi. 
Since almost by definition E[ϵi | f] = E[ℓ( , Xi, Yi) | f] J( ) = 0, the stochastic approximation errors are thus uncorrelated E[ϵiϵj] = E h E[ϵiϵj | f] i = E h E[ϵi | f]E[ϵj | f] i = 0. Extension 6.2 (S-RFD). For loss J N(µ, C) and stochastic errors ϵi iid N(0, Cϵ) we have argmin d E[J(w d) | Lb(w), Lb(w)] = η (Θ) Lb(w) with the same step size function η as for RFD, but modified Θ Θ = C (0) C (0) + 1 b C ϵ(0) C(0) + 1 b Cϵ(0) C(0) Lb(w) µ Lb(w). Proof. Since ϵi are conditionally independent between each other and to J, as entire functions, the same holds true for ϵi. As all the mixed covariances disappear we have Cov Lb(w) Lb(w) = Cov J(w) J(w) i=1 Cov ϵi(w) ϵi(w) = C(0) C (0)Id d Cϵ(0) C ϵ(0)Id d b Cϵ(0) C (0) + 1 b C ϵ(0) Id d. by Lemma G.2. If you want to break up the first step we recommend considering individual entries of the covariance matrix to convince yourself that all the mixed covariances disappear. Together with the fact Cov J(w d), Lb(w) Lb(w) = Cov J(w d), J(w) J(w) i=1 Cov J(w d), ϵi(w) ϵi(w) 2 ) C ( d 2 The rest is analogous to Lemma 4.1 and Theorem 4.2, so we only sketch the remaining steps. Applying Theorem G.1 as in Lemma 4.1 we obtain a stochastic version of the stochastic Taylor approximation ( stochastic2 Taylor approximation perhaps?) E[J(w d) | Lb(w), Lb(w)] = µ + C d 2 b Cϵ(0)(Lb(w) µ) C d 2 b C ϵ(0) d, Lb(w) . Minimizing this subject to a constant step size as in Theorem 4.2 results in η = argmin η R b Cϵ(0)(Lb(w) µ) η C d 2 b C ϵ(0) Lb(w) = argmin η R C d 2 C(0) η C d 2 b C ϵ(0) C(0) C(0) + 1 b Cϵ(0) Lb(w) µ Lb(w), where we divided the term by C(0) C(0)+ 1 b Cϵ(0) 1 µ Lb(w) 0 to obtain the last equation. The claim follows by definition of η (Θ) and our redefinition of Θ. E Extensions In this section we present a few possible extensions to Theorem 4.2, which are all composable, i.e. it is possible to combine these extensions without any major problems (including S-RFD, i.e. Extension 6.2). E.1 Geometric anisotropy/Adaptive step sizes In this section, we discuss the generalization of isotropy to geometric anisotropies [50, p. 17], which provide good insights into the inner workings of adaptive learning rates (e.g. Ada Grad [14] and Adam [29]). Definition E.1 (Geometric Anisotropy). We say a random function J exhibits a geometric anisotropy , if there exists an invertible matrix A such that J(x) = g(Ax) for some isotropic random function g. This implies that the expectation of J is still constant (E[J(x)] = E[g(Ax)] = µ) and the covariance function of J is given by Cov(J(x), J(y)) = Cov(g(Ax), g(Ay)) = C A(x y) 2 = C x y 2 AT A 2 where Σ is the norm induced by x, y Σ := x, Σy for some strictly positive definite matrix Σ = AT A. Here (24) characterizes the set of random functions with a geometric anisotropy in the Gaussian case, because for an J with such a covariance we can always obtain an isotropic g by g(x) := J(A 1x). This is the whitening transformation we suggest looking for in order to ensure isotropy in the context of scale invariance (Section 2). An important observation is, that Theorem F.2 implies that J is still stationary, so the distribution of J is still invariant to translations. If stationarity is a problem, this is therefore not the solution. But geometric anisotropies are a beautiful model to explain preconditioning and adaptive step sizes. For this, we first determine the RFD steps. Extension E.2 (RFD steps under geometric anisotropy). 
Let J be a Gaussian random function which exhibits a geometric anisotropy A and is based on an isotropic random function g N(µ, C). Then the RFD steps are given by η Σ 1 J(w) Σ 1 J(w) Σ = argmin d E[J(w d) | J(w), J(w)] η = argmin η qΘ(η) where Θ = Σ 1 J(w) Σ Proofsketch. There are two ways to see this. Either we apply scale invariance (Advantage 2.3) directly to translate the steps on g into steps on J. Alternatively one can manually retrace the steps of the proof. Details in Subsection E.1.1 The step direction is therefore and Σ 1 acts as a preconditioner. So how would one obtain Σ? As it turns out the following holds true (by Lemma G.2) E[ J(w) J(w)T ] = AT E[ g(w) g(w)T ]A = AT ( C (0)I)A = C (0)Σ In their proposal of the first adaptive method, Ada Grad, Duchi et al. [14] suggest to use the matrix k=1 J(wk) J(wk)T , which is basically already looking like an estimation method of Σ. They then restrict themselves to diag(Gt) due to the computational costs of a full matrix inversion. This results in entry-wise ( adaptive ) learning rates. Later adaptive methods like RMSProp [22], Ada Delta [57] and in particular Adam [29] replace this sum with an exponential mean estimate, i.e. in the case of Adam the decay rate β2 is used to get an exponential moving average vt = β2vt 1 + (1 β2) diag( J(wt) J(wt)T ) = β2vt 1 + (1 β2)( J(wt))2. They then take the expectation E[vt] = E h (1 β2) k=1 βt k 2 J(wk)2i = E h (1 β2) k=1 βt k 2 J(wk)2i = (1 βt 2) E[ J(xt)2] | {z } diag(Σ) So ˆvt = vt/(1 βt 2) in the Adam optimizer is essentially an estimator for diag(Σ). It is noteworthy, that Kingma and Ba [29] already used the expectation symbol. This is despite the fact, that they did not yet model the optimization objective J as a random function. We can not yet explain why they then use the square root of their estimate diag(Σ) 1/2 instead of diag(Σ) 1 itself. This might have something to do with the fact that the estimation of Gt happens online and the J(wk) are therefore highly correlated. Another reason might be that the inverse of an estimator has different properties than the estimator itself. Finally, the fact that only the diagonal is used might also be the reason, if the preconditioner diag(Σ) 1/2 is simply better when we restrict ourselves to diagonal matrices. E.1.1 Proof of Extension E.2 Since the application of scale invariance provides no intuition, we provide a proof which retraces some of the steps of the original proof. Recall, that for an isotropic random function g we have the stochastic Taylor approximation E[g(w d) | g(x), g(x)] = µ + C d 2 C(0) (g(w) µ) + C d 2 C (0) d, g(w) This implies for a random function with geometric anisotropy J(w) = g(Aw) that E[J(w d) | J(w), J(w)] = E[g(A(w d)) | g(Aw), g(Aw)] = µ + C Ad 2 C(0) (g(Aw) µ) C Ad 2 C (0) Ad, g(Aw) = µ + C d 2 Σ 2 C(0) (J(w) µ) C d 2 Σ 2 C (0) d, AT g(Aw) | {z } = J(w) with Σ := AT A. As in the original proof, we now optimize over the direction first, while keeping the step size constant, although we now fix the step size with regard to the norm Σ (which basically means that we still do the optimization in the isotropic space). Note that max d d, J(x) s.t. d Σ = η is equivalent to max d d, Σ 1 J(x) Σ s.t. d Σ = η which is solved by η Σ 1 J(x) Σ 1 J(x) Σ The remainder of the proof is exactly the same as in the original. 
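To summarize the preceding discussion in code, the following minimal numpy sketch computes the anisotropic step direction of Extension E.2 and an Adam-style moving average estimate of diag(Σ). The helper names are illustrative, and a full matrix Σ⁻¹ is assumed to be available, which is precisely what the adaptive methods discussed above avoid by restricting themselves to the diagonal.

```python
import numpy as np

def anisotropic_rfd_direction(grad, Sigma_inv):
    """Step direction of Extension E.2: Sigma^{-1} grad, normalized in the Sigma^{-1} norm."""
    precond = Sigma_inv @ grad
    norm = np.sqrt(grad @ precond)   # ||grad||_{Sigma^{-1}} = sqrt(grad^T Sigma^{-1} grad)
    return precond / norm

def update_v(v, grad, beta2=0.999):
    """Adam-style exponential moving average of squared gradients.

    Up to the bias correction 1/(1 - beta2**t) and the constant factor -C'(0),
    this estimates diag(Sigma), cf. the discussion of adaptive methods above.
    """
    return beta2 * v + (1.0 - beta2) * grad ** 2

# One hypothetical update would then read
#   w_new = w - eta_star * anisotropic_rfd_direction(grad, Sigma_inv),
# where eta_star is the isotropic RFD step size evaluated at
#   Theta = ||grad||_{Sigma^{-1}} / (mu - J(w)).
```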
E.2 Conservative RFD In the first paragraph of Section 2 we motivated the relation between RFD and classical optimization with the observation, that gradient descent is the minimizer of a regularized first order Taylor approximation 1 L J(w) = argmin d T[J(w d) | J(w), J(w)] + L This regularized Taylor approximation is in fact an upper bound on our function under the Lsmoothness assumption [38], i.e. J(w d) T[J(w d) | J(w), J(w)] + L An improvement on of this upper bound compared to J(x) therefore guarantees an improvement of the loss. This guarantee was lost with the conditional expectation (on purpose, as we wanted to consider the average case). Losing this guarantee also makes convergence proofs more difficult as they typically make use of this improvement. In view of the confidence intervals of Figure 1, it is natural to ask for a similar upper bound in the random setting, where this can only be the top of an confidence interval. This is provided in the following theorem Lemma E.3 (An γ-upper bound). We have P J(w d) E[J(w d) | J(w), J(w)] + ργ( d 2) γ for ργ(η2) := Φ 1(γ)σ(η2) with σ2(η2) := C(0) C η2 where Φ is the cumulative distribution function (cdf) of the standard normal distribution. Proof. Note that the conditional variance is with the usual argument about the covariance matrices (cf. the proof of Thoerem 4.2) using Lemma G.2 and an application of Theorem G.1 given by σ2( w 2) := Var[J(w d) | J(w), J(w)] = C(0) C d 2 Since the conditional distribution is normal (by Theorem G.1), we have J(w d) E[J(w d) | J(w), J(w)] σ( w 2) N(0, 1). But this implies the claim P J(w d) E[J(w d) | J(w), J(w)] + ργ( d 2) = P J(w d) E[J(w d) | J(w), J(w)] σ( w 2) Φ 1(γ) = Φ(Φ 1(γ)) = γ. To avoid the Gaussian assumption, one could apply the Markov inequality instead, or another applicable concentration inequality. Using this upper bound, we obtain a natural conservative extension of RFD Extension E.4 (γ-conservative RFD). Let J N(µ, C) and ργ(η2) = Φ 1(γ)σ(η2), where σ is the conditional standard deviation as defined in Lemma E.3. Then the conservative RFD step direction is given by J(w) = argmin d E[J(w d) | J(w), J(w)] + ργ( d 2) and the γ-conservative RFD step size is given by η = argmin η C(0) (J(w) µ) η C η2 C (0) J(w) + ργ(η2). Proof. The proof is the same as in Theorem 4.2 with Lemma 4.1 replaced by Lemma E.3. Taking multiple steps should generally have an averaging effect, so we expect faster convergence for almost risk neutral minimization of the conditional expectation (i.e. γ 1 2). Here γ is a natural parameter to vary conservatism. In a software implementation it might be a good idea to call this parameter conservatism and rescale it to be in [0, 1] instead of [ 1 2, 1]. But formulas look cleaner with γ. In Bayesian optimization it is much more common to reverse this approach and minimize a lower confidence bound ( conservatism < 0 or γ < 1 2) in order to encourage exploration. But since RFD is forgetful, this is not a good idea for RFD. Remark E.5 (Conservative RFD coincides asymptotically with RFD in high dimension). While conservative RFD might seem like a good approach to fix the instability of RFD under the isotropy assumption on some optimization problems, the variance generally vanishes in high dimension [see 5] and conservative RFD coincides asymptotically with RFD. We therefore believe that the underlying issue is not an overly risk-affine algorithm but rather that distributional assumptions, in particular the stationarity assumption, are violated when instabilities occur (cf. 
Section F). Nevertheless, conservative RFD might be a good approach for lower dimensional, risk-sensitive applications. E.3 Beyond the Gaussian assumption In this section we sketch how the extension beyond the Gaussian case using the best linear unbiased estimator BLUE [e.g. 27, ch. 7] works. For this we recapitulate what a BLUE is. A linear estimator ˆY of Y using X1, . . . , Xn is of the form ˆY span{X1, . . . , Xn} + R. The set of unbiased linear estimators is defined as LUE = LUE [Y | X1, . . . , Xn] = { ˆY span{X1, . . . , Xn} + R : E[ ˆY ] = E[Y ]} (25) = { ˆY + E[Y ] : ˆY span{X1 E[X1], . . . , Xn E[Xn]}}. And the BLUE is the best linear unbiased estimator, i.e. BLUE[Y | X1, . . . , Xn] := argmin ˆY LUE E[ ˆY Y 2]. (26) Other risk functions to minimize are possible, but this is the usual one. Lemma E.6. If X, Y1, . . . , Yn are multivariate normal distributed, then we have BLUE[Y | X1, . . . , Xn] = E[Y | X1, . . . , Xn] = argmin ˆY {f(X1,...,Xn):f meas.} E[ Y ˆY 2] . Proof. We observe that the conditional expectation of Gaussian random variables is linear (Theorem G.1). So as a linear function its L2 risk must be larger or equal to that of the BLUE. And as an L2 projection [30, Cor. 8.17] the conditional expectation was already optimal. If we now replace the conditional expectation with the BLUE, then all our theory remains the same because the result in Theorem G.1 remains the BLUE for general distributions [27]. Instead of minimizing min d E[J(w d) | J(w), J(w)] we can therefore always minimize min d BLUE[J(w d) | J(w), J(w)] without the Gaussian assumption and all our results can be translated to this case. The reader only needs to replace all mentions of Theorem G.1 with the BLUE equivalent and replace all idependence claims with uncorrelated . F Input invariance In this section we generalize the notion of isotropy to non-stationary isotropy and discuss why we believe this generalization is necessary. Recall that we motivated isotropy as an invariant distribution with regard to isometric transformations of the input. In particular its distribution stays invariant with regard to translations (also known as stationarity), which we do not believe plausible for cost functions, because the cost at zero J(0) behaves fundamentally different from the cost of any other parameter vector. In the following we will therefore generalize this notion to general input invariant distributions. And we will discuss their applicability to machine learning after we characterize the named categories. Definition F.1 (Input Invariance). A random function f is Φ-input invariant, if 5 Pf = Pf φ φ Φ. For certain sets of Φ we give these Φ-input invariant distributions names 5The input to a random function is somewhat ambiguous since it is a random variable, i.e. function from the probability space Ωinto function space, so its first input should be ω Ω. Formally, the definition should therefore be: For all measurable sets of functions A Pf(φ 1 (A)) = Pf(A) φ Φ where φ : f 7 f φ denotes the pullback. But this is less helpful for an intuitive understanding. If Φ is the set of isometries, we call f (stationary) isotropic. If Φ is the set of translations, we call f stationary. If Φ is the set of linear isometries, we call f non-stationary isotropic. We further say a random function f is n-weakly Φ-input invariant, if for all φ Φ, all k n and all xi E[f(φ(x1)) f(φ(xk))] = E[f(x1) f(xk)]. 
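The following minimal sketch illustrates the distinction numerically: a kernel of the non-stationary isotropic form is invariant under a random linear isometry but not under translations. The particular kernel is chosen purely for illustration and anticipates the linear regression example of Section F.1 below.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def k(x, y):
    # non-stationary isotropic kernel of the form kappa(||x||^2/2, ||y||^2/2, <x, y>);
    # this particular choice (const + 4 <x, y>) anticipates Section F.1
    return 1.0 + 4.0 * (x @ y)

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))      # random linear isometry (orthogonal Q)
x, y, c = (rng.standard_normal(d) for _ in range(3))  # two inputs and a translation vector

print(np.isclose(k(Q @ x, Q @ y), k(x, y)))  # True: invariant under rotations/reflections
print(np.isclose(k(x + c, y + c), k(x, y)))  # generally False: not translation invariant
```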
Since second moments fully determine Gaussian distributions, 2-weakly input invariance is special, because it is equivalent to full input invariance in the Gaussian case. So an omitted n equals 2. Weakly isometry invariant naturally becomes weakly isotropic , etc. While stationary and stationary isotropic random functions are well known [e.g. 44, 1], we are not aware of research on non-stationary isotropy although we doubt the concept is new. It turns out that the different notions of input isometry have simple characterizations in terms of the covariance functions. We present these in Theorem F.2 of which the stationary isotropic and stationary case are already well known. Theorem F.2 (Characterization of Weak Input Invariances). Let f : Rd R be a random function, then f is 1. weakly stationary, if and only if there exists µ R and function C : Rd R such that for all x, y µf(x) = µ, Cf(x, y) = C(x y). 2. weakly non-stationary isotropic, if and only if there exist functions µ : R 0 R and κ : D R with D = {λ R2 0 R : |λ3| 2 λ1λ2} R3. such that for all x, y µf(x) = µ x 2 Cf(x, y) = κ x 2 3. weakly stationary isotropic, if and only if there exists µ R and a function C : R 0 R such that for all x, y µf(x) = µ, Cf(x, y) = C x y 2 Proof. The proof essentially follow as a corollary from a characterization of isometries (Proposition F.4). For details see Subsec F.2. Non-stationary isotropy is therefore a generalization of stationary isotropy. It allows the zero parameter vector to have special meaning because the distribution is only invariant to linear isometries (i.e. rotations and reflections) which keep the zero in place. It is important to highlight, that a geometric anisotropy (Section E.1) retains stationarity, while breaking non-stationary isotropy. A similar geometric generalization could also be applied to nonstationary isotropy. Another important observation is the fact, that non-stationary isotropy coincides with stationary isotropy on the sphere. I.e. when x and y are constant, the function only depends on x y and the mean is also constant. In other words, we have stationary isotropy on the sphere. Isotropy might therefore get by as an assumption in machine learning, as parameters are typically initialized on the sphere. This is because Glorot intialization [18] samples parameter entries independently, so their squared norm i=1 (w(i))2 is a sum of independent random variables which are normalized such that a law of large numbers applies. Up to small variance their lengths are therefore all the same, and are placed on a sphere at this radius. If we leave this sphere, this equivalence stops being true. Weight normalization [46], batch normalization [25], weight decay [e.g. 19] or equivalently L2 regularization, etc. might all contribute to keep this assumption intact. But in the following section we will see, that even simple linear regressions considered by researches investigating the average case behavior on quadratic functions [e.g. 58, 43, 33, 12, 9, 40, 41], require non-stationary isotropy. Moreover the covariance kernels suggested by investigations into random neuronal networks [e.g. 54, 8] are also non-stationary isotropic but not stationary isotropic. F.1 Random linear regression In this section, we determine the distribution of the cost function induced by a simple linear regression. For this we define the mean squared sample loss ℓi(w) = (Y fw(X))2, where the random data X is mapped by the true relationship f to labels Y = f(X) and fw(x) = x, w is a linear model. 
If the true relationship f is also a random linear function f(x) = θ, x with random signal θ N(0, I) independent of input X N(0, I), then the cost function is given by J(w) = E[ℓi(w) | f] = E[ θ w, X 2 | θ] = (θ w)T E[XXT ](θ w) Lemma F.3. The expectation and covariance of J are given by E[J(w)] = const. + w 2 Cov(J(w), J( w)) = const. + 4 w, w In particular, the cost J is non-stationary isotropic, but not stationary isotropic. Proof. Its expectation is given by E[J(w)] = E[ θ w 2] = E[ θ 2] 2 E[θ] |{z} =0 = const. + w 2 In particular it is not constant, but dependent on w 2, which means that we do not have stationary isotropy. But there is still hope for non-stationary isotropy, and this is essentially true as can be seen by calculating Cov(J(w), J( w)) = E h (J(w) E[J(w)])(J( w) E[J( w)]) i = E h ( θ 2 E θ 2 2 θ, w )( θ 2 E θ 2 2 θ, w ) i = Var( θ 2 E θ 2) 2E[( θ 2 E θ 2) θ, w ] 2E[( θ 2 E θ 2) θ, w ] + 4w T E[θθT ] w = Var( θ 2 E θ 2) + 4 w, w = const. + 4 w, w because the terms in the middle are zero, e.g. E[( θ 2 E θ 2) θ, w ] = E[ θ 2θ] | {z } =0 , w E θ 2 E[θ] |{z} =0 where the entries of E[ θ 2θ] are zero, because of independence and first moments being zero and third moments being zero. F.2 Proof of Theorem F.2 Proposition F.4 (Characterizing isometries). Let X be a vectorspace and xi, yi X for i = 1, . . . , n, then the following pairs of statements are equivalent 1. (a) xi xj = yi yj for all i, j (b) there exists a translation φ with φ(xi) = yi for all i. In the remainder we further assume X to be a Hilbertspace, 2. (a) xi = yi and xi xj = yi yj for all i, j (b) there exists a linear isometry φ with φ(xi) = yi for all i. 3. (a) xi xj = yi yj for all i, j (b) there exists an (affine) isometry φ with φ(xi) = yi for all i. Proof. (1a) (1b): we define φ(x) := x + (y0 x0), which implies φ(xi) = xi x0 + y0 = (yi y0) + y0 = yi. (1b) (1a): Let φ(x) = x + c for some c. Then we immediately have yi yj = φ(xi) φ(xj) = xi + c (xj + c) = xi xj. (2a) (2b): By the polarization formula, for all i, j xi, xj = xi 2 + yi 2 xi xj 2 2 = yi, yj . We apply the Gram-Schmidt orthonormalization procedure to both xi and yi such that Ukn = span(u1, . . . , ukn) = span(x1, . . . , xn) for orthonormal ui where we skip xm if it is already in Ukm 1 (resulting in km = km 1), and similarly Vkn = span(v1, . . . , vkn) = span(y1, . . . , yn). Since this procedure only uses scalar products, we inductively get xk, uj = yk, vj k, j We now extend ui and vi to orthonormal basis of X and define the linear mapping by its behavior on the basis elements φ : ui 7 vi. Mapping an orthonormal basis to an orthonormal basis is an isometry and we have φ(xk) = φ k X j=1 xk, uj uj j=1 xk, uj φ(uj) = j=1 yk, vj vj (2b) (2a): Isometries preserve distances by definition. This implies xi xj = yi yj . And linear functions map 0 to 0, so we have xi = xi 0 = φ(xi) φ(0) = yi . (3a) (3b): We define xi = xi x0 and similarly for y. In particular, x0 = y0 = 0. Since xi and yi satisfy the requirements of 2, there exists a linear isometry φ with φ( xi) = yi. Then the isometry φ : x 7 φ(x x0) + y0 does the job. (3b) (3a): This is precisely the distance preserving property of Isometries. Theorem F.2 (Characterization of Weak Input Invariances). Let f : Rd R be a random function, then f is 1. weakly stationary, if and only if there exists µ R and function C : Rd R such that for all x, y µf(x) = µ, Cf(x, y) = C(x y). 2. weakly non-stationary isotropic, if and only if there exist functions µ : R 0 R and κ : D R with D = {λ R2 0 R : |λ3| 2 λ1λ2} R3. 
such that for all x, y µf(x) = µ x 2 Cf(x, y) = κ x 2 3. weakly stationary isotropic, if and only if there exists µ R and a function C : R 0 R such that for all x, y µf(x) = µ, Cf(x, y) = C x y 2 Proof. Starting from the mean and covariance function it is easy to check 2-weak non-stationary isotropy. So we only need to check the other direction. The proof is essentially an application of Prop. F.4. For brevity (and since the other two results are well known), we will only prove the weakly non-stationary isotropic case (the other two cases can be proven with minor adjustments to the proof). Without loss of generality, we will find the slightly different representation E[fd(x)] = µ( x ) and Cfd(x, y) = κ( x , y , x, y ), where the domain of κ is given by D = {λ R2 0 R : |λ3| λ1λ2}. The representation of the theorem is then equivalent by a change to µ(λ) := µ λ2 2 and κ(λ1, λ2, λ2) := κ λ2 1 2 , λ2 2 2 , λ3 . First we want to find µ. Let v be some vector (w.l.o.g. v = 1). Then we define µ(r) := E[fd(rv)] Now we need to show that this definition of µ is an appropriate mean function. For this choose any x X. Then for r = x there exists by Prop. F.4 (2.) a non-stationary isometry φ such that φ(x) = rv (we use n = 1). With 1-weak non-stationary isotropy of fd this implies E[fd(x)] = E[fd(rv)] = µ(r) = µ( x ). Next we need to define κ(rx, ry, rxy). For this, choose two orthonormal vectors v, w. For every r = (rx, ry, rxy) D we define x (r) = rxv y (r) = rxy r2y r2 xy r2x w. Where r D ensures |rxy| rxry and thus r2 y r2 xy r2x 0. Then we have x (r) = rx, y (r) = ry, and x (r), x (y) = rxy, (27) and define κ(rx, ry, rxy) := Cfd(x (r), y (r)). Again, we need to show that this kernel does the job. For this, choose any x, y X. For r := ( x , y , x, y ), which is in D by the Cauchy-Schwarz inequality, the induced x (r) and y (r) satisfy by (27) x (r) = x , y (r) = y and x (r) y (r) = x y . By Prop. F.4 (2.) there therefore exists an isometry φ such that φ(x) = x (r) and φ(y) = y (r). By 2-weak input isotropy of fd we conclude Cfd(x, y) isotrop. = Cfd(x (r), y (r)) def. = κ x , y , x, y . G Technical G.1 Conditional Gaussian distribution For the following well known result we found a tidy proof giving insight into the reason it is true, so we wrote it down for your convenience but do not even expect this particular proof to be new. Theorem G.1 (Conditional Gaussian distribution). Let X N(µ, Σ) be a multivariate Gaussian vector where the covariance matrix is a block matrix of the form and Σ = Σ11 Σ12 Σ21 Σ22 then assuming Σ11 is invertible, the conditional distribution of X2 given X1 is X2 | X1 N(µ2|1, Σ2|1), with conditional mean and variance µ2|1 := µ2 + Σ21Σ 1 11 (X1 µ1) Σ2|1 := Σ22 Σ21Σ 1 11 Σ12. Proof. Let X := X µ be the centered version of X. There exists some lower triangular matrix L (even if Σ is only positive semidefinite only not uniquely) such that Σ = LLT (i.e. the Cholesky Decomposition). We can then write without loss of generality X µ =: X1 X2 = L11 0 L21 L22 with independent standard normal Yi, i.e. Y N(0, I). Since Σ11 is invertible, so is L11 and therefore the map from Y1 to X1. Conditioning on X1 is therefore equivalent to conditioning on Y1. But we have X2 = µ2 + X2 = µ2 + L21Y1 | {z } conditional expectation + L22Y2 | {z } conditional distribution So it follows that X2 | X1 N(µ2|1, Σ2|1) with µ2|1 := µ2 + L21Y1 Σ2|1 := L22LT 22. What is left to do, is find a representation for the Lij using the block matrices of Σ. 
For this note Σ = LLT = L11LT 11 L11LT 21 L21LT 11 L22LT 22 + L21LT 21 This implies L21Y1 = (L21LT 11L T 11 )(L 1 11 X1) = Σ21Σ 1 11 (X1 µ1) so we have the desired conditional expectation, and finally L22LT 22 = Σ22 L21LT 21 = Σ22 L21(LT 11 | {z } =Σ21 L T 11 )(L 1 11 | {z } =Σ 1 11 L11)LT 21 | {z } =Σ12 G.2 Covariance of derivatives By Swapping integration and differentiation we have for a centered random function f Cov( xif(x), f(y)) = E[ xif(x)f(y)] = xi E[f(x)f(y)] = xi Cf(x, y) So the covariance of a derivative of f with f is equal to a partial derivative of the covariance function [more details in 1]. Similarly other covariances can be calculated, e.g. Cov( xif(x), yif(y)) = xi yi Cf(x, y). For this reason the derivatives of the covariance function are interesting as they represent the covariance of derivatives. Applying this observation to isotropic covariance functions Cov(f(x), f(y)) = C x y 2 Lemma G.2 (Covariance of derivatives). Let f N(µ, C) and d = x y, then Cov f(y) jf(y) f(x) C( d 2 2 ) C ( d 2 if(x) C ( d 2 2 ) d, ei h C ( d 2 2 ) d, ej d, ei + C ( d 2 2 ) ej, ei i G.3 Constrained linear optimization Let U be a vectorspace. We define the projection of a vector w onto U by PU(w) := argmin v U v w 2 Lemma G.3 (Constrained maximiziation of scalar products). For a linear subspace U Rd, we have max v U v =λ v, w = λ PU(w) (28) argmax v U v =λ v, w = λ PU(w) Before we get to the proof let us note that this immediately results in the following corollary about minimization. Corollary G.4 (Constrained minimization of scalar products). min v U v =λ v, w = λ PU(w) (30) argmin v U v =λ v, w = λ PU(w) Proof of Corollary G.4. The trick is to move one outside from w = ( w) min v U v =λ v, w = max v U v =λ v, w = λ PU(w) where we have used in the last equation that the projection is linear (we can move the minus sign out) and the norm removes the inner minus sign. The argmin argument is similar. Proof of Lemma G.3. Step 1: We claim that v = λ PU(w) results in the value v , w = λ PU(w) . For this we consider PU(w) = argmin v U v w 2 | {z } = v 2 2 v,w + w 2 (32) = argmin v U v 2 2 v, w | {z } =:f(v) we know that t 7 f(t w U) is minimized at t = 1 by the definition of w U. The first order condition implies dt = 2t PU(w) 2 2 PU(w), w 1 = t = PU(w), w Multiplying both sides by λ PU(w) finishes this step λ PU(w) = D λ PU(w) PU(w) | {z } =v , w E . (33) Step 2: By (33), we know that we can achieve the value we claim to be the maximum (and know the location v to do so). So if we prove that we can not exceed this value, then it is a maximum and v is the argmax. This would finish the proof. What remains to be shown is therefore v, w λ PU(w) v U : v = λ. Let v U with v = λ. Then for any µ R we can plug µv into f from (32) to get µ2λ2 2µ v, w = f(µv) (32) f(PUw) = PU(w) 2 2 PUw, w = PUw, w where the last equation follows from (33) with λ = PUw . Reordering we get for all µ PUw, w + µ2λ2 2µ v, w We now select µ = PUw λ > 0 and divide both sides by µ to get 2 v, w DPU(w) µ | {z } =v (33) = λ PUw +λ PU(w) = 2λ PUw Dividing both sides by 2 yields the claim. Neur IPS Paper Checklist Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. 
A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper. 2. Limitations Question: Does the paper discuss the limitations of the work performed by the authors? Answer: [Yes] Justification: Section 8 recapitulates all the assumptions made and highlights possible generalizations. Guidelines: The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations. 3. Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof? Answer: [Yes] Justification: Section D provides all proofs of statements made in the main body and follows an identical structure for easier cross reference. Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. 
Theorems and Lemmas that the proof relies upon should be properly referenced. 4. Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)? Answer: [Yes] Justification: While we do not have the necessary space to discuss all implementation details, we believe that we have discussed all relevant insights necessary to reproduce our results. Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results. 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code of our algorithm is fairly well commented and passes pylint and flake8 linting. The code to perform the benchmarks is less polished and we have not seeded the covariance estimation sampling process, but, since we obtained similar results over multiple runs (Section A.1.1), we are confident that our results are reproducible. Guidelines: The answer NA means that paper does not include experiments requiring code. 
- Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
- The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
- The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
- The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
- At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
- Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: Training on the MNIST data set is fairly standard, so we feel that our brief outline is sufficient.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
- The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: Quantiles are plotted in Figure 3 and the figures of Section A, and we provide a histogram of asymptotic learning rates resulting from multiple covariance estimation runs (Section A.1.1).

Guidelines:
- The answer NA means that the paper does not include experiments.
- The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
- The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
- The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.).
- The assumptions made should be given (e.g., Normally distributed errors).
- It should be clear whether the error bar is the standard deviation or the standard error of the mean.
- It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
- For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
- If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [No]

Justification: We have not kept careful track of resources used, as MNIST is a fairly small dataset by machine learning standards. We believe the compute was comparatively negligible, although the use of multiple GPUs was helpful for repeating experiments in parallel.

Guidelines:
- The answer NA means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]

Justification: See broader impacts.

Guidelines:
- The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [NA]

Justification: Since we focus on optimization theory, our work has no societal impact beyond the advancement of the field of Machine Learning, which may have many societal consequences, but none we feel necessary to address.

Guidelines:
- The answer NA means that there is no societal impact of the work performed.
- If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Guidelines:
- The answer NA means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We cite the datasets and models used.

Guidelines:
- The answer NA means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Guidelines:
- The answer NA means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Guidelines:
- The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.