# RSN: Randomized Subspace Newton

Robert M. Gower (LTCI, Télécom ParisTech, IPP, France; gowerrobert@gmail.com) · Dmitry Kovalev (KAUST, Saudi Arabia; dmitry.kovalev@kaust.edu.sa) · Felix Lieder (Heinrich-Heine-Universität Düsseldorf, Germany; lieder@opt.uni-duesseldorf.de) · Peter Richtárik (KAUST, Saudi Arabia and MIPT, Russia; peter.richtarik@kaust.edu.sa)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract. We develop a randomized Newton method capable of solving learning problems with huge dimensional feature spaces, which is a common setting in applications such as medical imaging, genomics and seismology. Our method leverages randomized sketching in a new way, by finding the Newton direction constrained to the space spanned by a random sketch. We develop a simple global linear convergence theory that holds for practically all sketching techniques, which gives practitioners the freedom to design custom sketching approaches suitable for particular applications. We perform numerical experiments which demonstrate the efficiency of our method as compared to accelerated gradient descent and the full Newton method. Our method can be seen as a refinement and randomized extension of the results of Karimireddy, Stich, and Jaggi [18].

1 Introduction

In this paper we are interested in unconstrained optimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x), \qquad (1)$$

where $f : \mathbb{R}^d \to \mathbb{R}$ is a sufficiently well behaved function, in the large dimensional setting, i.e., when $d$ is very large.

Large dimensional optimization problems are becoming ever more common in applications. Indeed, $d$ often stands for the dimensionality of captured data, and due to fast-paced advances in technology, this only keeps growing. One of the key driving forces behind this is the rapid increase in the resolution of sensors used in medicine [19], genomics [26, 8], seismology [2] and weather forecasting [1]. To make predictions using such high dimensional data, one typically needs to solve an optimization problem such as (1). The traditional off-the-shelf solvers for such problems are based on Newton's method, but in this large dimensional setting they cannot be applied due to the high memory footprint and computational cost of solving the Newton system. We offer a new solution to this, by iteratively performing Newton steps in random subspaces of sufficiently low dimension. The resulting randomized Newton method need only solve small, randomly compressed Newton systems and can be applied to solving (1) no matter how big the dimension $d$.

1.1 Background and contributions

Newton's method dates back to even before Newton, making an earlier appearance in the work of the Persian astronomer and mathematician al-Kashi, who used it in 1427 in his Key to Arithmetic [33]. In the 1980s Newton's method became the workhorse of nonlinear optimization methods such as trust region [9], augmented Lagrangian [4] and interior point methods. The research into interior point methods culminated with Nesterov and Nemirovskii's groundbreaking work [22] proving that minimizing a convex (self-concordant) function can be done in a polynomial number of steps, where in each step a Newton system is solved.

Amongst the properties that make Newton type methods so attractive is that they are invariant to rescaling and coordinate transformations. This property makes them particularly appealing for off-the-shelf solvers, since they work well independently of how the user chooses to scale or represent the variables.
This in turn means that Newton based methods need little or no tuning of hyperparameters. This is in contrast with first-order methods (an exception being, for instance, the optimal affine-invariant first-order method in [10]), where even rescaling the function can result in a significantly different sequence of iterates, and whose efficient execution relies on parameter tuning (typically of the stepsize).

Despite these advantages, Newton based solvers are now facing a challenge that renders most of them inapplicable: large dimensional feature spaces. Indeed, solving a generic Newton system costs $O(d^3)$. While inexact Newton methods [11, 5] made significant headway toward reducing this high cost by relying on Krylov based solvers whose iterations cost $O(d^2)$, this too can be prohibitive, which is why first order methods such as accelerated gradient descent [24] are often used in the large dimensional setting.

In this work we develop a family of randomized Newton methods which work by leveraging randomized sketching and projecting [16]. The resulting randomized Newton method enjoys global linear convergence for virtually any type and size of sketching matrix. In particular, one can choose a sketch of size one, which yields an iteration complexity of as little as $O(1)$ if one assumes that scalar derivatives can be computed in constant time. Our main assumptions are the recently introduced [18] relative smoothness and relative convexity of $f$ (these notions are different from the relative smoothness and convexity concepts considered in [20]), which are in a certain sense weaker than the more common strong convexity and smoothness assumptions. Our method is also scale invariant, which facilitates setting the stepsize. We further propose an efficient line search strategy that does not increase the iteration complexity.

There are only a handful of Newton type methods in the literature that use iterative sketching, including the sketched Newton algorithm [28], SDNA (Stochastic Dual Newton Ascent) [29], RBCN (Randomized Block Cubic Newton) [12] and SON [21]. In the unconstrained case, the sketched Newton algorithm [28] requires a sketch size that is proportional to the global rank of the Hessian, an unknown constant related to high probability statements, and $\epsilon^{-2}$, where $\epsilon > 0$ is the desired tolerance. Consequently, the required sketch size could be as large as $d$, which defeats the purpose. The SDNA algorithm of [29] relies on the existence of a positive definite matrix $M \in \mathbb{R}^{d \times d}$ that globally upper bounds the Hessian, which is a stronger assumption than our relative smoothness assumption. The method then proceeds by selecting random principal submatrices of $M$ that it uses to form and solve an approximate Newton system. The theory in [29] allows for any sketch size, including a sketch size of one. Our method can be seen as an extension of SDNA that allows for any sketch, is applied directly to the Hessian (as opposed to $M$), and relies on a set of more relaxed assumptions. The RBCN method combines the ideas of randomized coordinate descent [23] and cubic regularization [25]; it requires the optimization problem to be block separable and is hence not applicable to the problem we consider here. Finally, SON [21] uses random and deterministic streaming sketches to scale up a second-order method, akin to a Gauss-Newton method, for solving online learning problems.

1.2 Key Assumptions

We assume throughout that $f : \mathbb{R}^d \to \mathbb{R}$ is a convex and twice differentiable function. Further, we assume that $f$ is bounded below and that the set of minimizers $X^*$ is nonempty. We denote the optimal value of (1) by $f^* \in \mathbb{R}$. Let $H(x) := \nabla^2 f(x)$ (resp. $g(x) := \nabla f(x)$) be the Hessian (resp. gradient) of $f$ at $x$.
We fix an initial iterate $x_0 \in \mathbb{R}^d$ throughout and define $Q$ to be the level set of $f$ associated with $x_0$:

$$Q := \{ x \in \mathbb{R}^d : f(x) \le f(x_0) \}. \qquad (2)$$

Let $\langle x, y \rangle_{H(x_k)} := \langle H(x_k) x, y \rangle$ for all $x, y \in \mathbb{R}^d$. Our main assumption on $f$ is given next.

Assumption 1. There exist constants $\hat{L} \ge \hat{\mu} > 0$ such that for all $x, y \in Q$:

$$f(x) \le f(y) + \langle g(y), x - y \rangle + \tfrac{\hat{L}}{2} \| x - y \|^2_{H(y)} =: T(x, y), \qquad (3)$$

$$f(x) \ge f(y) + \langle g(y), x - y \rangle + \tfrac{\hat{\mu}}{2} \| x - y \|^2_{H(y)}. \qquad (4)$$

We refer to $\hat{L}$ and $\hat{\mu}$ as the relative smoothness and relative convexity constants, respectively. Relative smoothness and convexity are a direct consequence of smoothness and strong convexity. They are also a consequence of the recently introduced [18] c-stability condition, which served as an inspiration to us. Specifically, as shown in Lemma 2 in [18] and also formally (for convenience) stated in Proposition 2 in the supplementary material, we have that

$L$-smooth + $\mu$-strongly convex $\Rightarrow$ c-stability $\Rightarrow$ relative smoothness & relative convexity.

We will also further assume:

Assumption 2. $g(x) \in \mathrm{Range}(H(x))$ for all $x \in \mathbb{R}^d$.

Assumption 2 holds if the Hessian is positive definite for all $x$, and it holds for generalized linear models.

1.3 The full Newton method

Our baseline method for solving (1) is the following variant of the Newton Method (NM):

$$x_{k+1} = x_k + \gamma\, n(x_k) := x_k - \gamma\, H^{\dagger}(x_k) g(x_k), \qquad (5)$$

where $H^{\dagger}(x_k)$ is the Moore-Penrose pseudoinverse of $H(x_k)$ and $n(x_k) := -H^{\dagger}(x_k) g(x_k)$ is the Newton direction. A property (which we recall from [18]) that will be important for our analysis is that, for a suitable stepsize, Newton's method is a descent method.

Lemma 1. Consider the iterates $\{x_k\}_{k \ge 0}$ defined recursively by (5). If $\gamma \le 1/\hat{L}$ and (3) holds, then $f(x_{k+1}) \le f(x_k)$ for all $k \ge 0$, and in particular, $x_k \in Q$ for all $k \ge 0$.

The proof follows by using (3), twice differentiability and convexity of $f$; see [18, Lemma 3]. The relative smoothness assumption (3) is particularly important for motivating Newton's method. Indeed, a Newton step is the exact minimizer of the upper bound in (3).

Lemma 2. If Assumption 2 is satisfied, then the quadratic $x \mapsto T(x, x_k)$ defined in (3) has a global minimizer $x_{k+1}$ given by $x_{k+1} = x_k - \tfrac{1}{\hat{L}} H^{\dagger}(x_k) g(x_k) \in Q$.

Proof. Lemma 1 implies that $x_{k+1} \in Q$, and Lemma 9 in the appendix shows that (5) is a global minimizer for $\gamma = 1/\hat{L}$.

2 Randomized Subspace Newton

Solving a Newton system exactly is costly and may be a waste of resources. Indeed, this is the reason for the existence of inexact variants of Newton's method [11]. For these inexact Newton methods, an accurate solution is only needed when close to the optimal point. In this work we introduce a different inexactness idea: we propose to solve the Newton system exactly, but only over a randomly selected subspace. In other words, we propose a randomized subspace Newton method, where the randomness is introduced via sketching matrices, defined next.

Definition 1. Let $\mathcal{D}$ be a (discrete or continuous) distribution over matrices in $\mathbb{R}^{d \times s}$. We say that $S \sim \mathcal{D}$ is a random sketching matrix and $s \in \mathbb{N}$ is the sketch size.

We will often assume that the random sketching is nullspace preserving.

Assumption 3. We say that $S \sim \mathcal{D}$ is nullspace preserving if, with probability one,

$$\mathrm{Null}\left( S^\top H(x) S \right) = \mathrm{Null}(S), \quad \text{for all } x \in Q. \qquad (6)$$

Algorithm 1 (RSN: Randomized Subspace Newton)
1: input: $x_0 \in \mathbb{R}^d$
2: parameters: $\mathcal{D}$ = distribution over random matrices
3: for $k = 0, 1, 2, \dots$ do
4:   sample a fresh sketching matrix: $S_k \sim \mathcal{D}$
5:   $x_{k+1} = x_k - \tfrac{1}{\hat{L}} S_k \left( S_k^\top H(x_k) S_k \right)^{\dagger} S_k^\top g(x_k)$
6: output: last iterate $x_k$
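To make the update on line 5 concrete, the following is a minimal NumPy sketch of an RSN loop with Gaussian sketching matrices; it is for illustration only and is not the implementation used in the experiments of Section 5. The oracles `grad` and `hess`, the relative smoothness constant `L_hat`, and the name `rsn` are placeholders to be supplied by the user, and any other sketching distribution can be substituted for the Gaussian one.

```python
import numpy as np

def rsn(grad, hess, x0, L_hat, s=10, iters=100, rng=np.random.default_rng(0)):
    """Minimal sketch of Algorithm 1 (RSN) with Gaussian sketches S_k in R^{d x s}.

    grad(x) -> gradient g(x) in R^d
    hess(x) -> Hessian H(x) in R^{d x d} (dense here only for illustration; in
               practice one forms the s x s matrix S^T H(x) S directly)
    L_hat   -> relative smoothness constant; the stepsize is 1/L_hat
    """
    x = x0.copy()
    d = x.size
    for _ in range(iters):
        S = rng.standard_normal((d, s))               # S_k ~ D (Gaussian sketch)
        g = grad(x)
        A = S.T @ hess(x) @ S                         # sketched Hessian S^T H S
        rhs = S.T @ g                                 # sketched gradient S^T g
        lam = np.linalg.lstsq(A, rhs, rcond=None)[0]  # (S^T H S)^† S^T g
        x = x - (1.0 / L_hat) * (S @ lam)             # line 5 of Algorithm 1
    return x
```

For example, for a quadratic $f(x) = \tfrac{1}{2} x^\top M x - b^\top x$ with $M$ positive definite one can pass `grad = lambda x: M @ x - b`, `hess = lambda x: M` and `L_hat = 1.0`, since a constant Hessian satisfies Assumption 1 with $\hat{L} = \hat{\mu} = 1$.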
By sampling a sketching matrix $S_k \sim \mathcal{D}$ in the $k$th iteration, we can form a sketched Newton direction using only the sketched Hessian $S_k^\top H(x_k) S_k \in \mathbb{R}^{s \times s}$; see line 5 in Algorithm 1. Note that the sketched Hessian is the result of twice differentiating the function $\lambda \mapsto f(x_k + S_k \lambda)$, which can be done efficiently using a single backpropagation pass [14] or $s$ backpropagation passes [7], costing at most $s$ times the cost of evaluating the function $f$.

First we show that, much like the full Newton method (5), Algorithm 1 is a descent method.

Lemma 3 (Descent). Consider the iterates $x_k$ given by Algorithm 1. If Assumptions 1, 2 and 3 hold, then $f(x_{k+1}) \le f(x_k)$, and consequently $x_k \in Q$ for all $k \ge 0$.

While common in the literature on randomized coordinate (subspace) descent methods, this is a rare result for randomized stochastic gradient descent methods, which do not enjoy a descent property. Lemma 3 is useful for monitoring the progress of the method in cases when function evaluations are not too prohibitive. However, we use it solely for establishing a tighter convergence theory.

Interestingly, the iterations of Algorithm 1 can be equivalently formulated as a random projection of the full Newton step, as we detail next.

Lemma 4. Let Assumptions 1 and 2 hold. Consider the projection matrix $P_k$ with respect to the seminorm $\| \cdot \|^2_{H(x_k)} := \langle \cdot, \cdot \rangle_{H(x_k)}$ given by

$$P_k := S_k \left( S_k^\top H(x_k) S_k \right)^{\dagger} S_k^\top H(x_k) \in \mathbb{R}^{d \times d}. \qquad (7)$$

The iterates of Algorithm 1 can be viewed as a projection of the Newton step, given by

$$x_{k+1} = x_k + \tfrac{1}{\hat{L}} P_k n(x_k). \qquad (8)$$

Proof. To verify that $P_k$ is an oblique projection matrix, it suffices to check that $\langle P_k x, P_k y \rangle_{H(x_k)} = \langle P_k x, y \rangle_{H(x_k)}$ for all $x, y \in \mathbb{R}^d$, which in turn relies on the identity $M^{\dagger} M M^{\dagger} = M^{\dagger}$, which holds for all matrices $M \in \mathbb{R}^{d \times d}$. Since $g(x_k) \in \mathrm{Range}(H(x_k))$, we have, again by a defining identity of the pseudoinverse, that

$$g(x_k) = H(x_k) H^{\dagger}(x_k) g(x_k) = -H(x_k) n(x_k). \qquad (9)$$

Consequently, $P_k n(x_k) = -S_k \left( S_k^\top H(x_k) S_k \right)^{\dagger} S_k^\top g(x_k)$.

We will refer to $P_k n(x_k)$ as the sketched Newton direction. If we add one more simple assumption on the selection of the sketching matrices, we have the following equivalent formulations of the sketched Newton direction.

Lemma 5. Let Assumptions 1, 2 and 3 hold. Then the iterate $x_{k+1}$ of Algorithm 1 can be equivalently seen as:

1. The minimizer of $T(x, x_k)$ over the random subspace $x \in x_k + \mathrm{Range}(S_k)$:

$$x_{k+1} = x_k + S_k \lambda_k, \quad \text{where } \lambda_k \in \arg\min_{\lambda \in \mathbb{R}^s} T(x_k + S_k \lambda, x_k). \qquad (10)$$

Furthermore,

$$T(x_{k+1}, x_k) = f(x_k) - \tfrac{1}{2\hat{L}} \| g(x_k) \|^2_{S_k ( S_k^\top H(x_k) S_k )^{\dagger} S_k^\top}. \qquad (11)$$

2. A projection of the Newton direction onto a random subspace:

$$x_{k+1} = \arg\min_{x \in \mathbb{R}^d, \, \lambda \in \mathbb{R}^s} \left\| x - x_k - \tfrac{1}{\hat{L}} n(x_k) \right\|^2_{H(x_k)} \quad \text{subject to } x = x_k + S_k \lambda. \qquad (12)$$

3. A projection of the previous iterate onto the sketched Newton system:

$$x_{k+1} \in \arg\min_x \| x - x_k \|^2_{H(x_k)} \quad \text{subject to } S_k^\top H(x_k)(x - x_k) = -\tfrac{1}{\hat{L}} S_k^\top g(x_k). \qquad (13)$$

Furthermore, if $\mathrm{Range}(S_k) \subseteq \mathrm{Range}(H(x_k))$, then $x_{k+1}$ is the unique solution to the above.

3 Convergence Theory

We now present two main convergence theorems.

Theorem 2. Let $G(x) := \mathbb{E}_{S \sim \mathcal{D}}\left[ S \left( S^\top H(x) S \right)^{\dagger} S^\top \right]$ and define

$$\rho(x) := \min_{v \in \mathrm{Range}(H(x))} \frac{\langle H^{1/2}(x) G(x) H^{1/2}(x) v, v \rangle}{\| v \|_2^2} \quad \text{and} \quad \rho := \min_{x \in Q} \rho(x). \qquad (14)$$

If Assumptions 1 and 2 hold, then

$$\mathbb{E}[f(x_k)] - f^* \le \left( 1 - \rho \tfrac{\hat{\mu}}{\hat{L}} \right)^k (f(x_0) - f^*). \qquad (15)$$

Consequently, given $\epsilon > 0$, if $\rho > 0$ and if

$$k \ge \tfrac{1}{\rho} \tfrac{\hat{L}}{\hat{\mu}} \log\left( \tfrac{f(x_0) - f^*}{\epsilon} \right), \qquad (16)$$

then $\mathbb{E}[f(x_k) - f^*] < \epsilon$.

Theorem 2 includes the convergence of the full Newton method as a special case. Indeed, when we choose $S_k = I \in \mathbb{R}^{d \times d}$ (or, more generally, any invertible matrix $S_k$), it is not hard to show that $\rho(x_k) \equiv 1$, and thus (16) recovers the $\tfrac{\hat{L}}{\hat{\mu}} \log(1/\epsilon)$ complexity given in [18].
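The quantity $\rho(x)$ in (14) is rarely available in closed form, but for a fixed $x$ it can be estimated numerically by averaging the matrices inside the expectation defining $G(x)$. The snippet below is a small numerical illustration, not part of the analysis: assuming $H(x)$ is positive definite, so that $\mathrm{Range}(H(x)) = \mathbb{R}^d$ and $\rho(x) = \lambda_{\min}\left( H^{1/2}(x) G(x) H^{1/2}(x) \right)$, it estimates $\rho(x)$ by Monte Carlo for a user-supplied sketch sampler; the names `estimate_rho` and `sketch_sampler` are placeholders.

```python
import numpy as np

def estimate_rho(H, sketch_sampler, n_samples=5000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of rho(x) from (14) for a fixed Hessian H = H(x).

    Assumes H is symmetric positive definite. `sketch_sampler(rng)` returns one
    sketching matrix S in R^{d x s}.
    """
    d = H.shape[0]
    evals, evecs = np.linalg.eigh(H)                    # symmetric square root of H
    H_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    G = np.zeros((d, d))
    for _ in range(n_samples):
        S = sketch_sampler(rng)
        G += S @ np.linalg.pinv(S.T @ H @ S) @ S.T
    G /= n_samples                                      # approximates G(x)
    return np.linalg.eigvalsh(H_half @ G @ H_half).min()

# Example: uniformly sampled single-column coordinate sketches S = e_i on a
# diagonal Hessian; the exact value is rho(x) = 1/d.
d = 5
H = np.diag(np.linspace(1.0, 100.0, d))
coord = lambda rng: np.eye(d)[:, [rng.integers(d)]]
print(estimate_rho(H, coord))   # close to 1/d = 0.2
```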
We provide an additional sublinear $O(1/k)$ convergence result that holds even when $\hat{\mu} = 0$.

Theorem 3. Let Assumption 2 hold and let Assumption 1 be satisfied with $\hat{L} > \hat{\mu} = 0$. If

$$R := \inf_{x^* \in X^*} \sup_{x \in Q} \| x - x^* \|_{H(x)} < +\infty, \qquad (17)$$

and $\rho > 0$, then $\mathbb{E}[f(x_k)] - f^* \le \tfrac{2 \hat{L} R^2}{\rho k}$.

As a consequence of Theorem 3, we can also show that the full Newton method has an $O(\hat{L} R^2 \epsilon^{-1})$ iteration complexity. Both of the above theorems rely on $\rho > 0$, so in the next Section 3.1 we give sufficient conditions for $\rho > 0$ that hold for virtually all sketching matrices.

3.1 The sketched condition number $\rho(x_k)$

The parameters $\rho(x_k)$ and $\rho$ in Theorem 2 characterize the trade-off between the cost of the iterations and the convergence rate of RSN. Here we show that $\rho$ is always bounded between zero and one, and further, we give conditions under which $\rho(x_k)$ is the smallest non-zero eigenvalue of an expected projection matrix, and is thus bounded away from zero.

Lemma 6. The parameter $\rho(x_k)$ appearing in Theorem 2 satisfies $0 \le \rho(x_k) \le 1$. Letting

$$\hat{P}(x_k) := H^{1/2}(x_k) S_k \left( S_k^\top H(x_k) S_k \right)^{\dagger} S_k^\top H^{1/2}(x_k), \qquad (18)$$

if we assume that the exactness condition

$$\mathrm{Range}(H(x_k)) = \mathrm{Range}\left( \mathbb{E}_{S \sim \mathcal{D}}[\hat{P}(x_k)] \right) \qquad (19)$$

holds, then $\rho(x_k) = \lambda^{+}_{\min}\left( \mathbb{E}_{S \sim \mathcal{D}}[\hat{P}(x_k)] \right) > 0$.

(An exactness condition similar to (19) was introduced in [30] in a program of exactly reformulating a linear system into a stochastic optimization problem. Our condition has a similar meaning, but we do not elaborate on this as it is not central to the developments in this paper.)

Since (19) is in general hard to verify, we give simpler sufficient conditions for $\rho > 0$ in the next lemma.

Lemma 7 (Sufficient condition for exactness). If Assumption 3 and

$$\mathrm{Range}(H(x_k)) \subseteq \mathrm{Range}\left( \mathbb{E}[S_k S_k^\top] \right) \qquad (20)$$

hold, then (19) holds, and consequently $0 < \rho \le 1$.

Clearly, condition (20) is immediately satisfied if $\mathbb{E}[S_k S_k^\top]$ is invertible, and this is the case for Gaussian sketches, weighted coordinate sketches, sub-sampled Hadamard or Fourier transforms, and the entire class of randomized orthonormal system sketches [27].

3.2 The relative smoothness and strong convexity constants

In the next lemma we give an insightful formula for calculating the relative smoothness and convexity constants defined in Assumption 1, and in particular, show how $\hat{L}$ and $\hat{\mu}$ depend on the relative change of the Hessian.

Lemma 8. Let $f$ be twice differentiable and satisfy Assumption 1. If moreover $H(x)$ is invertible for every $x \in \mathbb{R}^d$, then

$$\hat{L} = \max_{x, y \in Q} \int_{t=0}^{1} 2(1-t) \frac{\| z_t - y \|^2_{H(z_t)}}{\| z_t - y \|^2_{H(y)}} \, dt \;\le\; \max_{x, y \in Q} \frac{\| x - y \|^2_{H(x)}}{\| x - y \|^2_{H(y)}} =: c, \qquad (21)$$

$$\hat{\mu} = \min_{x, y \in Q} \int_{t=0}^{1} 2(1-t) \frac{\| z_t - y \|^2_{H(z_t)}}{\| z_t - y \|^2_{H(y)}} \, dt \;\ge\; \frac{1}{c}, \qquad (22)$$

where $z_t := y + t(x - y)$. The constant $c$ on the right hand side of (21) is known as the c-stability constant [18]. As a byproduct, the above lemma establishes that the rates for the deterministic Newton method obtained as a special case of our general theorems are at least as good as those obtained in [18] using c-stability.

With the freedom of choosing the sketch size, we can consider the extreme case $s = 1$, i.e., sketching matrices having only a single column.

Corollary 1 (Single column sketches). Let $0 \prec U \in \mathbb{R}^{d \times d}$ be a symmetric positive definite matrix such that $H(x) \preceq U$ for all $x \in \mathbb{R}^d$. Let $D = [d_1, \dots, d_d] \in \mathbb{R}^{d \times d}$ be a given invertible matrix such that $d_i^\top H(x) d_i \neq 0$ for all $x \in Q$ and $i = 1, \dots, d$. If we sample according to $\mathbb{P}[S_k = d_i] = p_i := \tfrac{d_i^\top U d_i}{\mathrm{Trace}(D^\top U D)}$, then the update on line 5 of Algorithm 1 is given by

$$x_{k+1} = x_k - \frac{1}{\hat{L}} \frac{d_i^\top g(x_k)}{d_i^\top H(x_k) d_i} \, d_i \quad \text{with probability } p_i, \qquad (23)$$

and under the assumptions of Theorem 2, Algorithm 1 converges according to

$$\mathbb{E}[f(x_k)] - f^* \le \left( 1 - \min_{x \in Q} \frac{\lambda^{+}_{\min}\left( H^{1/2}(x) D D^\top H^{1/2}(x) \right)}{\mathrm{Trace}(D^\top U D)} \, \frac{\hat{\mu}}{\hat{L}} \right)^k (f(x_0) - f^*). \qquad (24)$$

Each iteration of the single column sketching Newton method (23) requires only three scalar derivatives of the function $t \mapsto f(x_k + t d_i)$, and thus if $f(x)$ can be evaluated in constant time, this amounts to an $O(1)$ cost per iteration. Indeed, (23) is much like coordinate descent (there in fact exists a block coordinate method that also incorporates second order information [13]), except that we descend along the $d_i$ directions, and with a stepsize that adapts to the curvature information $d_i^\top H(x_k) d_i$.
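As a concrete illustration of the update (23), the following sketch performs one single-column RSN step; the helper name `single_column_rsn_step` is a placeholder. For clarity the two scalars $d_i^\top g(x_k)$ and $d_i^\top H(x_k) d_i$ are computed from full gradient and Hessian oracles, whereas in practice they would be obtained as cheap directional derivatives of $t \mapsto f(x_k + t d_i)$.

```python
import numpy as np

def single_column_rsn_step(x, grad, hess, D, probs, L_hat, rng):
    """One step of the single-column sketch update (23).

    D     : invertible matrix whose columns d_i are the candidate directions
    probs : sampling probabilities p_i = d_i^T U d_i / Trace(D^T U D)
    """
    i = rng.choice(D.shape[1], p=probs)
    d_i = D[:, i]
    num = d_i @ grad(x)          # first directional derivative  d_i^T g(x)
    den = d_i @ hess(x) @ d_i    # second directional derivative d_i^T H(x) d_i
    return x - (1.0 / L_hat) * (num / den) * d_i

# Toy example: f(x) = 0.5 x^T M x with D = I, U = M (so H(x) <= U) and L_hat = 1
# (constant Hessian); each coordinate is zeroed out as soon as it is sampled.
rng = np.random.default_rng(0)
M = np.diag([1.0, 10.0, 100.0])
grad, hess = (lambda x: M @ x), (lambda x: M)
D = np.eye(3)
probs = np.diag(M) / np.trace(M)      # p_i from Corollary 1 with U = M
x = np.ones(3)
for _ in range(2000):
    x = single_column_rsn_step(x, grad, hess, D, probs, L_hat=1.0, rng=rng)
print(x)   # approaches the minimizer x* = 0
```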
The rate of convergence in (24) suggests that we should choose $D \approx U^{-1/2}$ so that $\rho$ is large. If there is no efficient way to approximate $U^{-1/2}$, then the simple choice $D = I$ gives $\rho(x_k) = \lambda^{+}_{\min}(H(x_k)) / \mathrm{Trace}(U)$.

An expressive family of functions that satisfy Assumption 1 are generalized linear models.

Definition 4. Let $0 \le u \le \ell$ and let $\phi_i : \mathbb{R} \to \mathbb{R}_+$ be twice differentiable functions such that

$$u \le \phi_i''(t) \le \ell, \quad \text{for } i = 1, \dots, n. \qquad (25)$$

Let $a_i \in \mathbb{R}^d$ for $i = 1, \dots, n$ and $A = [a_1, \dots, a_n] \in \mathbb{R}^{d \times n}$. We say that $f : \mathbb{R}^d \to \mathbb{R}$ is a generalized linear model when

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} \phi_i(a_i^\top x) + \frac{\lambda}{2} \| x \|_2^2. \qquad (26)$$

The structure of the Hessian of a generalized linear model is such that highly efficient fast Johnson-Lindenstrauss sketches [3] can be used. Indeed, the Hessian is given by

$$H(x) = \frac{1}{n} \sum_{i=1}^{n} a_i a_i^\top \phi_i''(a_i^\top x) + \lambda I = \frac{1}{n} A \, \Phi''(A^\top x) \, A^\top + \lambda I,$$

where $\Phi''(A^\top x) := \mathrm{Diag}\left( \phi_1''(a_1^\top x), \dots, \phi_n''(a_n^\top x) \right)$. Consequently, for computing the sketched Hessian $S_k^\top H(x_k) S_k$ we only need to sketch the fixed matrix $A$, i.e., compute $S_k^\top A$, and compute $S_k^\top S_k$ efficiently; no backpropagation is required. This is exactly the setting where fast Johnson-Lindenstrauss transforms can be effective [17, 3]. We now give a simple expression for computing the relative smoothness and convexity constants of generalized linear models.

Proposition 1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a generalized linear model with $0 \le u \le \ell$. Then Assumption 1 is satisfied with

$$\hat{L} = \frac{\ell \sigma^2_{\max}(A) + n\lambda}{u \sigma^2_{\max}(A) + n\lambda} \quad \text{and} \quad \hat{\mu} = \frac{u \sigma^2_{\max}(A) + n\lambda}{\ell \sigma^2_{\max}(A) + n\lambda}. \qquad (27)$$

Furthermore, if we apply Algorithm 1 with a sketch such that $\mathbb{E}[S S^\top]$ is invertible, then the iteration complexity (16) of Algorithm 1 becomes

$$\frac{1}{\rho} \left( \frac{\ell \sigma^2_{\max}(A) + n\lambda}{u \sigma^2_{\max}(A) + n\lambda} \right)^2 \log\left( \frac{1}{\epsilon} \right). \qquad (28)$$

This complexity estimate (28) should be contrasted with that of gradient descent. When $x_0 \in \mathrm{Range}(A)$, the iteration complexity of GD (gradient descent) applied to a smooth generalized linear model is given by $\frac{\ell \sigma^2_{\max}(A) + n\lambda}{u \sigma^2_{\min+}(A) + n\lambda} \log\left( \frac{1}{\epsilon} \right)$, where $\sigma_{\min+}(A)$ is the smallest non-zero singular value of $A$. To simplify the discussion, and as a sanity check, consider the full Newton method with $S_k = I$ for all $k$, and consequently $\rho = 1$. In view of (28), the Newton method depends neither on the smallest singular value nor on the condition number of the data matrix. This suggests that for ill-conditioned problems the Newton method can be superior to gradient descent, as is well known.
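To illustrate why sketching is cheap for generalized linear models, here is a small sketch (an illustration, not the experimental code) using the logistic losses $\phi_i(t) = \ln(1 + e^{-y_i t})$ from the experiments below; it shows how $S_k^\top H(x) S_k$ and $S_k^\top g(x)$ can be assembled from the sketched data matrix $S_k^\top A$ without ever forming the $d \times d$ Hessian. The function name `sketched_logreg_system` is a placeholder.

```python
import numpy as np

def sketched_logreg_system(x, A, y, lam, S):
    """Sketched Newton system for the regularized logistic GLM
       f(x) = (1/n) sum_i ln(1 + exp(-y_i a_i^T x)) + (lam/2) ||x||^2.

    A : d x n data matrix with columns a_i;  y : labels in {-1, +1};  S : d x s sketch.
    Returns (S^T H(x) S, S^T g(x)), built from S^T A and S^T S alone.
    """
    n = A.shape[1]
    t = A.T @ x                              # a_i^T x for all i
    sig = 1.0 / (1.0 + np.exp(y * t))        # sigma(-y_i a_i^T x)
    g = -(A @ (y * sig)) / n + lam * x       # gradient g(x)
    phi2 = sig * (1.0 - sig)                 # phi_i''(a_i^T x)
    SA = S.T @ A                             # s x n sketched data matrix
    H_sk = (SA * phi2) @ SA.T / n + lam * (S.T @ S)   # S^T H(x) S
    return H_sk, S.T @ g
```

For the block coordinate sketches used in the experiments below, $S_k^\top A$ is simply an $s$-row submatrix of $A$ and $S_k^\top S_k = I$, so the per-iteration cost is driven by the number of nonzeros in the selected rows.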
5 Experiments and Heuristics

Table 1: Details of the data sets taken from LIBSVM [6] and OpenML [31].

| dataset | non-zero features (d) | samples (n) | density |
|---|---|---|---|
| chemotherapy | 61,359 + 1 | 158 | 1 |
| gisette | 5,000 + 1 | 6,000 | 0.9910 |
| news20 | 1,355,191 + 1 | 19,996 | 0.0003 |
| rcv1 | 47,237 + 1 | 20,241 | 0.0016 |
| real-sim | 20,958 + 1 | 72,309 | 0.0025 |
| webspam | 680,715 + 1 | 350,000 | 0.0055 |

In this section we evaluate and compare the computational performance of RSN (Algorithm 1) on generalized linear models (26). Specifically, we focus on logistic regression, i.e., $\phi_i(t) = \ln(1 + e^{-y_i t})$, where $y_i \in \{-1, 1\}$ are the target values for $i = 1, \dots, n$. Gradient descent (GD), accelerated gradient descent (AGD) [24] and the full Newton method (implemented efficiently by exploiting the Sherman-Morrison-Woodbury matrix identity [32] where appropriate) are compared with RSN. For simplicity, block coordinate sketches are used; these are random sketch matrices of the form $S_k \in \{0,1\}^{d \times s}$ with exactly one non-zero entry per column and at most one non-zero entry per row, i.e., column submatrices of the identity matrix. We will refer to $s \in \mathbb{N}$ as the sketch size. To ensure fairness and for comparability purposes, all methods were supplied with the exact Lipschitz constants and equipped with the same line-search strategy (see Algorithm 3 in the supplementary material). We consider 6 datasets with a diverse number of features and samples (see Table 1 for details), which were modified by removing all zero features and adding an intercept, i.e., a constant feature. For regularization we used $\lambda = 10^{-10}$ and stopped the methods once the gradient norm was below tol $= 10^{-6}$ or some maximal number of iterations had been exhausted. In Figures 1 to 3 we plot the gradient norm against the number of iterations and against wall-clock time, respectively.

[Figure 1: Highly dense problems, favoring RSN methods. Compared: RSN with s = 250, 500, 750, 1000; GD; AGD; Newton.]

[Figure 2: Due to extreme sparsity, accelerated gradient is competitive with the Newton type methods. Compared: RSN with s = 250, 500, 750, 1000; GD; AGD; Newton.]

[Figure 3: Moderately sparse problems favor the RSN method. The full Newton method is infeasible due to high dimensionality. Compared: RSN with s = 250, 500, 750, 1000; GD; AGD.]

Newton's method, when not limited by the immense costs of forming and solving linear systems, is competitive, as we can see on the gisette problem in Figure 1. In most real-world applications, however, the bottleneck is exactly these linear systems, which may, even if they can be formed at all, require significant solving time. On the other end of the spectrum, GD and AGD usually need more iterations and therefore may suffer from expensive full gradient evaluations, for example due to higher density of the data matrix; see Figure 3. RSN seems like a good compromise here: as the sketch size and type can be controlled by the user, the involved linear systems can be kept reasonably sized. As a result, RSN is the fastest method in all the above experiments, with the exception of the extremely sparse problem news20 in Figure 2, where AGD outruns RSN with s = 750 by approximately 20 seconds.
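For reference, a block coordinate sketch of the kind used above can be generated as in the following minimal sketch (an illustration with the placeholder name `block_coordinate_sketch`); in an actual implementation one would never materialize $S_k$, but simply index the chosen $s$ coordinates of the Hessian and gradient.

```python
import numpy as np

def block_coordinate_sketch(d, s, rng=np.random.default_rng(0)):
    """Sample a block coordinate sketch S in {0,1}^{d x s}: a column submatrix of
    the identity given by s coordinates drawn without replacement. Then S^T H S is
    an s x s principal submatrix of the Hessian and S^T g a subvector of the gradient.
    """
    idx = rng.choice(d, size=s, replace=False)
    S = np.zeros((d, s))
    S[idx, np.arange(s)] = 1.0
    return S, idx

# With such a sketch, line 5 of Algorithm 1 reduces to updating only x[idx]:
#   x[idx] -= (1.0 / L_hat) * np.linalg.solve(H[np.ix_(idx, idx)], g[idx])
# so only an s x s system ever has to be formed and solved.
```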
6 Conclusions and Future Work

We have laid out the foundational theory of a class of randomized Newton methods, and we have also performed numerical experiments validating the methods. There are now several avenues of work to explore, including: 1) combining the randomized Newton method with subsampling so that it can be applied to data that is both high dimensional and abundant; 2) leveraging fast Johnson-Lindenstrauss sketches to design even faster variants of RSN; and 3) developing heuristic sketches based on past descent directions, inspired by quasi-Newton methods [15].

References

[1] John T. Abatzoglou, Solomon Z. Dobrowski, Sean A. Parks, and Katherine C. Hegewisch. Data descriptor: TerraClimate, a high-resolution global dataset of monthly climate and climatic water balance from 1958-2015. Scientific Data, 5, 2018.
[2] T. G. Addair, D. A. Dodge, W. R. Walter, and S. D. Ruppert. Large-scale seismic signal analysis with Hadoop. Computers and Geosciences, 66(C), 2014.
[3] Nir Ailon and Bernard Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39(1):302-322, 2009.
[4] Dimitri P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods (Optimization and Neural Computation Series). Athena Scientific, 1996.
[5] Richard H. Byrd, Gillian M. Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977-995, 2011.
[6] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):1-27, 2011.
[7] Bruce Christianson. Automatic Hessians by reverse accumulation. IMA Journal of Numerical Analysis, 12(2):135-150, 1992.
[8] James R. Cole, Qiong Wang, Jordan A. Fish, Benli Chai, Donna M. McGarrell, Yanni Sun, C. Titus Brown, Andrea Porras-Alfaro, Cheryl R. Kuske, and James M. Tiedje. Ribosomal Database Project: data and tools for high throughput rRNA analysis. Nucleic Acids Research, 42(D1):D633-D642, 2013.
[9] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint. Trust-Region Methods. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.
[10] Alexandre d'Aspremont, Cristóbal Guzmán, and Martin Jaggi. Optimal affine-invariant smooth minimization algorithms. SIAM Journal on Optimization, 28(3):2384-2405, 2018.
[11] Ron S. Dembo, Stanley C. Eisenstat, and Trond Steihaug. Inexact Newton methods. SIAM Journal on Numerical Analysis, 19(2):400-408, 1982.
[12] Nikita Doikov and Peter Richtárik. Randomized block cubic Newton method. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[13] Kimon Fountoulakis and Rachael Tappenden. A flexible coordinate descent method. Computational Optimization and Applications, 70(2):351-394, 2018.
[14] R. M. Gower and M. P. Mello. A new framework for the computation of Hessians. Optimization Methods and Software, 27(2):251-273, 2012.
[15] Robert M. Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block BFGS: Squeezing more curvature out of data. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[16] Robert Mansel Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660-1690, 2015.
[17] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189-206. American Mathematical Society, 1984.
[18] Sai Praneeth Karimireddy, Sebastian U. Stich, and Martin Jaggi. Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients. arXiv:1806.00413, 2018.
[19] C. H. Lee and H. J. Yoon. Medical big data: promise and challenges. Kidney Research and Clinical Practice, 36(4):3-11, 2017.
[20] Haihao Lu, Robert M. Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333-354, 2018.
[21] Haipeng Luo, Alekh Agarwal, Nicolò Cesa-Bianchi, and John Langford. Efficient second order online learning by sketching. In Advances in Neural Information Processing Systems 29, pages 902-910, 2016.
[22] Y. Nesterov and A. Nemirovskii. Interior Point Polynomial Algorithms in Convex Programming. Studies in Applied Mathematics. Society for Industrial and Applied Mathematics, 1987.
[23] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.
[24] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1st edition, 2014.
[25] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.
[26] Ross A. Overbeek, Niels Larsen, Gordon D. Pusch, Mark D'Souza, Evgeni Selkov Jr., Nikos Kyrpides, Michael Fonstein, Natalia Maltsev, and Evgeni Selkov. WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Research, 28(1):123-125, 2000.
[27] Mert Pilanci and Martin J. Wainwright. Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. Journal of Machine Learning Research, 17:1-33, 2016.
[28] Mert Pilanci and Martin J. Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205-245, 2017.
[29] Zheng Qu, Peter Richtárik, Martin Takáč, and Olivier Fercoq. SDNA: Stochastic dual Newton ascent for empirical risk minimization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[30] Peter Richtárik and Martin Takáč. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108, 2017.
[31] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49-60, 2013.
[32] Max A. Woodbury. Inverting modified matrices. Technical Report 42, Statistical Research Group, Princeton University, 1950.
[33] Tjalling J. Ypma. Historical development of the Newton-Raphson method. SIAM Review, 37(4):531-551, 1995.