# Stochastic Subspace Cubic Newton Method

Filip Hanzely¹, Nikita Doikov², Peter Richtárik¹, Yurii Nesterov²

¹King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. ²Catholic University of Louvain, Louvain-la-Neuve, Belgium. Correspondence to: Filip Hanzely, Nikita Doikov, Peter Richtárik, Yurii Nesterov.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

In this paper, we propose a new randomized second-order optimization algorithm, Stochastic Subspace Cubic Newton (SSCN), for minimizing a high-dimensional convex function $f$. Our method can be seen both as a stochastic extension of the cubically-regularized Newton method of Nesterov and Polyak (2006), and a second-order enhancement of the stochastic subspace descent of Kozak et al. (2019). We prove that as we vary the minibatch size, the global convergence rate of SSCN interpolates between the rate of stochastic coordinate descent (CD) and the rate of cubic regularized Newton, thus giving new insights into the connection between first and second-order methods. Remarkably, the local convergence rate of SSCN matches the rate of stochastic subspace descent applied to the problem of minimizing the quadratic function $\frac12 (x-x^*)^\top \nabla^2 f(x^*)(x-x^*)$, where $x^*$ is the minimizer of $f$, and hence depends on the properties of $f$ at the optimum only. Our numerical experiments show that SSCN outperforms non-accelerated first-order CD algorithms while being competitive with their accelerated variants.

1. Introduction

In this work we consider the optimization problem
$$\min_{x\in\mathbb{R}^d}\ \left\{ F(x) := f(x) + \psi(x) \right\}, \qquad (1)$$
where $f:\mathbb{R}^d\to\mathbb{R}$ is convex and twice differentiable and $\psi:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ is a simple convex function. We are interested in the regime where the dimension $d$ is very large, which arises in many contexts, such as the training of modern over-parameterized machine learning models. In this regime, coordinate descent (CD) methods, or more generally subspace descent methods, are the methods of choice.

1.1. Subspace descent methods

Subspace descent methods rely on update rules of the form
$$x^+ = x + Sh, \qquad S\in\mathbb{R}^{d\times\tau(S)},\ h\in\mathbb{R}^{\tau(S)}, \qquad (2)$$
where $S$ is a thin matrix, typically with a negligible number of columns compared to the dimension (i.e., $\tau(S)\ll d$). That is, they move from $x$ to $x^+$ along the subspace spanned by the columns of $S$. In these methods, the subspace matrix $S$ is typically chosen first, followed by the determination of the parameters $h$ which define the linear combination of the columns determining the update direction.

Several different rules have been proposed in the literature for choosing the matrix $S$, including greedy, cyclic and randomized rules. In this work we consider a randomized rule. In particular, we assume that $S$ is sampled from an arbitrary but fixed distribution $\mathcal{D}$, restricted only by requiring that $S$ be of full column rank with probability one.¹ Once $S\sim\mathcal{D}$ is sampled, the rule for determining the step $h$ varies from algorithm to algorithm, but is mostly determined by the underlying oracle model for information access to the function $f$. For instance, first-order methods require access to the subspace gradient $\nabla_S f(x) := S^\top \nabla f(x)$, and are relatively well studied (Nesterov, 2012; Stich et al., 2013; Richtárik & Takáč, 2014; Wright, 2015; Kozak et al., 2019). At the other extreme are variants performing a full subspace minimization, i.e., $f$ is minimized over the affine subspace $\{x + Sh \mid h\in\mathbb{R}^{\tau(S)}\}$; see (Chang et al., 2008).
In particular, in this paper we are interested in the second-order oracle model; i.e., we claim access both to the subspace gradient $\nabla_S f(x) := S^\top \nabla f(x)$ and the subspace Hessian $\nabla^2_S f(x) := S^\top \nabla^2 f(x)\, S$.

¹It is rather simple to extend our results to matrices $S$ which are column-rank deficient. However, this would introduce a rather heavy notational burden which we decided to avoid for the sake of clarity and readability.

1.2. Contributions

We now summarize our contributions:

(a) New second-order subspace method. We propose a new stochastic subspace method, Stochastic Subspace Cubic Newton (SSCN), constructed by minimizing an oracle-consistent global upper bound on the objective $f$ in each iteration (Section 3). This bound is formed using both the subspace gradient and the subspace Hessian at the current iterate, and relies on Lipschitzness of the subspace Hessian.

(b) Interpolating global rate. We prove (Section 5) that SSCN enjoys a global convergence rate that interpolates between the rate of stochastic CD and the rate of cubic regularized Newton as one varies the expected dimension of the subspace, $\mathbb{E}[\tau(S)]$.

(c) Fast local rate. Remarkably, we establish a local convergence bound for SSCN (Section 6) that matches the rate of stochastic subspace descent (SSD) (Gower & Richtárik, 2015) applied to solving the problem
$$\min_{x\in\mathbb{R}^d}\ \tfrac12 (x-x^*)^\top \nabla^2 f(x^*)(x-x^*), \qquad (3)$$
where $x^*$ is the solution of (1). Thus, SSCN behaves as if it had access to a perfect second-order model of $f$ at the optimum, and was given the (intuitively much simpler) task of minimizing this model instead. Furthermore, note that SSD (Gower & Richtárik, 2015) applied to minimizing a convex quadratic can be interpreted as performing an exact subspace search in each iteration, i.e., it minimizes the objective exactly along the active subspace (Richtárik & Takáč, 2017). Therefore, the local rate of SSCN matches the rate of the greediest strategy for choosing $h$ in the active subspace, and as such, this rate is the best one can hope for from a method that does not incorporate some form of acceleration.

(d) Special cases. We discuss in Section 3.2 how SSCN reduces to several existing stochastic second-order methods in special cases, either recovering the best known rates, or improving upon them. This includes SDSA (Gower & Richtárik, 2015), CN (Griewank, 1981; Nesterov & Polyak, 2006) and RBCN (Doikov & Richtárik, 2018). However, our method is more general and hence allows for more applications. We discuss more remotely related literature in Section 4.

We now give a simple example of our setting.

Example 1 (Coordinate subspace setup). Let $I_d\in\mathbb{R}^{d\times d}$ be the identity matrix and let $S$ be a random subset of $\{1,2,\dots,d\}$. Given that $S = (I_d)_{(:,S)}$ with probability 1, the oracle model reveals $(\nabla f(x))_S$ and $(\nabla^2 f(x))_{(S,S)}$. Therefore, we have access to a random block of partial derivatives of $f$ and a block submatrix of its Hessian, both corresponding to the subset of indices $S$. Furthermore, the rule (2) updates a subset $S$ of coordinates only. In this setting, our method is a new second-order coordinate subspace descent method.

2. Preliminaries

Throughout the paper, we assume that $f$ is convex, twice differentiable and sufficiently smooth, and that $\psi$ is convex, albeit possibly non-differentiable.²

Assumption 2.1. Function $f:\mathbb{R}^d\to\mathbb{R}$ is convex and twice differentiable with $M$-Lipschitz continuous Hessian. Function $\psi:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ is proper, closed and convex.

We always assume that a minimum of $F$ exists, and denote by $x^*$ any of its minimizers. We let $F^* := F(x^*)$.
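Before introducing the subspace smoothness constant used by our method, the following minimal Python sketch illustrates the coordinate-subspace oracle of Example 1 and the update rule (2). The quadratic test function, its derivatives and the sampled index set are illustrative assumptions made for this example only, not part of the paper.

```python
import numpy as np

# Illustrative smooth convex test function (an assumption for this sketch):
# f(x) = 0.5 * x^T A x - b^T x  with A positive definite.
rng = np.random.default_rng(0)
d = 10
B = rng.standard_normal((d, d))
A = B @ B.T + np.eye(d)          # positive definite Hessian
b = rng.standard_normal(d)

grad = lambda x: A @ x - b       # full gradient oracle
hess = lambda x: A               # full Hessian oracle

x = np.zeros(d)
idx = rng.choice(d, size=3, replace=False)   # random coordinate block
S = np.eye(d)[:, idx]                        # S = I_d[:, S], full column rank

g_S = S.T @ grad(x)              # subspace gradient  S^T ∇f(x)
H_S = S.T @ hess(x) @ S          # subspace Hessian   S^T ∇²f(x) S

# Any rule for h uses only (g_S, H_S); here, exact subspace minimization:
h = np.linalg.solve(H_S, -g_S)
x_plus = x + S @ h               # update rule (2): x⁺ = x + S h
```

Note that for coordinate blocks, forming $S^\top\nabla f(x)$ and $S^\top\nabla^2 f(x)S$ amounts to selecting the corresponding entries of the gradient and the corresponding submatrix of the Hessian; the dense matrix products above are written out only for clarity.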
Since our method always takes steps along random subspaces spanned by the columns of $S\in\mathbb{R}^{d\times\tau(S)}$, it is reasonable to define the Lipschitz constant of the Hessian over the range of $S$:³
$$M_S := \max_{x\in\mathbb{R}^d}\ \max_{h\in\mathbb{R}^{\tau(S)},\,h\neq 0}\ \frac{\left|\nabla^3 f(x)[Sh]^3\right|}{\|Sh\|^3}. \qquad (4)$$
As the next lemma shows, the maximal value of $M_S$ over all $S$ of width $\tau$ can be up to $(d/\tau)^{3/2}$ times smaller than $M$, and this will lead to a tighter approximation of the objective.

Lemma 2.2. We have $M \geq \max_{\tau(S)=\tau} M_S$. Moreover, there is a problem where $\max_{\tau(S)=\tau} M_S = (\tau/d)^{3/2}\, M$. Lastly, if $\mathrm{Range}(S) = \mathrm{Range}(S')$, then $M_S = M_{S'}$.

The next lemma provides a direct motivation for our algorithm. It gives a global upper bound on the objective over a random subspace, given the first and second-order information at the current point.

Lemma 2.3. Let $x\in\mathbb{R}^d$, $S\in\mathbb{R}^{d\times\tau(S)}$, $h\in\mathbb{R}^{\tau(S)}$ and let $x^+$ be as in (2). Then
$$\left| f(x^+) - f(x) - \langle \nabla_S f(x), h\rangle - \tfrac12\langle \nabla^2_S f(x)\,h, h\rangle \right| \leq \frac{M_S}{6}\,\|Sh\|^3. \qquad (5)$$
As a consequence, we have
$$F(x^+) \leq f(x) + T_S(x,h), \qquad (6)$$
where
$$T_S(x,h) := \langle \nabla_S f(x), h\rangle + \tfrac12\langle \nabla^2_S f(x)\,h, h\rangle + \frac{M_S}{6}\,\|Sh\|^3 + \psi(x+Sh).$$

We shall also note that for the function $\psi$ we require separability with respect to the sampling distribution (see Definition 5.5 and the corresponding Assumption 5.6 in Section 5.1). For better orientation throughout the paper, we provide a table of frequently used notation in the Appendix.

²We will also require separability of $\psi$; see Section 5.1.
³By $\|x\| := \langle x, x\rangle^{1/2}$ we denote the standard Euclidean norm.

3. Algorithm

For a given $S$ and current iterate $x^k$, it is natural to choose $h$ as a minimizer of the upper bound (6) in $h$ for $x = x^k$, and subsequently set $x^{k+1} = x^+$ via (2). Note that we choose $S$ randomly according to a fixed distribution $\mathcal{D}$ (with a possibly random number of columns). We have just described SSCN, Stochastic Subspace Cubic Newton, formally stated as Algorithm 1.

Algorithm 1 SSCN: Stochastic Subspace Cubic Newton
1: Initialization: $x^0$, distribution $\mathcal{D}$ of random matrices with $d$ rows and full column rank
2: for $k = 0, 1, \dots$ do
3:   Sample $S$ from distribution $\mathcal{D}$
4:   $h^k = \arg\min_{h\in\mathbb{R}^{\tau(S)}} T_S(x^k, h)$
5:   Set $x^{k+1} = x^k + S h^k$

Remark 1. Inequality (6) becomes an equality for $h = 0$. As a consequence, we must have $F(x^{k+1}) \leq F(x^k)$, and thus the sequence $\{F(x^k)\}_{k\geq 0}$ is non-increasing.

3.1. Solving the subproblem

Algorithm 1 requires $T_S$ to be minimized in $h$ at each iteration. As this operation does not have a closed-form solution in general, it requires an optimization subroutine itself, of a possibly non-trivial complexity, which we discuss here.

The subproblem without ψ. Let us now consider the case $\psi(x)\equiv 0$, in which our problem (1) does not contain any non-differentiable components. Various techniques for minimizing regularized quadratic functions were developed in the context of trust-region methods (see (Conn et al., 2000)), and applied to cubic regularization in (Nesterov & Polyak, 2006). The classical approach consists in first performing a diagonalization of the matrix $\nabla^2_S f(x)$, by computing the eigenvalue or tridiagonal decomposition, which costs $O(\tau(S)^3)$ arithmetic operations. Then, to find the minimizer, it merely remains to solve a one-dimensional nonlinear equation (this part can be done by $O(1)$ iterations of the one-dimensional Newton method, with a linear cost per step). More details and analysis of this procedure can be found in (Gould et al., 2010).
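The following Python sketch is one possible implementation of this classical procedure (it is an illustration under stated assumptions, not the authors' code). It assumes $\psi\equiv 0$ and a sketch $S$ with orthonormal columns (e.g., the coordinate blocks of Example 1, for which $\|Sh\| = \|h\|$): it diagonalizes the subspace Hessian and then solves the scalar equation in $r = \|h\|$, here by bisection instead of one-dimensional Newton, for simplicity.

```python
import numpy as np

def solve_cubic_subproblem(g, H, M, tol=1e-12):
    """Minimize  T(h) = <g, h> + 0.5 <H h, h> + (M/6) ||h||^3  over h.

    Assumes H is symmetric positive semidefinite (f convex), M > 0, and that
    the sketch S has orthonormal columns so that ||S h|| = ||h||.
    """
    lam, Q = np.linalg.eigh(H)          # H = Q diag(lam) Q^T,  O(tau^3)
    g_hat = Q.T @ g
    if np.linalg.norm(g_hat) == 0.0:
        return np.zeros_like(g)

    # Stationarity gives  h = -(H + (M r / 2) I)^{-1} g  with  r = ||h||,
    # i.e. r is the unique positive root of
    #     phi(r) = ||(H + (M r / 2) I)^{-1} g|| - r.
    def norm_h(r):
        return np.linalg.norm(g_hat / (lam + 0.5 * M * r))

    lo, hi = 0.0, np.sqrt(2.0 * np.linalg.norm(g) / M)   # phi(hi) <= 0
    for _ in range(100):                                  # bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if norm_h(mid) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, hi):
            break
    r = 0.5 * (lo + hi)
    return -Q @ (g_hat / (lam + 0.5 * M * r))
```

In the setting of Algorithm 1 one would call this routine with $g = S^\top\nabla f(x^k)$, $H = S^\top\nabla^2 f(x^k)S$ and $M = M_S$, and then set $x^{k+1} = x^k + Sh$.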
The next example gives a setting in which an explicit formula for the minimizer of $T_S$ can be deduced.

Example 2. Let $e_i$ be the $i$th unit basis vector in $\mathbb{R}^d$. If $S\in\{e_1,\dots,e_d\}$ with probability 1 and $\psi(x)\equiv 0$, the update rule can be written as $x^{k+1} = x^k - \alpha^k_i e_i$, with
$$\alpha^k_i = \frac{2\,\nabla_i f(x^k)}{\nabla^2_{ii} f(x^k) + \sqrt{\left(\nabla^2_{ii} f(x^k)\right)^2 + 2 M_{e_i}\left|\nabla_i f(x^k)\right|}},$$
and thus the cost of solving the subproblem is $O(1)$.

Subproblem with simple ψ. In some scenarios, the minimization of $T_S$ can be done using a simple algorithm if $\psi$ is simple enough. We now give an example of this.

Example 3. If $S\in\{e_1,\dots,e_d\}$ with probability 1, the subproblem can be solved using a binary search, given that the evaluation of $\psi$ is cheap. In particular, if we can evaluate $\psi(x^k + Sh) - \psi(x^k)$ in $O(1)$, the cost of solving the subproblem will be $O(1)$.

The subproblem with general ψ. In the case of general regularizers, a recent line of work by Carmon & Duchi (2019) explores the use of first-order optimization methods (gradient methods) for computing an approximate minimizer of $T_S$. We note that the backbone of such gradient methods is the following operation (for a given vector $b\in\mathbb{R}^{\tau(S)}$ and a positive scalar $\alpha$):
$$\arg\min_{h\in\mathbb{R}^{\tau(S)}}\ \left\{ \langle b, h\rangle + \frac{\alpha}{3}\|Sh\|^3 + \psi(x^k + Sh) \right\}.$$
To the best of our knowledge, the most efficient gradient method for this purpose is the Fast Gradient Method (FGM) of Nesterov (2019), achieving an $O(1/k^6)$ convergence rate. Moreover, FGM can deal with any $\psi$ as long as the above operation is cheap to perform. We shall also note that gradient methods do not require storage of $\nabla^2_S f(x)$; rather, they iteratively access products of the subspace Hessian with vectors, $\nabla^2_S f(x)\,h$.

Line search. Note that in Algorithm 1 we use the Lipschitz constant $M_S$ of the subspace Hessian (see definition (4)) as the regularization parameter. In many applications, $M_S$ can be estimated cheaply (see Section 7). In general, however, $M_S$ might be unknown or hard to estimate. In such a case, one might use a simple one-dimensional search at each iteration: multiply the estimate of $M_S$ by a factor of two until the bound (6) is satisfied, and divide it by two at the start of each iteration. The average number of such line search steps per iteration can then be bounded by two (see (Grapiglia & Nesterov, 2017) for the details).
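As an illustration (not the authors' code), the following Python sketch combines the closed-form coordinate step of Example 2 with the doubling line search just described. The helpers `grad_i_fn(x, i)` and `hess_ii_fn(x, i)`, assumed to return $\nabla_i f(x)$ and $\nabla^2_{ii} f(x)$, as well as the initial estimate `M_est`, are hypothetical names introduced only for this example.

```python
import numpy as np

def coordinate_cubic_step(x, i, grad_i, hess_ii, M_i):
    """Closed-form coordinate step of Example 2: returns x - alpha * e_i with
    alpha = 2 g / (H + sqrt(H^2 + 2 M |g|)),  g = ∇_i f(x),  H = ∇²_ii f(x) >= 0."""
    g, H = grad_i, hess_ii
    if g == 0.0:
        alpha = 0.0
    else:
        alpha = 2.0 * g / (H + np.sqrt(H**2 + 2.0 * M_i * abs(g)))
    x_new = x.copy()
    x_new[i] -= alpha
    return x_new, alpha

def coordinate_step_with_line_search(f, grad_i_fn, hess_ii_fn, x, i, M_est):
    """Doubling line search on the estimate of M_{e_i}: halve the estimate at the
    start, then keep doubling it until the model upper bound (6) holds."""
    M = max(M_est / 2.0, 1e-12)
    g, H = grad_i_fn(x, i), hess_ii_fn(x, i)
    while True:
        x_new, alpha = coordinate_cubic_step(x, i, g, H, M)
        # Right-hand side of (6) with psi = 0 and h = -alpha:
        model = f(x) - g * alpha + 0.5 * H * alpha**2 + (M / 6.0) * abs(alpha)**3
        if f(x_new) <= model + 1e-12:
            return x_new, M        # return M so it can seed the next iteration
        M *= 2.0
```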
3.2. Special cases

There are several scenarios in which SSCN reduces to an already known algorithm. We list them below.

Quadratic minimization. If $M = 0$ and $\psi\equiv 0$, SSCN reduces to the stochastic dual subspace ascent (SDSA) method (Gower & Richtárik, 2015), first analyzed in an equivalent primal form as the sketch-and-project method in (Gower & Richtárik, 2015). In this case, due to the quadratic structure of the objective, SSCN simultaneously performs a first-order update, a second-order update, and exact minimization over the active subspace (Richtárik & Takáč, 2017). The convergence rate we provide in Section 6 exactly matches the rate of sketch-and-project as well. As a consequence, we recover a subclass of matrix inversion algorithms (Gower & Richtárik, 2017) together with stochastic spectral (coordinate) descent (Kovalev et al., 2018), along with their convergence theory.

Full-space method. If $S = I_d$ with probability 1, SSCN reduces to cubically regularized Newton (CN) (Griewank, 1981; Nesterov & Polyak, 2006). In this case, we recover both the existing global convergence rates and the superlinear local convergence rates.

Separable non-quadratic part of f. The RBCN method of Doikov & Richtárik (2018) aims to minimize (1) with $f(x) = g(x) + \phi(x)$, where $g, \phi$ are both convex and $\phi$ is separable.⁴ They assume that $\nabla^2 g(x) \preceq \mathbf{A} \in\mathbb{R}^{d\times d}$ for all $x\in\mathbb{R}^d$, while $\phi$ has a Lipschitz continuous Hessian. In each iteration, RBCN constructs an upper bound on the objective using first-order information from $g$ only. This is unlike SSCN, which uses second-order information from $g$. In the special case when $\nabla^2 g(x) = \mathbf{A}$ for all $x$, SSCN and RBCN are identical algorithms. However, RBCN is less general: it requires a separable $\phi$, and thus does not cover some of our applications, and it takes directions along coordinates only. Further, the rates we provide are better even in the setting where the two methods coincide ($\nabla^2 g(x) = \mathbf{A}$). The simplest way to see this is by looking at the local convergence: RBCN does not achieve the local convergence rate of block CD applied to (3), which is the best one might hope for.

Besides these particular cases, for a general twice-differentiable $f$, SSCN is a new second-order method.

⁴Separability is defined in Section 5.1.

4. Related Literature

Several methods in the literature are related to SSCN. We briefly review them below.

Cubic regularization of the Newton method was first proposed by Griewank (1981), and received substantial attention after the work of Nesterov & Polyak (2006), where its global complexity guarantees were established. During the last decade, there has been a steady increase of research in second-order methods, producing Accelerated (Nesterov, 2008; Monteiro & Svaiter, 2013), Adaptive (Cartis et al., 2011a;b), and Universal (Grapiglia & Nesterov, 2017; 2019; Doikov & Nesterov, 2019) schemes (the latter adjust automatically to the smoothness properties of the objective).

There is a vast literature on first-order coordinate descent (CD) methods. While CD with $\tau = 1$ is essentially the same method throughout the literature (Nesterov, 2012; Richtárik & Takáč, 2014; Wright, 2015), there are several ways to deal with $\tau > 1$. The first approach constructs a separable upper bound on the objective (in expectation) in the direction of a random subset of coordinates (Qu & Richtárik, 2016a;b), which is minimized to obtain the next iterate. The second approach, SDNA (Qu et al., 2016), works with a tighter non-separable upper bound. SDNA is therefore more costly to implement, but requires a smaller number of iterations to converge. The literature on first-order subspace descent algorithms is slightly less rich; notable examples are random pursuit (Stich et al., 2013) and stochastic subspace descent (Kozak et al., 2019).

Randomized subspace Newton (RSN) (Gower et al., 2019) is a method of the form
$$x^{k+1} = x^k - \frac{1}{\hat{L}}\, S\left(\nabla^2_S f(x^k)\right)^{-1} \nabla_S f(x^k)$$
for some specific fixed $\hat{L}$. In particular, it can be seen as a method minimizing the following upper bound on the function, which follows from their assumption:
$$h^k = \arg\min_{h}\ \langle \nabla_S f(x^k), h\rangle + \frac{\hat{L}}{2}\langle \nabla^2_S f(x^k)\,h, h\rangle.$$
This is followed by an update over the subspace: $x^{k+1} = x^k + S h^k$. Since RSN and SSCN are analyzed under different assumptions, the global linear rates are not directly comparable. However, the local rate of SSCN is superior to that of RSN. We shall also note that RSN is a stochastic subspace version of a method from (Karimireddy et al., 2018).

Subsampled Newton (SN) methods (Byrd et al., 2011; Erdogdu & Montanari, 2015; Xu et al., 2016; Roosta-Khorasani & Mahoney, 2019), subsampled cubic regularized Newton methods (Kohler & Lucchi, 2017; Xu et al., 2017; Wang et al., 2018) and stochastic (cubic regularized) Newton methods (Tripuraneni et al., 2018; Cartis & Scheinberg, 2018; Kovalev et al., 2019) are stochastic second-order algorithms for finite-sum minimization.
Their major disadvantage is the requirement of an immense sample size, which often makes them impractical if used as the theory prescribes. A notable exception that does not require a large sample size was recently proposed by Kovalev et al. (2019). However, none of these methods are directly comparable to SSCN, as they are not subspace descent methods; rather, they randomize over data points (or sketch the Hessian from the inside (Pilanci & Wainwright, 2017)).

5. Global Complexity Bounds

We now present the global complexity results of SSCN.

5.1. Assumptions

Throughout this section, we require some uniformity of the distribution $\mathcal{D}$ over the subspaces given by $S$. In particular, we require $\mathbf{P}_S := S(S^\top S)^{-1}S^\top$, the projection matrix onto the range of $S$, to be a scalar multiple of the identity matrix in expectation.

Assumption 5.1. There exists $\tau > 0$ such that the distribution $\mathcal{D}$ satisfies
$$\mathbb{E}_{S\sim\mathcal{D}}\left[\mathbf{P}_S\right] = \frac{\tau}{d}\, I_d. \qquad (7)$$

A direct consequence of Assumption 5.1 is that $\tau$ is the expected width of $S$, as the next lemma states.

Lemma 5.2. If Assumption 5.1 holds, then $\mathbb{E}[\tau(S)] = \tau$.

As mentioned before, the global complexity results interpolate between the convergence rate of (first-order) CD and the (global) convergence rate of cubic Newton. However, first-order CD requires Lipschitzness of the gradients, and thus we will require it as well.

Assumption 5.3. Function $f$ has $L$-Lipschitz continuous gradients, i.e., $\nabla^2 f(x) \preceq L I_d$ for all $x\in\mathbb{R}^d$.

We will also need an extra assumption on $\psi$. It is well known that proximal (first-order) CD with a fixed step size does not converge if $\psi$ is not separable: in such a case, even if $f(x^k) = f(x^*)$ we might have $f(x^{k+1}) > f(x^*)$. Therefore, we cannot hope that SSCN will converge without additional assumptions on $\psi$. Informally speaking, separability of $\psi$ with respect to the directions given by the columns of $S$ is required. To define it formally, let us first introduce the notion of a separable set.

Definition 5.4. A set $Q\subseteq\mathbb{R}^d$ is called $\mathcal{D}$-separable if for all $x, y\in Q$ and $S\sim\mathcal{D}$: $\mathbf{P}_S x + (I_d - \mathbf{P}_S)y \in Q$.

Using set separability, we next define separability of a function.

Definition 5.5. Function $\phi:\mathbb{R}^d\to\mathbb{R}\cup\{+\infty\}$ is $\mathcal{D}$-separable if $\mathrm{dom}\,\phi$ is $\mathcal{D}$-separable, and there is a map $\bar\phi:\mathrm{dom}\,\phi\to\mathbb{R}^d$ such that
1. $\forall x\in\mathrm{dom}\,\phi:\ \phi(x) = \langle \bar\phi(x), e\rangle$,⁵
2. $\forall x, y\in\mathrm{dom}\,\phi,\ S\sim\mathcal{D}:\ \bar\phi\big(\mathbf{P}_S x + (I_d-\mathbf{P}_S)y\big) = \mathbf{P}_S\bar\phi(x) + (I_d-\mathbf{P}_S)\bar\phi(y)$.

Example 4. If $\mathcal{D}$ is a set of matrices whose columns are standard basis vectors, $\mathcal{D}$-separability reduces to classical (coordinate-wise) separability.

Example 5. If $\mathcal{D}$ is a set of matrices which are column-wise submatrices of an orthogonal matrix $U$, $\mathcal{D}$-separability of $\phi$ reduces to classical coordinate-wise separability of $\phi(Ux)$.

Example 6. $\phi(x) = \frac12\|x\|^2$ is $\mathcal{D}$-separable for any $\mathcal{D}$.

Assumption 5.6. Function $\psi$ is $\mathrm{Range}(\mathcal{D})$-separable.

⁵By $e\in\mathbb{R}^d$ we mean the vector of all ones.

We are now ready to present the convergence rate of SSCN.

5.2. Theory

First, let us introduce the key lemma from which the main global complexity results are derived. It states the expected progress made by one step of SSCN.

Lemma 5.7. Let Assumptions 2.1, 5.1, 5.3 and 5.6 hold. Then, for every $k\geq 0$ and $y\in\mathbb{R}^d$ we have
$$\mathbb{E}\left[F(x^{k+1})\,\middle|\,x^k\right] \leq \left(1-\frac{\tau}{d}\right)F(x^k) + \frac{\tau}{d}\left(F(y) + \left(1-\frac{\tau}{d}\right)\frac{L}{2}\|y-x^k\|^2 + \frac{M}{3}\|y-x^k\|^3\right). \qquad (8)$$

Now we are ready to present the global complexity results for the general class of convex functions. The convergence rate is obtained by summing (8) over the iterations $k$, with a specific choice of $y$.

Theorem 5.8. Let Assumptions 2.1, 5.1, 5.3 and 5.6 hold. Denote
$$R := \sup_{x\in\mathbb{R}^d}\left\{\|x - x^*\|\ :\ F(x)\leq F(x^0)\right\}, \qquad (9)$$
and suppose that $R < +\infty$. Then, for every $k\geq 1$ we have
$$\mathbb{E}\left[F(x^k)\right] - F^* = O\!\left(\frac{\left(1-\frac{\tau}{d}\right)L R^2}{\frac{\tau}{d}\,k} + \frac{M R^3}{\left(\frac{\tau}{d}\,k\right)^2} + \frac{F(x^0)-F^*}{\left(\frac{\tau}{d}\,k\right)^3}\right). \qquad (10)$$
Note that the convergence rate of the minibatch version⁶ of first-order CD is $O\!\left(\frac{d L R^2}{\tau k}\right)$. At the same time, the (global) convergence rate of the cubically regularized Newton method is $O\!\left(\frac{M R^3}{k^2}\right)$. Therefore, Theorem 5.8 shows that the global rate of SSCN interpolates between the two extremes, depending on the sample size $\tau$ we choose.

Remark 2. According to estimate (10), in order to have $\mathbb{E}\left[F(x^k)\right] - F^* \leq \varepsilon$, it is enough to perform
$$O\!\left(\frac{d}{\tau}\left(\frac{\left(1-\frac{\tau}{d}\right)LR^2}{\varepsilon} + \sqrt{\frac{MR^3}{\varepsilon}} + \sqrt[3]{\frac{F(x^0)-F^*}{\varepsilon}}\right)\right)$$
iterations of SSCN.

Next, we move to the strongly convex case.

Assumption 5.9. Function $f$ is $\mu$-strongly convex, i.e., $\nabla^2 f(x) \succeq \mu I_d$ for all $x\in\mathbb{R}^d$.

Remark 3. Strong convexity of the objective (assumed for Theorem 5.10 later) implies $R < +\infty$. Furthermore, due to the monotonicity of the sequence $\{F(x^k)\}_{k\geq 0}$ (see Remark 1), we have $\|x^k - x^*\| \leq R$ for all $k$. Therefore, it is sufficient to require Lipschitzness of the gradients over the sublevel set, which holds with $L = \lambda_{\max}(\nabla^2 f(x^*)) + MR$.

As both extremes, cubic regularized Newton (where $S = I_d$ always) and (first-order) CD ($S = e_i$ for a randomly chosen $i$), enjoy a (global) linear rate under strong convexity, linear convergence of SSCN is expected as well. At the same time, the leading complexity term should lie between the two extremes. Such a result is established as Theorem 5.10.

Theorem 5.10. Let Assumptions 2.1, 5.1, 5.6 and 5.9 hold. Then $\mathbb{E}\left[F(x^k)\right] - F^* \leq \varepsilon$ as long as the number of iterations of SSCN is at least
$$O\!\left(\frac{d}{\tau}\left(\left(1-\frac{\tau}{d}\right)\frac{L}{\mu} + \sqrt{\frac{MR}{\mu}} + 1\right)\log\frac{F(x^0)-F^*}{\varepsilon}\right).$$

Indeed, if $S = I_d$ with probability 1 and $MR \geq \mu$, the leading complexity term becomes $\sqrt{\frac{MR}{\mu}}\log\frac{F(x^0)-F^*}{\varepsilon}$, which corresponds to the global complexity of cubically regularized Newton for minimizing strongly convex functions (Nesterov & Polyak, 2006). On the other side of the spectrum, if $S = e_i$ with probability $\frac1d$, the leading complexity term becomes $\frac{dL}{\mu}\log\frac{F(x^0)-F^*}{\varepsilon}$, which corresponds to the convergence rate of CD (Nesterov, 2012). Lastly, if $1 < \tau < d$, the global linear rate interpolates between the rates mentioned above.

⁶Sampling $\tau$ coordinates at a time for objectives with $L$-Lipschitz gradients.

Remark 4. The proof of Theorem 5.10 only uses the following consequence of strong convexity:
$$\frac{\mu}{2}\|x-x^*\|^2 \leq F(x) - F^*, \qquad \forall x\in\mathbb{R}^d, \qquad (11)$$
and thus the conditions of Theorem 5.10 might be slightly relaxed.⁷ For a detailed comparison of various relaxations of strong convexity, see (Karimi et al., 2016).

⁷However, this relaxation is not sufficient to obtain the local convergence results.

6. Local Convergence

Throughout this section, assume that $\psi\equiv 0$. We first present the key descent lemma, which will be used to obtain the local rates. Let
$$H_S(x) := \nabla^2_S f(x) + \sqrt{2 M_S\,\|\nabla_S f(x)\|}\; I_{\tau(S)}.$$

Lemma 6.1. We have
$$f(x^k) - f(x^{k+1}) \geq \frac12\left\|\nabla_S f(x^k)\right\|^2_{H_S^{-1}(x^k)}. \qquad (12)$$

Before stating the convergence theorem, it will be convenient to define the stochastic condition number of $\mathbf{H} := \nabla^2 f(x^*)$:
$$\zeta := \lambda_{\min}\left(\mathbf{H}^{\frac12}\,\mathbb{E}\left[S\left(S^\top \mathbf{H} S\right)^{-1}S^\top\right]\mathbf{H}^{\frac12}\right), \qquad (13)$$
as it will drive the local convergence rate of SSCN.

Theorem 6.2 (Local Convergence). Let Assumptions 2.1 and 5.9 hold, and suppose that $\psi\equiv 0$. For any $\epsilon > 0$ there exists $\delta > 0$ such that if $F(x^0) - F^* \leq \delta$, we have
$$\mathbb{E}\left[F(x^k)\right] - F^* \leq \left(1 - (1-\epsilon)\,\zeta\right)^k\left(F(x^0) - F^*\right), \qquad (14)$$
and therefore the local complexity of SSCN is $O\!\left(\zeta^{-1}\log\frac{1}{\varepsilon}\right)$. If further $M = 0$ (i.e., $f$ is quadratic), then $\epsilon = 0$ and $\delta = \infty$, and thus the rate is global.

The proof of Theorem 6.2, along with the exact formulas for $\epsilon$ and $\delta$, can be found in Section E of the Appendix. Theorem 6.2 provides a local linear convergence rate of SSCN.
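To make the quantity $\zeta$ concrete, the following short sketch (an illustration, not from the paper) computes it for single-coordinate uniform sampling, where $\mathbb{E}\left[S(S^\top\mathbf{H}S)^{-1}S^\top\right] = \frac1d\,\mathrm{diag}(\mathbf{H})^{-1}$; the example Hessian is an assumption made for the demonstration.

```python
import numpy as np

def zeta_coordinate_sampling(H):
    """zeta = lambda_min( H^{1/2} E[S (S^T H S)^{-1} S^T] H^{1/2} ) for
    S = e_i with probability 1/d, where E[S (S^T H S)^{-1} S^T] = (1/d) diag(1/H_ii).
    H is the Hessian at the optimum (symmetric positive definite)."""
    d = H.shape[0]
    E = np.diag(1.0 / np.diag(H)) / d          # E[S (S^T H S)^{-1} S^T]
    w, V = np.linalg.eigh(H)
    H_half = V @ np.diag(np.sqrt(w)) @ V.T     # matrix square root H^{1/2}
    return np.linalg.eigvalsh(H_half @ E @ H_half).min()

# For H = I_d every coordinate is equally useful and zeta = 1/d, i.e. the local
# complexity O(zeta^{-1} log(1/eps)) = O(d log(1/eps)) of coordinate descent.
print(zeta_coordinate_sampling(np.eye(5)))     # ~ 0.2
```

For full-space sampling ($S = I_d$ with probability 1), the same formula gives $\zeta = 1$, which is the situation discussed in Remark 6 below.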
While one might expect a superlinear rate to be achievable, this is not the case, and we argue that the rate from Theorem 6.2 is the best one can hope for. In particular, if $M = 0$, Algorithm 1 becomes subspace descent for minimizing a positive definite quadratic, which is a specific instance of sketch-and-project (Gower & Richtárik, 2015). However, sketch-and-project only converges linearly: the iteration complexity of sketch-and-project for minimizing $(x-x^*)^\top \mathbf{A}(x-x^*)$ with $\mathbf{A}\succ 0$ is
$$O\!\left(\left[\lambda_{\min}\left(\mathbf{A}^{\frac12}\,\mathbb{E}\left[S\left(S^\top\mathbf{A}S\right)^{-1}S^\top\right]\mathbf{A}^{\frac12}\right)\right]^{-1}\log\frac{1}{\varepsilon}\right).$$
Notice that this rate is matched by Theorem 6.2 in this case.

Next, we compare the local rate of SSCN to the rate of SDNA (Qu et al., 2016). To the best of our knowledge, SDNA requires the fewest oracle calls to minimize $f$ among all first-order non-accelerated methods.

Remark 5. SDNA is a first-order analogue of Algorithm 1 with $S = (I_d)_{(:,S)}$. In particular, given a matrix $\mathbf{L}$ such that $\mathbf{L} \succeq \nabla^2 f(x) \succ 0$ for all $x$, the update rule of SDNA is
$$x^+ = x - S\left(S^\top\mathbf{L}S\right)^{-1}\nabla_S f(x),$$
where $S = (I_d)_{(:,S)}$ for a random subset of columns $S$. SDNA enjoys a linear convergence rate with leading complexity term $\left[\mu\,\lambda_{\min}\left(\mathbb{E}\left[S\left(S^\top\mathbf{L}S\right)^{-1}S^\top\right]\right)\right]^{-1}$. The leading complexity term of SSCN is $\zeta^{-1}$, and we can bound
$$\zeta \;\geq\; \lambda_{\min}(\mathbf{H})\,\lambda_{\min}\left(\mathbb{E}\left[S\left(S^\top\mathbf{H}S\right)^{-1}S^\top\right]\right) \;\geq\; \mu\,\lambda_{\min}\left(\mathbb{E}\left[S\left(S^\top\mathbf{L}S\right)^{-1}S^\top\right]\right).$$
Hence, the local rate of SSCN is no worse than the rate of SDNA. Furthermore, both of the above inequalities might be very loose in some cases (i.e., there are examples where the ratio $\zeta / \left(\mu\,\lambda_{\min}\left(\mathbb{E}\left[S(S^\top\mathbf{L}S)^{-1}S^\top\right]\right)\right)$ can be arbitrarily high). Therefore, the local convergence rate of SSCN might be arbitrarily better than the convergence rate of SDNA. As a consequence, the local convergence of SSCN is better than the convergence rate of any non-accelerated first-order method.⁸

Lastly, the local convergence rate provided by Theorem 6.2 recovers the superlinear rate of the cubic regularized Newton method, as the next remark states.

Remark 6. If $S = I_d$ with probability 1, Algorithm 1 becomes the cubic regularized Newton method (Griewank, 1981; Nesterov & Polyak, 2006). For $\mathbf{H} := \nabla^2 f(x^*)$ we have
$$\zeta = \lambda_{\min}\left(\mathbf{H}^{\frac12}\mathbf{H}^{-1}\mathbf{H}^{\frac12}\right) = \lambda_{\min}(I_d) = 1.$$
As a consequence of Theorem 6.2, for any $\epsilon > 0$ there exists $\delta > 0$ such that if $F(x) - F(x^*) \leq \delta$, we have $F(x^+) - F(x^*) \leq \epsilon\left(F(x) - F(x^*)\right)$. Therefore, we obtain a superlinear convergence rate.

⁸The rate of SSCN and the rate of accelerated subspace descent methods are not directly comparable: while the (local) rate of SSCN might be better than the rate of ACD, the reverse might happen as well. However, both ACD and SSCN are faster than non-accelerated subspace descent.

7. Applications

7.1. Linear models

For simplicity, consider only $S = (I_d)_{(:,S)}$. Let
$$F(x) = \frac1n\sum_{i=1}^n \varphi_i\big(\langle a_i, x\rangle\big) + \psi(x), \qquad (15)$$
define $f(x) := \frac1n\sum_{i=1}^n \varphi_i(\langle a_i, x\rangle)$, and suppose that $|\varphi_i'''(y)| \leq c$. Then clearly,
$$\nabla^3 f(x)[h]^3 = \frac1n\sum_{i=1}^n \varphi_i'''\big(\langle a_i, x\rangle\big)\,\langle a_i, h\rangle^3$$
for any $h\in\mathbb{R}^d$. While evaluating $E := \max_{\|h\|=1,\,x}\nabla^3 f(x)[h]^3$ is infeasible, we might bound it instead via
$$E \;\leq\; \max_{\|h\|=1}\ \frac{c}{n}\sum_{i=1}^n \left|\langle a_i, h\rangle\right|^3 \;\leq\; \frac{c}{n}\sum_{i=1}^n \|a_i\|^3, \qquad (16)$$
which means that $M = \frac{c}{n}\sum_{i=1}^n \|a_i\|^3$ is a feasible choice. On the other hand, for $S = \{j\}$ we have
$$\max_{|h_j|=1,\,x}\nabla^3 f(x)[h_j e_j]^3 = \max_x \nabla^3 f(x)[e_j]^3 \leq \frac{c}{n}\sum_{i=1}^n |a_{ij}|^3,$$
and thus we might set $M_j = \frac{c}{n}\sum_{i=1}^n |a_{ij}|^3$. The next lemma compares the above choices of $M$ and $M_j$.

Lemma 7.1. We have $M \geq \max_j M_j$. At the same time, there exist vectors $\{a_i\}$ such that $\max_j M_j = M d^{-3/2}$.

Proof. The first part is trivial. For the second part, consider $a_{i,j}\in\{-1,1\}$.

Remark 7. One might avoid the last inequality in (16) using polynomial optimization; however, this might be more expensive than solving the original optimization problem and is thus not preferable. Another strategy would be to use a line search; see Section 3.1.
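As a small illustration (not from the paper), the following sketch computes the two estimates $M = \frac{c}{n}\sum_i\|a_i\|^3$ and $M_j = \frac{c}{n}\sum_i|a_{ij}|^3$ of Section 7.1 for a given data matrix. The constant `c` is the assumed uniform bound on $|\varphi_i'''|$ (for logistic loss one can take $c = \frac16$; see Lemma 7.2 below), and the random data are used only for demonstration.

```python
import numpy as np

def lipschitz_estimates(A, c):
    """Subspace Hessian-Lipschitz estimates for f(x) = (1/n) sum_i phi_i(<a_i, x>),
    assuming |phi_i'''| <= c.  Rows of A are the data vectors a_i.

    Returns:
      M       -- full-space estimate  (c/n) * sum_i ||a_i||^3   (from bound (16))
      M_coord -- per-coordinate estimates (c/n) * sum_i |a_ij|^3, one per j
    """
    n = A.shape[0]
    M = c / n * np.sum(np.linalg.norm(A, axis=1) ** 3)
    M_coord = c / n * np.sum(np.abs(A) ** 3, axis=0)
    return M, M_coord

# The per-coordinate constants are typically much smaller than M, which is what
# makes the coordinate version of SSCN use a tighter model of the objective.
A = np.random.default_rng(1).standard_normal((100, 20))
M, M_coord = lipschitz_estimates(A, c=1.0 / 6.0)
print(M, M_coord.max())
```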
Both the formula for $M$ and the formula for $M_j$ require prior knowledge of a constant $c \geq 0$ such that $|\varphi_i'''(y)| \leq c$ for all $i$. The next lemma shows how to compute such a $c$ for logistic regression (a binary classification model).

Lemma 7.2. Let $\varphi_i(y) = \log\left(1 + e^{-b_i y}\right)$, where $b_i\in\{-1,1\}$. Then $c = \frac16$.

Proof. With $x = -b_i y$ we have $|\varphi_i'''(y)| = \frac{e^{x}\left|e^{x}-1\right|}{(1+e^{x})^3} \leq \frac16$.

Cost of performing a single iteration. For the sake of simplicity, let $\tau(S) = 1$ and $\psi\equiv 0$. Any CD method (i.e., a method with update rule (2) and $S\in\{e_1,\dots,e_d\}$) can be efficiently implemented by memorizing the residuals $\langle a_i, x^k\rangle$, which are cheap to track since $x^{k+1}-x^k$ is a sparse vector. The overall cost of updating the residuals is $O(n)$, while the cost of computing $\nabla_i f(x)$ and $\nabla^2_{ii} f(x)$ (given that the residuals are stored) is also $O(n)$. Therefore, the overall cost of performing a single iteration is $O(n)$. Generalizing to $\tau(S) = \tau \geq 1$, the overall cost of a single iteration of SSCN can be estimated as $O(n\tau^2 + \tau^3)$, where $O(n\tau^2)$ comes from evaluating the subspace gradient and Hessian, while $O(\tau^3)$ comes from solving the cubic subproblem.

7.2. Dual of linear models

So far, all the results and applications we mentioned concerned problems with a large model size $d$. In this section we describe how SSCN can be used to efficiently tackle big-data problems in some settings. Let $A\in\mathbb{R}^{n\times d}$ be a data matrix and consider a specific instance of (15):
$$\min_{x\in\mathbb{R}^d}\ F_P(x) := \frac1n\sum_{i=1}^n \rho_i\big((Ax)_i\big) + \frac{\lambda}{2}\|x\|^2, \qquad (17)$$
where $\rho_i$ is convex for all $i$. One can now formulate a dual problem of (17) as follows:
$$\max_{y\in\mathbb{R}^n}\ F_D(y) := -\frac{1}{2\lambda n^2}\left\|A^\top y\right\|^2 - \frac1n\sum_{i=1}^n \rho_i^*(-y_i). \qquad (18)$$
Note that (18) is of the form (15), and therefore if $\rho_i^*$ has a Lipschitz continuous Hessian, we can apply SSCN to solve it efficiently (as in Section 7.1). Given the solution of (18), we can recover the solution of (17) via duality theory. Thus, SSCN can be used as a data-stochastic method to solve finite-sum optimization problems.

The trick described in this section is rather well known. It was first used in (Shalev-Shwartz & Zhang, 2013), where CD applied to the problem (18) (SDCA) was shown to be competitive with variance reduced methods like SAG (Roux et al., 2012), SVRG (Johnson & Zhang, 2013) or SAGA (Defazio et al., 2014).

8. Experiments

We now numerically verify our theoretical claims. Due to space limitations, we only present a fraction of all experiments here; the remaining part, together with the exact setup for these experiments, can be found in Section B of the Appendix.

We consider binary classification with LIBSVM (Chang & Lin, 2011) data, modelled by regularized logistic regression. We compare SSCN against three different instances of (first-order) randomized coordinate descent: CD with uniform sampling, CD with importance sampling (Nesterov, 2012), and accelerated CD with importance sampling (Allen-Zhu et al., 2016; Nesterov & Stich, 2017). In order to be comparable with the mentioned first-order methods, we consider $S\in\{e_1,\dots,e_d\}$ with probability 1; the complexity of performing each iteration is then about the same for each algorithm. At the same time, computing $M_{e_i}$ for all $1\leq i\leq d$ costs $O(nd)$, which is the same cost as computing the coordinate-wise smoothness constants for (accelerated) coordinate descent (see Section 7.1 for the details). Figure 1 shows the result.

Figure 1. Comparison of CD with uniform sampling, CD with importance sampling, accelerated CD with importance sampling, and SSCN with uniform sampling on LIBSVM datasets.

In all examples, SSCN outperformed CD with uniform sampling.
Moreover, the performance of SSCN was always either about the same as, or significantly better than, that of CD with importance sampling. Furthermore, SSCN was also competitive with accelerated CD with importance sampling (in about half of the cases SSCN was faster, while in the other half accelerated CD was faster).

The next experiment studies the effect of $\tau(S)$ on the convergence. We compare SSCN against the fastest non-accelerated first-order method, SDNA, both with varying $\tau(S)$. Figure 2 presents the result. As expected, SSCN outperformed SDNA.

Figure 2. SSCN vs. SDNA on LIBSVM datasets. All algorithms with uniform sampling. The legend indicates the values of $\tau(S)$.

9. Future Work

Lastly, we list several possible extensions of our work.

Acceleration. We believe it would be valuable to incorporate Nesterov's momentum into Algorithm 1. Ideally, one would like to obtain a global rate in between the convergence rate of accelerated cubic regularized Newton (Nesterov, 2008) and accelerated CD (Allen-Zhu et al., 2016; Nesterov & Stich, 2017). On the other hand, the local rate (for strongly convex objectives) should recover accelerated sketch-and-project (Tu et al., 2017; Gower et al., 2018). If accelerated sketch-and-project is optimal (this is yet to be established), then accelerated SSCN (again, given that it recovers accelerated sketch-and-project) would be a locally optimal algorithm as well.

Non-separable ψ. As mentioned in Section 5.1, one should not hope for linear convergence of SSCN if $\psi$ is not separable, as the iterates can jump away from the optimum in such a case. This issue has been resolved for first-order methods using control variates (Hanzely et al., 2018), resulting in SEGA. Therefore, the development of a second-order SEGA remains an interesting open problem.

Inexact method. SSCN is applicable in the setup where the function $f$ is accessible via a zeroth-order oracle only. In such a case, for any $S\in\mathbb{R}^{d\times\tau(S)}$ we can estimate $\nabla_S f(x)$ and $\nabla^2_S f(x)$ using $O(\tau^2)$ function value evaluations. However, since both $\nabla_S f(x)$ and $\nabla^2_S f(x)$ are then only evaluated inexactly, a slight modification of our theory is required.

Non-uniform sampling. Note that our local theory allows for an arbitrary non-uniform distribution of $S$, which might potentially be exploited. At the same time, in some applications it might be feasible to use a greedy selection rule for $S$ (our theory does not support that). While developing an optimal and implementable importance sampling for the local convergence is beyond the scope of this paper,⁹ we sketch several possible sampling strategies that might yield faster convergence.¹⁰

Let $\mathbb{P}(S\in\{e_1,e_2,\dots,e_d\}) = 1$. If we evaluate the diagonal of the Hessian close to the optimum (cost $O(nd)$ for linear models) and sample proportionally to it, we obtain a local linear rate with leading complexity term $\frac{\mathrm{Tr}\left(\nabla^2 f(x^*)\right)}{\lambda_{\min}\left(\nabla^2 f(x^*)\right)}$.

It is unclear how to design an efficient importance sampling for minibatch (i.e., $1 < \mathbb{E}[\tau(S)] < d$) methods. Determinantal point processes (DPP) (Rodomanov & Kropotov, 2019; Mutný et al., 2019) were proposed to speed up SDNA from (Qu et al., 2016) (i.e., the analogous CD method with a static matrix upper bound); we thus believe they might be applicable in our setting too. However, in such a case, one would need to evaluate the whole Hessian close to the optimum, which is infeasible for applications where $d$ is large. It is known that SDNA (see the related literature) is faster than minibatch CD under the ESO assumption (Qu & Richtárik, 2016a;b).
Therefore, we might instead apply the minibatch importance sampling for the ESO assumption from (Hanzely & Richtárik, 2019) (which corresponds to optimizing an upper bound on the iteration complexity). Using this sampling, we only require evaluating the diagonal of the Hessian at some point close to the optimum, which is of the same cost as computing the full gradient for linear models and is thus feasible.

It is a natural question to ask whether one can speed up convergence using a greedy rule instead of a random one. For standard CD, the greedy rule was shown to have a superior iteration complexity to any randomized rule (Nutini et al., 2015; Karimireddy et al., 2019). For simplicity, consider the case where $\mathbb{P}(S\in\{e_1,e_2,\dots,e_d\}) = 1$. Far from the optimum, the (approximate) greedy rule at iteration $k$ chooses the index
$$i = \arg\max_j\ |\nabla_j f(x^k)|^{\frac32}\, M_{e_j}^{-\frac12}.$$
Close to the optimum, if the diagonal of the Hessian has been evaluated, the (approximate) greedy index would be
$$\arg\max_j\ |\nabla_j f(x^k)|^2\,\big(\nabla^2_{jj} f(x)\big)^{-1}.$$
For linear models, both of the mentioned rules are implementable using efficient nearest neighbour search (Dhillon et al., 2011) with sublinear complexity in terms of $d$.

Acknowledgements

The work of the second and the fourth author was supported by ERC Advanced Grant 788368.

⁹As this is still an open problem even for sketch-and-project (Gower & Richtárik, 2015).
¹⁰This only applies to the local results, as the global convergence requires some uniformity; see Assumption 5.1.

References

Allen-Zhu, Z., Qu, Z., Richtárik, P., and Yuan, Y. Even faster accelerated coordinate descent using non-uniform sampling. In International Conference on Machine Learning, pp. 1110-1119, 2016.

Byrd, R. H., Chin, G. M., Neveitt, W., and Nocedal, J. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977-995, 2011.

Carmon, Y. and Duchi, J. Gradient descent finds the cubic-regularized nonconvex Newton step. SIAM Journal on Optimization, 29(3):2146-2178, 2019.

Cartis, C. and Scheinberg, K. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, 169(2):337-375, 2018.

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245-295, 2011a.

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295-319, 2011b.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Chang, K.-W., Hsieh, C.-J., and Lin, C.-J. Coordinate descent method for large-scale L2-loss linear support vector machines. Journal of Machine Learning Research, 9(Jul):1369-1398, 2008.

Conn, A. R., Gould, N. I., and Toint, P. L. Trust Region Methods, volume 1. SIAM, 2000.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646-1654, 2014.

Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Nearest neighbor based greedy coordinate descent. In Advances in Neural Information Processing Systems, pp. 2160-2168, 2011.

Doikov, N. and Nesterov, Y.
Minimizing uniformly convex functions by cubic regularization of Newton method. arXiv preprint arXiv:1905.02671, 2019.

Doikov, N. and Richtárik, P. Randomized block cubic Newton method. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1290-1298, Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/doikov18a.html.

Erdogdu, M. A. and Montanari, A. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems 28, 2015.

Gould, N. I., Robinson, D. P., and Thorne, H. S. On solving trust-region and other regularised subproblems in optimization. Mathematical Programming Computation, 2(1):21-57, 2010.

Gower, R., Hanzely, F., Richtárik, P., and Stich, S. U. Accelerated stochastic matrix inversion: general theory and speeding up BFGS rules for faster second-order optimization. In Advances in Neural Information Processing Systems, pp. 1619-1629, 2018.

Gower, R. M. and Richtárik, P. Stochastic dual ascent for solving linear systems. arXiv:1512.06890, 2015.

Gower, R. M. and Richtárik, P. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660-1690, 2015.

Gower, R. M. and Richtárik, P. Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM Journal on Matrix Analysis and Applications, 38(4):1380-1409, 2017.

Gower, R. M., Kovalev, D., Lieder, F., and Richtárik, P. RSN: Randomized Subspace Newton. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 616-625. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8351-rsn-randomized-subspace-newton.pdf.

Grapiglia, G. and Nesterov, Y. Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM Journal on Optimization, 27(1):478-506, 2017.

Grapiglia, G. N. and Nesterov, Y. Accelerated regularized Newton methods for minimizing composite convex functions. SIAM Journal on Optimization, 29(1):77-99, 2019.

Griewank, A. The modification of Newton's method for unconstrained optimization by bounding cubic terms. Technical Report NA/12, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, 1981.

Hanzely, F. and Richtárik, P. Accelerated coordinate descent with arbitrary sampling and best rates for minibatches. In Proceedings of Machine Learning Research, pp. 304-312. PMLR, 2019.

Hanzely, F., Mishchenko, K., and Richtárik, P. SEGA: Variance reduction via gradient sketching. In Advances in Neural Information Processing Systems, pp. 2082-2093, 2018.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795-811. Springer, 2016.

Karimireddy, S. P., Stich, S. U., and Jaggi, M. Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients. arXiv preprint arXiv:1806.00413, 2018.

Karimireddy, S. P., Koloskova, A., Stich, S. U., and Jaggi, M.
Efficient greedy coordinate descent for composite problems. In International Conference on Artificial Intelligence and Statistics, pp. 2887-2896, 2019.

Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1895-1904. JMLR.org, 2017.

Kovalev, D., Richtárik, P., Gorbunov, E., and Gasanov, E. Stochastic spectral and conjugate descent methods. In Advances in Neural Information Processing Systems, pp. 3358-3367, 2018.

Kovalev, D., Mishchenko, K., and Richtárik, P. Stochastic Newton and cubic Newton methods with simple local linear-quadratic rates. arXiv preprint arXiv:1912.01597, 2019.

Kozak, D., Becker, S., Doostan, A., and Tenorio, L. Stochastic subspace descent. arXiv preprint arXiv:1904.01145, 2019.

Monteiro, R. D. and Svaiter, B. F. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092-1125, 2013.

Mutný, M., Dereziński, M., and Krause, A. Convergence analysis of the randomized Newton method with determinantal sampling. arXiv preprint arXiv:1910.11561, 2019.

Nesterov, Y. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159-181, 2008.

Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

Nesterov, Y. Lectures on Convex Optimization, volume 137. Springer, 2018.

Nesterov, Y. Inexact basic tensor methods. CORE Discussion Papers 2019/23, 2019.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

Nesterov, Y. and Stich, S. U. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110-123, 2017.

Nutini, J., Schmidt, M., Laradji, I., Friedlander, M., and Koepke, H. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In International Conference on Machine Learning, pp. 1632-1641, 2015.

Pilanci, M. and Wainwright, M. J. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205-245, 2017.

Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling I: Algorithms and complexity. Optimization Methods and Software, 31(5):829-857, 2016a.

Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling II: Expected separable overapproximation. Optimization Methods and Software, 31(5):858-884, 2016b.

Qu, Z., Richtárik, P., Takáč, M., and Fercoq, O. SDNA: Stochastic dual Newton ascent for empirical risk minimization. In International Conference on Machine Learning, pp. 1823-1832, 2016.

Richtárik, P. and Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1-38, 2014.

Richtárik, P. and Takáč, M. Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv preprint arXiv:1706.01108, 2017.

Rodomanov, A. and Kropotov, D. A randomized coordinate descent method with volume sampling. arXiv preprint arXiv:1904.04587, 2019.

Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled Newton methods. Mathematical Programming, 174(1-2):293-326, 2019.

Roux, N.
L., Schmidt, M., and Bach, F. R. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663-2671, 2012.

Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013.

Stich, S. U., Müller, C. L., and Gärtner, B. Optimization of convex functions with random pursuit. SIAM Journal on Optimization, 23(2):1284-1309, 2013.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 2899-2908, 2018.

Tu, S., Venkataraman, S., Wilson, A. C., Gittens, A., Jordan, M. I., and Recht, B. Breaking locality accelerates block Gauss-Seidel. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3482-3491. JMLR.org, 2017.

Wang, Z., Zhou, Y., Liang, Y., and Lan, G. Stochastic variance-reduced cubic regularization for nonconvex optimization. arXiv preprint arXiv:1802.07372, 2018.

Wright, S. J. Coordinate descent algorithms. Mathematical Programming, 151(1):3-34, 2015.

Xu, P., Yang, J., Roosta-Khorasani, F., Ré, C., and Mahoney, M. W. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, pp. 3000-3008, 2016.

Xu, P., Roosta, F., and Mahoney, M. W. Newton-type methods for non-convex optimization under inexact Hessian information. Mathematical Programming, pp. 1-36, 2017.