Randomized Block Cubic Newton Method

Nikita Doikov¹, Peter Richtárik²,³,⁴

Abstract

We study the problem of minimizing the sum of three convex functions: a differentiable, a twice-differentiable and a non-smooth term, in a high-dimensional setting. To this effect we propose and analyze a randomized block cubic Newton (RBCN) method, which in each iteration builds a model of the objective function formed as the sum of the natural models of its three components: a linear model with a quadratic regularizer for the differentiable term, a quadratic model with a cubic regularizer for the twice-differentiable term, and a perfect (proximal) model for the nonsmooth term. In each iteration, our method minimizes the model over a random subset of blocks of the search variable. RBCN is the first algorithm with these properties, generalizing several existing methods and matching the best known bounds in all special cases. We establish O(1/ε), O(1/√ε) and O(log(1/ε)) rates under different assumptions on the component functions. Lastly, we show numerically that our method outperforms the state of the art on a variety of machine learning problems, including cubically regularized least-squares, logistic regression with constraints, and Poisson regression.

1. Introduction

In this paper we develop an efficient randomized algorithm for solving an optimization problem of the form

min_{x ∈ Q} F(x) := g(x) + φ(x) + ψ(x),  (1)

¹National Research University Higher School of Economics, Samsung-HSE Laboratory, Moscow, Russia. ²King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. ³University of Edinburgh, Edinburgh, United Kingdom. ⁴Moscow Institute of Physics and Technology, Dolgoprudny, Russia. Correspondence to: Nikita Doikov, Peter Richtárik.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
where Q ⊆ R^N is a closed convex set, and g, φ and ψ are convex functions with different smoothness and structural properties. Our aim is to capitalize on these different properties in the design of our algorithm. We assume that g has a Lipschitz gradient¹, φ has a Lipschitz Hessian, while ψ is allowed to be nonsmooth, albeit simple.

1.1. Block Structure

Moreover, we assume that the N coordinates of x are partitioned into n blocks of sizes N_1, …, N_n, with ∑_i N_i = N, and then write x = (x^{(1)}, …, x^{(n)}), where x^{(i)} ∈ R^{N_i}. This block structure is typically dictated by the particular application considered. Once the block structure is fixed, we further assume that φ and ψ are block separable. That is, φ(x) = ∑_{i=1}^n φ_i(x^{(i)}) and ψ(x) = ∑_{i=1}^n ψ_i(x^{(i)}), where the φ_i are twice differentiable with Lipschitz Hessians, and the ψ_i are closed convex (and possibly nonsmooth) functions. Revealing this block structure, problem (1) takes the form

min_{x ∈ Q} F(x) := g(x) + ∑_{i=1}^n φ_i(x^{(i)}) + ∑_{i=1}^n ψ_i(x^{(i)}).  (2)

We are specifically interested in the case when n is big, in which case it makes sense to update only a small number of blocks in each iteration.

1.2. Related Work

There has been a substantial and growing volume of research related to second-order and block-coordinate optimization. In this part we briefly mention some of the papers most relevant to the present work. A major leap in second-order optimization theory was made when the cubic Newton method was proposed by Griewank (1981) and independently rediscovered by Nesterov & Polyak (2006), who also provided global complexity guarantees. Cubic regularization was equipped with acceleration by Nesterov (2008) and adaptive stepsizes by Cartis et al. (2011a;b), and extended to a universal framework by Grapiglia & Nesterov (2017). The universal schemes can automatically

¹Our assumption is a bit more general than this; see Assumptions 1 and 2 for details.
adjust to the implicit smoothness level of the objective. Cubically regularized second-order schemes for solving systems of nonlinear equations were developed by Nesterov (2007), and randomized variants for stochastic optimization were considered by Tripuraneni et al. (2017); Ghadimi et al. (2017); Kohler & Lucchi (2017); Cartis & Scheinberg (2018). Despite their attractive global iteration complexity guarantees, the weakness of second-order methods in general, and of cubic Newton in particular, is their high computational cost per iteration. This issue remains the subject of active research. For successful theoretical results related to the approximation of the cubic step we refer to (Agarwal et al., 2016) and (Carmon & Duchi, 2016). At the same time, there are many successful attempts to use block-coordinate randomization to accelerate first-order (Tseng & Yun, 2009; Richtárik & Takáč, 2014; 2016) and second-order (Qu et al., 2016; Mutný & Richtárik, 2018) methods. In this work we address the issue of combining block-coordinate randomization with cubic regularization, to obtain a second-order method with proven global complexity guarantees and a low cost per iteration. A powerful advance in convex optimization theory was the advent of composite or proximal first-order methods (see (Nesterov, 2013) as a modern reference). This technique has become available as an algorithmic tool in the block-coordinate setting as well (Richtárik & Takáč, 2014; Qu et al., 2016). Our aim in this work is the development of a composite cubically regularized second-order method.

1.3. Contributions

We propose a new randomized second-order proximal algorithm for solving convex optimization problems of the form (2). Our method, Randomized Block Cubic Newton (RBCN) (see Algorithm 1), treats the three functions appearing in (1) differently, according to their nature.
Our method is a randomized block method because in each iteration we update only a random subset of the n blocks. This facilitates faster convergence, and is suited to problems where n is very large. Our method is proximal because we keep the functions ψ_i in our model, which is minimized in each iteration, without any approximation. Our method is a cubic Newton method because we approximate each φ_i using a cubically regularized second-order model. We are not aware of any method that can solve (2) using the most appropriate models of the three functions (quadratic with a constant Hessian for g, cubically regularized quadratic for φ, and no model for ψ), not even in the case n = 1. Our approach generalizes several existing results:

- In the case when n = 1, g = 0 and ψ = 0, RBCN reduces to the cubically regularized Newton method of Nesterov & Polyak (2006). Even when n = 1, RBCN can be seen as an extension of this method to composite optimization. For n > 1, RBCN provides an extension of the algorithm in Nesterov & Polyak (2006) to the randomized block-coordinate setting, popular for high-dimensional problems.

- In the special case when φ = 0 and N_i = 1 for all i, RBCN specializes to the stochastic Newton (SN) method of Qu et al. (2016). Applied to the empirical risk minimization problem (see Section 7), our method has a dual interpretation (see Algorithm 2). In this case, our method reduces to the stochastic dual Newton ascent (SDNA) method, also described in (Qu et al., 2016). Hence, RBCN can be seen as an extension of SN and SDNA to blocks of arbitrary sizes, and to the inclusion of the twice-differentiable term φ.

- In the case when φ = 0 and the simplest overapproximation of g is assumed (0 ⪯ ∇²g(x) ⪯ LI), the composite block-coordinate gradient method of Tseng & Yun (2009) can be applied to solve (1). Our method extends this in two directions: we add the twice-differentiable terms φ, and use a tighter model for g, using all global curvature information (if available).
We prove high-probability global convergence guarantees under several regimes, summarized next:

- Under no additional assumptions on g, φ and ψ beyond convexity (and either boundedness of Q, or boundedness of the level sets of F on Q), we prove the rate O(n/(τε)), where τ is the mini-batch size (see Theorem 1).

- Under certain conditions combining the properties of g with the way the random blocks are sampled, formalized by the assumption β > 0 (see (12) for the definition of β), we obtain the rate O((n/τ) max{1, 1/β} (1/√ε)) (see Theorem 2). In the special case when n = 1, we necessarily have τ = 1 and β = µ/L (the reciprocal of the condition number of g), and we get the rate O((L/µ)(1/√ε)). If g is quadratic and τ = n, then β = 1 and the resulting complexity O(1/√ε) recovers the rate of cubic Newton established by Nesterov & Polyak (2006).

- Finally, if g is strongly convex, the above result can be improved (see Theorem 3) to O((n/τ) max{1, 1/β} log(1/ε)).

1.4. Contents

The rest of the paper is organized as follows. In Section 2 we introduce the notation and elementary identities needed to efficiently handle the block structure of our model. In Section 3 we make the various smoothness and convexity assumptions on g and φ_i formal. Section 4 is devoted to the description of the block sampling process used in our method, along with some useful identities. In Section 5 we formally describe our randomized block cubic Newton (RBCN) method. Section 6 is devoted to the statement and description of our main convergence results, summarized in the introduction. Missing proofs are provided in the supplementary material. In Section 7 we show how to apply our method to the empirical risk minimization problem. Applying RBCN to its dual leads to Algorithm 2. Finally, our numerical experiments on synthetic and real datasets are described in Section 8.

2. Block Structure

To model a block structure, we decompose the space R^N into n subspaces in the following standard way.
Let U ∈ R^{N×N} be a column permutation of the N×N identity matrix I, and let a decomposition U = [U_1, U_2, …, U_n] be given, where the U_i ∈ R^{N×N_i} are n submatrices, N = ∑_{i=1}^n N_i. Subsequently, any vector x ∈ R^N can be uniquely represented as x = ∑_{i=1}^n U_i x^{(i)}, where x^{(i)} := U_i^T x ∈ R^{N_i}. In what follows we will use the standard Euclidean inner product ⟨x, y⟩ := ∑_i x_i y_i, the Euclidean norm of a vector ‖x‖ := ⟨x, x⟩^{1/2}, and the induced spectral norm of a matrix ‖A‖ := max_{‖x‖=1} ‖Ax‖. Using the block decomposition, for two vectors x, y ∈ R^N we have

⟨x, y⟩ = ⟨∑_{i=1}^n U_i x^{(i)}, ∑_{i=1}^n U_i y^{(i)}⟩ = ∑_{i=1}^n ⟨x^{(i)}, y^{(i)}⟩.

For a given nonempty subset S of [n] := {1, …, n} and for any vector x ∈ R^N we denote by x_[S] ∈ R^N the vector obtained from x by retaining only the blocks x^{(i)} for which i ∈ S and zeroing all others:

x_[S] := ∑_{i∈S} U_i x^{(i)} = (∑_{i∈S} U_i U_i^T) x.

Furthermore, for any matrix A ∈ R^{N×N} we write A_[S] ∈ R^{N×N} for the matrix obtained from A by retaining only the elements whose indices both belong to coordinate blocks from S, formally:

A_[S] := (∑_{i∈S} U_i U_i^T) A (∑_{i∈S} U_i U_i^T).

Note that these definitions imply that ⟨A_[S] x, y⟩ = ⟨A x_[S], y_[S]⟩ for all x, y ∈ R^N. Next, we define the block-diagonal operator, which, up to a permutation of coordinates, retains the diagonal blocks and nullifies the off-diagonal blocks:

blockdiag(A) := ∑_{i=1}^n U_i U_i^T A U_i U_i^T = ∑_{i=1}^n A_[{i}].

Finally, denote R^N_[S] := { x_[S] | x ∈ R^N }. This is a linear subspace of R^N composed of vectors which are zero in the blocks i ∉ S.

3. Assumptions

In this section we formulate our main assumptions about the differentiable components of (2) and provide some examples to illustrate the concepts. We assume that g : R^N → R is a differentiable function and that all φ_i : R^{N_i} → R, i ∈ [n], are twice differentiable. Thus, at any point x ∈ R^N we should be able to compute all the gradients {∇g(x), ∇φ_1(x^{(1)}), …, ∇φ_n(x^{(n)})} and the Hessians {∇²φ_1(x^{(1)}), …, ∇²φ_n(x^{(n)})}, or at least their actions on an arbitrary vector h of appropriate dimension. Next, we formalize our assumptions about convexity and the level of smoothness.
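Before turning to the assumptions, the block notation of Section 2 can be made concrete with a small NumPy sketch (block sizes and data below are illustrative assumptions, not from the paper), which also verifies the identity ⟨A_[S]x, y⟩ = ⟨Ax_[S], y_[S]⟩ and the representation blockdiag(A) = ∑_i A_[{i}]:

```python
import numpy as np

# Toy block structure: N = 5 coordinates split into n = 3 blocks of
# sizes 2, 2, 1 (sizes chosen only for illustration).
blocks = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
N = 5

def restrict_vec(x, S):
    """x_[S]: keep the blocks with indices in S, zero out the rest."""
    y = np.zeros_like(x)
    for i in S:
        y[blocks[i]] = x[blocks[i]]
    return y

def restrict_mat(A, S):
    """A_[S] = (sum_{i in S} U_i U_i^T) A (sum_{i in S} U_i U_i^T)."""
    mask = np.zeros(N)
    for i in S:
        mask[blocks[i]] = 1.0
    return A * np.outer(mask, mask)

def blockdiag(A):
    """Keep the diagonal blocks of A, zero the off-diagonal blocks."""
    B = np.zeros_like(A)
    for idx in blocks:
        B[np.ix_(idx, idx)] = A[np.ix_(idx, idx)]
    return B

rng = np.random.default_rng(0)
A = rng.standard_normal((N, N))
x, y = rng.standard_normal(N), rng.standard_normal(N)
S = [0, 2]

# The identity <A_[S] x, y> = <A x_[S], y_[S]> from the text:
lhs = (restrict_mat(A, S) @ x) @ y
rhs = (A @ restrict_vec(x, S)) @ restrict_vec(y, S)
```

The projection structure of A_[S] is what makes the identity hold: multiplying by the mask on both sides is exactly sandwiching A between the block projectors.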
Speaking informally, g is similar to a quadratic, while the functions φ_i are arbitrary twice-differentiable smooth functions.

Assumption 1 (Convexity). There is a positive semidefinite matrix G ⪰ 0 such that for all x, h ∈ R^N:

g(x + h) ≥ g(x) + ⟨∇g(x), h⟩ + ½⟨Gh, h⟩,  (3)

φ_i(x^{(i)} + h^{(i)}) ≥ φ_i(x^{(i)}) + ⟨∇φ_i(x^{(i)}), h^{(i)}⟩,  i ∈ [n].

In the special case when G = 0, (3) postulates convexity. For positive definite G, the objective is strongly convex with strong convexity parameter µ := λ_min(G) > 0. Note that for the φ_i we only require convexity. However, if we happen to know that some φ_i is strongly convex (λ_min(∇²φ_i(y)) ≥ µ_i > 0 for all y ∈ R^{N_i}), we can move this strong convexity to g by subtracting (µ_i/2)‖x^{(i)}‖² from φ_i and adding it to g. This extra knowledge may in some particular cases improve the convergence guarantees for our algorithm, but it does not change the actual computations.

Assumption 2 (Smoothness of g). There is a positive semidefinite matrix A ⪰ 0 such that for all x, h ∈ R^N:

g(x + h) ≤ g(x) + ⟨∇g(x), h⟩ + ½⟨Ah, h⟩.  (4)

The main example of g is a quadratic function g(x) = ½⟨Mx, x⟩ with a symmetric positive semidefinite M ∈ R^{N×N}, for which both (3) and (4) hold with G = A = M. Of course, any convex g with an L-Lipschitz-continuous gradient, L ≥ 0, satisfies (3) and (4) with G = 0 and A = LI (Nesterov, 2004).

Assumption 3 (Smoothness of φ_i). For every i ∈ [n] there is a nonnegative constant H_i ≥ 0 such that the Hessian of φ_i is Lipschitz-continuous:

‖∇²φ_i(x + h) − ∇²φ_i(x)‖ ≤ H_i ‖h‖,  (5)

for all x, h ∈ R^{N_i}. Examples of functions which satisfy (5) with a known Lipschitz constant H of the Hessian are: the quadratic φ(t) = C‖t − t_0‖² (H = 0 for all values of the parameters), the cubed norm φ(t) = (1/3)‖t − t_0‖³ (H = 2, see Lemma 5 in (Nesterov, 2008)), and the logistic regression loss φ(t) = log(1 + exp(t)) (H = 1/(6√3), see Proposition 1 in the supplementary material). For a fixed set of indices S ⊆ [n] denote

φ_S(x) := ∑_{i∈S} φ_i(x^{(i)}),  x ∈ R^N.
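The logistic-loss constant from the example list can be sanity-checked numerically (a sketch on a finite grid, not a proof): the largest difference quotient of φ'' should not exceed H = 1/(6√3), and should nearly attain it near the inflection points of φ''.

```python
import numpy as np

# phi(t) = log(1 + exp(t)); phi''(t) = s(1 - s) with s = sigmoid(t).
def phi_hess(t):
    s = 1.0 / (1.0 + np.exp(-t))
    return s * (1.0 - s)

H = 1.0 / (6.0 * np.sqrt(3.0))  # claimed Lipschitz constant of phi''

ts = np.linspace(-8.0, 8.0, 401)
d2 = phi_hess(ts)
# Largest difference quotient |phi''(a) - phi''(b)| / |a - b| on the grid.
num = np.abs(d2[:, None] - d2[None, :])
den = np.abs(ts[:, None] - ts[None, :]) + np.eye(len(ts))  # avoid 0/0
ratio = (num / den).max()
```

Since the difference quotient is bounded by the supremum of |φ'''|, `ratio` stays below H while adjacent grid points near t ≈ ±1.32 push it close to H.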
For φ_S defined above, we then have

⟨∇φ_S(x), h⟩ = ∑_{i∈S} ⟨∇φ_i(x^{(i)}), h^{(i)}⟩,  x, h ∈ R^N,

⟨∇²φ_S(x)h, h⟩ = ∑_{i∈S} ⟨∇²φ_i(x^{(i)})h^{(i)}, h^{(i)}⟩,  x, h ∈ R^N.

Lemma 1. If Assumption 3 holds, then for all x, h ∈ R^N we have the following second-order approximation bound:

|φ_S(x + h) − φ_S(x) − ⟨∇φ_S(x), h⟩ − ½⟨∇²φ_S(x)h, h⟩| ≤ (max_{i∈S} {H_i} / 6) ‖h_[S]‖³.  (6)

From now on we denote H_F := max{H_1, H_2, …, H_n}.

4. Sampling of Blocks

In this section we summarize some basic properties of a sampling Ŝ, which is a random set-valued mapping with values being subsets of [n]. For a fixed block decomposition, with each sampling Ŝ we associate a probability matrix P ∈ R^{N×N} as follows: an element of P is the probability of choosing a pair of blocks which contains the indices of this element. Denoting by E ∈ R^{N×N} the matrix of all ones, we have P = E[E_[Ŝ]]. We restrict our analysis to uniform samplings, defined next.

Assumption 4 (Uniform sampling). The sampling Ŝ is uniform, i.e., P(i ∈ Ŝ) = P(j ∈ Ŝ) := p, for all i, j ∈ [n].

The above assumption means that the diagonal of P is constant: P_ii = p for all i ∈ [N]. It is easy to see (Corollary 3.1 in (Qu & Richtárik, 2016)) that

E[A_[Ŝ]] = A ∘ P,  (7)

where ∘ denotes the Hadamard product. Denote τ := E[|Ŝ|] = np (the expected minibatch size). The special uniform sampling defined by picking from all subsets of size τ uniformly at random is called the τ-nice sampling. If Ŝ is τ-nice, then (see Lemma 4.3 in (Qu & Richtárik, 2016))

P = (τ/n) ((1 − γ) blockdiag(E) + γE),  (8)

where γ = (τ − 1)/(n − 1). In particular, the above results in the following:

Lemma 2. For the τ-nice sampling Ŝ, we have

E[A_[Ŝ]] = (τ(n − τ))/(n(n − 1)) blockdiag(A) + (τ(τ − 1))/(n(n − 1)) A.

Proof: Combine (7) and (8).

5. Algorithm

Due to the problem structure (2), and utilizing the smoothness of the components (see (4) and (5)), for a fixed subset of indices S ⊆ [n] it is natural to consider the following model of our objective F around a point x ∈ R^N:

M_{H,S}(x; y) := F(x) + ⟨(∇g(x))_[S], y⟩ + ½⟨A_[S] y, y⟩ + ⟨(∇φ(x))_[S], y⟩ + ½⟨(∇²φ(x))_[S] y, y⟩ + (H/6)‖y_[S]‖³ + ∑_{i∈S} [ψ_i(x^{(i)} + y^{(i)}) − ψ_i(x^{(i)})].
(9)

The above model arises as a combination of a first-order model for g with the global curvature information provided by the matrix A, a second-order model with cubic regularization (following (Nesterov & Polyak, 2006)) for φ, and a perfect model for the non-differentiable terms ψ_i (i.e., we keep these terms as they are). Combining (4) and (6), for large enough H (H ≥ max_{i∈S} H_i is sufficient) we get the global upper bound

F(x + y) ≤ M_{H,S}(x; y),  x ∈ R^N, y ∈ R^N_[S].

Moreover, the value of all summands in M_{H,S}(x; y) depends only on the subset of blocks {y^{(i)} | i ∈ S}, and therefore

T_{H,S}(x) := argmin_{y ∈ R^N_[S] : x+y ∈ Q} M_{H,S}(x; y)  (10)

can be computed efficiently for small |S|, as long as Q is simple (for example, affine). Denote the minimum of the cubic model by M*_{H,S}(x) := M_{H,S}(x; T_{H,S}(x)). The RBCN method performs the update x ← x + T_{H,S}(x), and is formalized as Algorithm 1.

Algorithm 1 RBCN: Randomized Block Cubic Newton
1: Parameters: sampling distribution Ŝ
2: Initialization: choose initial point x_0 ∈ Q
3: for k = 0, 1, 2, … do
4:   Sample S_k ∼ Ŝ
5:   Find H_k ∈ (0, 2H_F] such that F(x_k + T_{H_k,S_k}(x_k)) ≤ M*_{H_k,S_k}(x_k)
6:   Make the step x_{k+1} := x_k + T_{H_k,S_k}(x_k)
7: end for

6. Convergence Results

In this section we establish several convergence rates for Algorithm 1 under various structural assumptions: for the general class of convex problems, and for the more specific strongly convex case. We focus on the family of uniform samplings only, but generalizations to other sampling distributions are also possible.

6.1. Convex Loss

We start from the general situation where the term g(x) and all the φ_i(x^{(i)}) and ψ_i(x^{(i)}), i ∈ [n], are convex, but not necessarily strongly convex. Denote by D the maximum distance from an optimum point x* to the initial level set:

D := sup { ‖x − x*‖ : x ∈ Q, F(x) ≤ F(x_0) }.

Theorem 1. Let Assumptions 1, 2, 3, 4 hold. Let a solution x* ∈ Q of problem (1) exist, and assume the level sets are bounded: D < +∞.
Choose a required accuracy ε > 0 and a confidence level ρ ∈ (0, 1). Then after

K = O( (n/(τε)) max{ LD² + H_F D³, F(x_0) − F* } )  (11)

iterations of Algorithm 1, where L := λ_max(A), we have P(F(x_K) − F* ≤ ε) ≥ 1 − ρ.

This theoretical result provides a global sublinear rate of convergence, with iteration complexity of the order O(1/ε). Note that in the case φ(x) ≡ 0 we can put H_F = 0, and Theorem 1 then restates the well-known result on the convergence of composite gradient-type block-coordinate methods (see, for example, (Richtárik & Takáč, 2014)).

6.2. Strongly Convex Loss

Here we study the case when the matrix G from the convexity assumption (3) is strictly positive definite: G ≻ 0, which means that the objective F is strongly convex with constant µ := λ_min(G) > 0. Denote by β a condition number for the function g and the sampling distribution Ŝ: the maximum nonnegative real number such that

β · E_{S∼Ŝ}[A_[S]] ⪯ (τ/n) G.  (12)

If (12) holds for all nonnegative β we put by definition β := +∞. A simple lower bound exists: β ≥ µ/L > 0, where L = λ_max(A), as in Theorem 1. However, because (12) depends not only on g but also on the sampling distribution Ŝ, it is possible that β > 0 even if µ = 0 (for example, β = 1 if P(S = [n]) = 1 and A = G ≻ 0). The following theorems establish global iteration complexity guarantees of the order O(1/√ε) and O(log(1/ε)) for Algorithm 1 in the cases β > 0 and µ > 0, respectively, which is an improvement over the general O(1/ε).

Theorem 2. Let Assumptions 1, 2, 3, 4 hold. Let a solution x* ∈ Q of problem (1) exist, let the level sets be bounded: D < +∞, and assume that β, as defined by (12), is greater than zero. Choose a required accuracy ε > 0 and a confidence level ρ ∈ (0, 1). Then after

K = O( (n/τ) (1/σ) ( max{ H_F D³, F(x_0) − F* } / ε )^{1/2} )  (13)

iterations of Algorithm 1, where σ := min{β, 1} > 0, we have P(F(x_K) − F* ≤ ε) ≥ 1 − ρ.

Theorem 3. Let Assumptions 1, 2, 3, 4 hold. Let a solution x* ∈ Q of problem (1) exist, and assume that µ := λ_min(G)
is strictly positive. Then after

K = O( (n/τ) (1/σ) max{ (H_F D)/µ, 1 }^{1/2} log( (F(x_0) − F*)/ε ) )  (14)

iterations of Algorithm 1, we have P(F(x_K) − F* ≤ ε) ≥ 1 − ρ.

These complexity estimates show which parameters of the problem directly affect the convergence of the algorithm. Bound (13) improves the initial estimate (11) by the factor √(D_0/ε). The cost is the additional term σ^{-1} = (min{β, 1})^{-1}, which grows as the condition number β becomes smaller. The opposite, limiting case is when the quadratic part of the objective vanishes (g(x) ≡ 0, so σ = 1): Algorithm 1 then turns into a parallelized block-independent minimization of the objective components via cubically regularized Newton steps, and the complexity estimate coincides with the known result of (Nesterov & Polyak, 2006) in the nonrandomized (τ = n, ρ → 1) setting. Bound (14) guarantees a linear rate of convergence, which means a logarithmic dependence of the number of iterations on the required accuracy ε. The main complexity factor becomes a product of two terms: σ^{-1} · max{H_F D/µ, 1}^{1/2}. In the case φ(x) ≡ 0 we can put H_F = 0 and recover the stochastic Newton method from (Qu et al., 2016) with its global linear convergence guarantee. Despite the fact that a linear rate is asymptotically better than a sublinear one, and O(1/√ε) is asymptotically better than O(1/ε), we need to take into account all the factors which may slow down the algorithm. Thus, as µ = λ_min(G) → 0, estimate (13) becomes better than (14), just as (11) becomes better than (13) as β → 0.

6.3. Implementation Issues

Let us explain how one step of the method can be performed, which requires the minimization of the cubic model (10), possibly with some simple convex constraints. The first and classical approach was proposed in (Nesterov & Polyak, 2006) and, earlier for trust-region methods, in (Conn et al., 2000). It works in the unconstrained (Q ≡ R^N) and differentiable (ψ(x) ≡ 0) case. First we need to find a root of a special one-dimensional nonlinear equation (this can be done, for example, by simple Newton iterations).
After that, we just solve one linear system to produce a step of the method. The total complexity of solving the subproblem can then be estimated as O(d³) arithmetical operations, where d is the dimension of the subproblem, in our case d = |S|. Since some matrix factorization is used, the cost of the cubically regularized Newton step is actually similar in efficiency to the classical Newton one. See also (Gould et al., 2010) for a detailed analysis. For the case of affine constraints, the same procedure can be applied; an example of this technique is given by Lemma 3 in the next section. Another approach is based on finding an inexact solution of the subproblem by fast approximate eigenvalue computation (Agarwal et al., 2016) or by applying gradient descent (Carmon & Duchi, 2016). Both of these schemes provide global convergence guarantees. Additionally, they are Hessian-free: we only need a procedure for multiplying the quadratic part of (9) by an arbitrary vector, without storing the full matrix. The latter approach is the most universal one and can be extended to the composite case by using the proximal gradient method or its accelerated variant (Nesterov, 2013). There are basically two strategies for choosing the parameter H_k in every iteration: a constant choice, H_k := max_{i∈S_k} {H_i} or H_k := H_F, if the Lipschitz constants of the Hessians are known; or a simple adaptive procedure, which performs a truncated binary search and has a logarithmic cost per step of the method. An example of such a procedure can be found in the primal-dual Algorithm 2 in the next section.

6.4. Extension of the Problem Class

The randomized cubic model (9), which has been considered and analyzed above, arises naturally from the separable structure (2) and from our smoothness assumptions (4), (5). Let us discuss an interpretation of Algorithm 1 in terms of the general problem min_{x∈R^N} F(x) with twice-differentiable F (omitting the non-differentiable component for simplicity).
One can state and minimize the model

m_{H,S}(x; y) := F(x) + ⟨(∇F(x))_[S], y⟩ + ½⟨(∇²F(x))_[S] y, y⟩ + (H/6)‖y_[S]‖³,

which is a sketched version of the originally proposed cubic Newton method (Nesterov & Polyak, 2006). For alternative sketched variants of Newton-type methods, but without cubic regularization, see (Pilanci & Wainwright, 2015). The latter model m_{H,S}(x; y) is the same as M_{H,S}(x; y) when inequality (4) from the smoothness assumption for g holds with equality, i.e., when the function g is a quadratic with Hessian matrix ∇²g(x) ≡ A. Thus, we may use m_{H,S}(x; y) instead of M_{H,S}(x; y), which is still computationally cheap for small |S|. However, this model does not give any convergence guarantees for general F, to the best of our knowledge, unless S = [n]. But it can be a workable approach when the separable structure (2) is not available. Note also that Assumption 3 on the Lipschitz continuity of the Hessian is not too restrictive. A recent result (Grapiglia & Nesterov, 2017) shows that Newton-type methods with cubic regularization and a standard adaptive procedure for estimating H_F automatically fit the actual level of smoothness of the Hessian, without any additional changes in the algorithm. Moreover, the step (10) of the method, as the global minimum of M_{H,S}(x; y), is well-defined and computationally tractable even in nonconvex cases (Nesterov & Polyak, 2006). Thus we can try to apply the method to nonconvex objectives as well, but without known theoretical guarantees for S ≠ [n].

7. Empirical Risk Minimization

One of the most popular examples of an optimization problem in machine learning is the empirical risk minimization problem, which in many cases can be formulated as follows:

min_{w∈R^d} { (1/m) ∑_{i=1}^m φ_i(b_i^T w) + λ g(w) },  (15)

where the φ_i are convex loss functions, g is a regularizer, w is the vector of model weights, and m is the size of the dataset.

7.1.
Constrained Problem Reformulation

Let us consider the case when the dimension d of problem (15) is very large, with d ≫ m. This calls for a coordinate-randomization technique. Note that formulation (15) does not directly fit our problem setup (2), but we can easily transform it into the following constrained optimization problem by introducing new variables α_i := b_i^T w:

min_{w∈R^d, α∈R^m} { (1/m) ∑_{i=1}^m φ_i(α_i) + λ g(w) + ∑_{i=1}^m I{α_i = b_i^T w} }.  (16)

Following our framework, at every step we sample a random subset of coordinates S ⊆ [d] of the weights w, build the cubic model of the objective (assuming that φ_i and g satisfy (4), (5)):

M_{H,S}(w, α; y, h) := λ ⟨(∇g(w))_[S], y⟩ + (λ/2)⟨A_[S] y, y⟩ + (1/m) ∑_{i=1}^m [ φ'_i(α_i) h_i + ½ φ''_i(α_i) h_i² ] + (H/6)‖h‖³ + P(w),

and minimize it over y and h on the affine set

(y*, h*) := argmin_{y ∈ R^d_[S], h ∈ R^m : h = By} M_{H,S}(w, α; y, h),  (17)

where the rows of the matrix B ∈ R^{m×d} are the b_i^T. Then the updates of the variables are w⁺ := w + y* and α⁺ := α + h*. The following lemma addresses the issue of how to solve (17), which is required at every step of the method. Its proof can be found in the supplementary material.

Lemma 3. Denote by B̂ ∈ R^{m×|S|} the submatrix of B with column indices from S, by Â ∈ R^{|S|×|S|} the submatrix of A with elements whose indices are both from S, and by b_1 ∈ R^{|S|} the subvector of ∇g(w) with element indices from S. Denote the vector b_2 := (φ'_i(α_i))_{i=1}^m and b := mλ b_1 + B̂ᵀ b_2. Define the family of matrices Z(τ) : R_+ → R^{|S|×|S|},

Z(τ) := mλ Â + B̂ᵀ ( diag(φ''_i(α_i)) + (Hτ/2) I ) B̂.

Then the solution (y*, h*) of (17) can be found from the equations

Z(τ*) y*_S = −b,  h* = B̂ y*_S,

where τ* ≥ 0 satisfies the one-dimensional nonlinear equation τ* = ‖B̂ (Z(τ*))^{-1} b‖, and y*_S ∈ R^{|S|} is the subvector of the solution y* with element indices from S. Thus, after we find the root of a one-dimensional nonlinear equation, we need to solve an |S|×|S| linear system to compute y*. Then, to find h*, we do one matrix-vector multiplication with a matrix of size m×|S|.
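A small dense sketch of the solve described in Lemma 3 follows (sizes, data, and the use of bisection in place of a Newton root-find are illustrative assumptions); it finds the root τ* of the one-dimensional equation and then performs the single linear solve and matrix-vector product:

```python
import numpy as np

# Toy instance of the Lemma 3 solve (illustrative sizes and data).
rng = np.random.default_rng(3)
m, s = 6, 3                                # s plays the role of |S|
lam, H = 0.1, 1.0
B_hat = rng.standard_normal((m, s))        # \hat{B}: columns of B in S
A_hat = np.eye(s)                          # \hat{A} (PSD)
d2 = rng.uniform(0.5, 2.0, size=m)         # phi_i''(alpha_i) > 0
b = rng.standard_normal(s)                 # right-hand-side vector b

def Z(t):
    # Z(t) = m*lam*A_hat + B_hat^T (diag(phi'') + (H t / 2) I) B_hat
    return m * lam * A_hat + B_hat.T @ ((d2 + 0.5 * H * t)[:, None] * B_hat)

def psi(t):
    # A root of psi gives t* = ||B_hat Z(t*)^{-1} b||.
    return np.linalg.norm(B_hat @ np.linalg.solve(Z(t), b)) - t

lo, hi = 0.0, 1.0
while psi(hi) > 0:                         # bracket a sign change
    hi *= 2.0
for _ in range(200):                       # bisection on the bracket
    mid = 0.5 * (lo + hi)
    if psi(mid) > 0:
        lo = mid
    else:
        hi = mid
t_star = hi
y_S = np.linalg.solve(Z(t_star), -b)       # one |S| x |S| linear system
h = B_hat @ y_S                            # one m x |S| matrix-vector product
```

Bisection only needs a sign change at the endpoints: ψ(0) ≥ 0 and ψ(t) → −∞ as t grows, so a valid bracket always exists.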
The matrix B usually has a sparse structure when m is big, which should also be exploited in an efficient implementation.

7.2. Maximization of the Dual Problem

Another approach to solving optimization problem (15) is to maximize its Fenchel dual (Rockafellar, 1997):

max_{α∈R^m} D(α) := −(1/m) ∑_{i=1}^m φ*_i(−α_i) − λ g*( (1/(λm)) Bᵀα ),  (18)

where g* and {φ*_i} are the Fenchel conjugate functions of g and {φ_i} respectively, f*(s) := sup_x [⟨s, x⟩ − f(x)] for arbitrary f. It is known (Bertsekas, 1978) that if φ_i is twice-differentiable in a neighborhood of y and ∇²φ_i(y) ≻ 0 in this area, then its Fenchel conjugate φ*_i is also twice-differentiable in some neighborhood of s = ∇φ_i(y), and it holds that ∇²φ*_i(s) = (∇²φ_i(y))^{-1}. Then, in the case of differentiable g* and twice-differentiable φ*_i, i ∈ [m], we can apply our framework to (18) (minimizing −D), doing cubic steps in random subsets of the dual variables α ∈ R^m. The primal point w ∈ R^d corresponding to a particular α can be computed from the stationarity equation w = ∇g*( (1/(λm)) Bᵀα ), which holds for solutions of the primal (15) and dual (18) problems in the case of strong duality. Let us assume that the function g is 1-strongly convex (which is of course true for the ℓ2-regularizer ½‖w‖²₂). Then for G(α) := λ g*( (1/(λm)) Bᵀα ) a uniform bound for the Hessian exists: ∇²G(α) ⪯ (1/(λm²)) BᵀB. As before, we build the randomized cubic model and compute its minimizer (setting Q := ∩_{i=1}^m dom φ*_i):

M_{H,S}(α; h) := −D(α) + ⟨∇G(α), h_[S]⟩ + (1/(2λm²))‖B h_[S]‖² + (1/m) ∑_{i∈S} [ −⟨∇φ*_i(−α_i), h_i⟩ + ½⟨∇²φ*_i(−α_i) h_i, h_i⟩ ] + (H/6)‖h_[S]‖³,  S ⊆ [m],

T_{H,S}(α) := argmin_{h ∈ R^m_[S] : α+h ∈ Q} M_{H,S}(α; h),  M*_{H,S}(α) := M_{H,S}(α; T_{H,S}(α)).

Because in general we may not know the exact Lipschitz constants for the Hessians, we perform an adaptive search for estimating H. The resulting primal-dual scheme is presented in Algorithm 2. When a small subset of coordinates S is used, the most expensive operations become the computation of the objective D(α) at the current point and the matrix-vector product Bᵀα.
Both of these operations can be significantly optimized by storing already computed values in memory and updating only the changed information at every step.

Algorithm 2 Stochastic Dual Cubic Newton Ascent (SDCNA)
1: Parameters: sampling distribution Ŝ
2: Initialization: choose initial α_0 ∈ Q and H_0 > 0
3: for k = 0, 1, 2, … do
4:   Make a primal update w_k := ∇g*( (1/(λm)) Bᵀα_k )
5:   Sample S_k ∼ Ŝ
6:   While M*_{H_k,S_k}(α_k) > −D(α_k + T_{H_k,S_k}(α_k)) do
7:     H_k := 2 H_k
8:   Make a dual update α_{k+1} := α_k + T_{H_k,S_k}(α_k)
9:   Set H_{k+1} := H_k / 2
10: end for

8. Numerical Experiments

Synthetic. We consider the following synthetic regression task: min_{x∈R^N} ½‖Ax − b‖²₂ + (1/6)∑_i |x_i|³, with randomly generated parameters and for different N. On each problem of this type we run Algorithm 1 and evaluate the total computational time until reaching 10⁻¹² accuracy in the function residual. Using middle-sized blocks of coordinates at each step is the best choice in terms of total computational time, compared with small coordinate subsets and with the full-coordinate method. This agrees with the provided complexity estimates for the method: an increase of the batch size speeds up the convergence rate linearly, but increases the cost of one iteration cubically. Therefore, the optimal block size is at a medium level.

[Figure 1 panels: total computational time (s) vs. block size (10–8K); Synthetic (tolerance 1e-12, N = 4K, 8K, 12K, 16K), leukemia (tolerance 1e-6), duke breast-cancer (tolerance 1e-6), comparing block-coordinate Gradient Descent and block-coordinate Cubic Newton.]

Figure 1. Time it takes to solve a problem for different sampling block sizes. Left: synthetic problem. Center and right: logistic regression for real data.
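A toy full-coordinate (τ = n) run of cubic Newton steps on the synthetic objective above can be sketched as follows (sizes and data are illustrative assumptions; the subproblem is solved by the classical root-find-plus-linear-solve procedure of Section 6.3, with bisection standing in for Newton iterations):

```python
import numpy as np

# F(x) = 0.5*||Ax - b||^2 + (1/6) * sum_i |x_i|^3; psi = 0, Q = R^N.
rng = np.random.default_rng(4)
N = 20
A = rng.standard_normal((N, N)) / np.sqrt(N)
b = rng.standard_normal(N)
Q_mat = A.T @ A          # exact curvature of the quadratic part g
H = 1.0                  # Lipschitz constant of phi_i'' for phi_i = |t|^3/6

def F(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + np.sum(np.abs(x) ** 3) / 6.0

def grad_F(x):
    return A.T @ (A @ x - b) + 0.5 * x * np.abs(x)

def cubic_step(g, B, H, iters=100):
    """Minimize <g, y> + 0.5 <B y, y> + (H/6)||y||^3 (B PSD) exactly,
    via bisection on r = ||y|| in the stationarity condition."""
    def y_of(r):
        return np.linalg.solve(B + 0.5 * H * r * np.eye(len(g)), -g)
    lo, hi = 0.0, 1.0
    while np.linalg.norm(y_of(hi)) > hi:   # bracket the root of ||y(r)|| - r
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(y_of(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return y_of(hi)

x = np.zeros(N)
for _ in range(50):
    # Model Hessian: exact quadratic curvature plus Hessian of the cubic term.
    B = Q_mat + np.diag(np.abs(x))
    x = x + cubic_step(grad_F(x), B, H)
```

Since each step minimizes a global upper model of F, the objective decreases monotonically, and with an exact Hessian the iterates converge rapidly to the global minimizer of this convex problem.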
Logistic regression. In this experiment we train an ℓ2-regularized logistic regression model for a two-class classification task via its constrained reformulation (16), and compare Algorithm 1 with block-coordinate gradient descent on the datasets leukemia (m = 38, d = 7129) and duke breast-cancer (m = 44, d = 7129). We see that using coordinate blocks of size 25–50 for cubic Newton outperforms all other configurations of both methods in terms of total computational time. Increasing the block size further starts to significantly slow down the method because of the high cost of every iteration.

Poisson regression. In this experiment we train a Poisson model for a regression task with integer responses by the primal-dual Algorithm 2 and compare it with the SDCA (Shalev-Shwartz & Zhang, 2013) and SDNA (Qu et al., 2016) methods on synthetic (m = 1000, d = 200) and real (m = 319, d = 20) data. Our approach requires a smaller number of epochs to reach a given accuracy, while the computational cost of every step is the same as in the SDNA method.

[Figure 2 panels: duality gap vs. epoch for Cubic, SDNA and SDCA with τ = 8, 32, 256; right panel titled "Montreal bike lanes".]

Figure 2. Comparison of Algorithm 2 (marked as Cubic) with the SDNA and SDCA methods for minibatch sizes τ = 8, 32, 256, training Poisson regression. Left: synthetic. Right: real data.

9. Acknowledgments

The work of the first author was partly supported by Samsung Research, Samsung Electronics, the Russian Science Foundation Grant 17-11-01027 and KAUST.

References

Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., and Ma, T. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.

Bertsekas, D. P. Local convex conjugacy and Fenchel duality.
In Preprints of 7th Triennial World Congress of IFAC, Helsinki, Finland, volume 2, pp. 1079–1084, 1978.

Carmon, Y. and Duchi, J. C. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547, 2016.

Cartis, C. and Scheinberg, K. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, 169(2):337–375, 2018.

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011a.

Cartis, C., Gould, N. I., and Toint, P. L. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011b.

Conn, A. R., Gould, N. I., and Toint, P. L. Trust Region Methods. SIAM, 2000.

Ghadimi, S., Liu, H., and Zhang, T. Second-order methods with cubic regularization under inexact information. arXiv preprint arXiv:1710.05782, 2017.

Gould, N. I., Robinson, D. P., and Thorne, H. S. On solving trust-region and other regularised subproblems in optimization. Mathematical Programming Computation, 2(1):21–57, 2010.

Grapiglia, G. N. and Nesterov, Y. Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM Journal on Optimization, 27(1):478–506, 2017.

Griewank, A. The modification of Newton's method for unconstrained optimization by bounding cubic terms. Technical Report NA/12, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, 1981.

Kohler, J. M. and Lucchi, A. Sub-sampled cubic regularization for non-convex optimization. arXiv preprint arXiv:1705.05933, 2017.

Mutný, M. and Richtárik, P. Parallel stochastic Newton method. Journal of Computational Mathematics, 36(3), 2018.

Nesterov, Y.
Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, 2004.

Nesterov, Y. Modified Gauss-Newton scheme with worst case guarantees for global performance. Optimisation Methods and Software, 22(3):469–483, 2007.

Nesterov, Y. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159–181, 2008.

Nesterov, Y. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

Nesterov, Y. and Polyak, B. T. Cubic regularization of Newton's method and its global performance. Mathematical Programming, 108(1):177–205, 2006.

Pilanci, M. and Wainwright, M. J. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.

Qu, Z. and Richtárik, P. Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optimization Methods and Software, 31(5):858–884, 2016.

Qu, Z., Richtárik, P., Takáč, M., and Fercoq, O. SDNA: stochastic dual Newton ascent for empirical risk minimization. In International Conference on Machine Learning, pp. 1823–1832, 2016.

Richtárik, P. and Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

Richtárik, P. and Takáč, M. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.

Rockafellar, R. T. Convex Analysis. Princeton Landmarks in Mathematics, 1997.

Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

Tripuraneni, N., Stern, M., Jin, C., Regier, J., and Jordan, M. I. Stochastic cubic regularization for fast nonconvex optimization. arXiv preprint arXiv:1711.02838, 2017.

Tseng, P. and Yun, S. A coordinate gradient descent method for nonsmooth separable minimization.
Mathematical Programming, 117(1-2):387–423, 2009.