# Approximate Newton Methods and Their Local Convergence

Haishan Ye¹, Luo Luo¹, Zhihua Zhang²

¹Shanghai Jiao Tong University, Shanghai, China. ²Peking University & Beijing Institute of Big Data Research, Beijing, China. Correspondence to: Zhihua Zhang.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

Many machine learning models are reformulated as optimization problems, so solving large-scale optimization problems is central to big data applications. Recently, stochastic second order methods have attracted much attention because of their low per-iteration cost, which remedies the main weakness of the ordinary Newton method: a high cost per iteration despite a fast convergence rate. However, the convergence properties of these methods are still not well understood, and several important gaps remain between the current convergence theory and the performance observed in real applications. In this paper, we aim to fill these gaps. We propose a unifying framework for analyzing the local convergence properties of second order methods; based on this framework, our theoretical analysis matches the performance observed in real applications.

1. Introduction

Mathematical optimization is an important pillar of machine learning. We consider the following optimization problem:

$$\min_{x \in \mathbb{R}^d} F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$

where the $f_i(x)$ are smooth functions. Many machine learning models can be expressed as (1), where each $f_i$ is the loss with respect to (w.r.t.) the $i$-th training sample. Examples include logistic regression, smoothed support vector machines, neural networks, and graphical models.

Many optimization algorithms for solving (1) are based on the iteration

$$x^{(t+1)} = x^{(t)} - \eta_t\, Q_t\, g(x^{(t)}), \qquad t = 0, 1, 2, \ldots,$$

where $\eta_t > 0$ is the step length. If $Q_t$ is the identity matrix and $g(x^{(t)}) = \nabla F(x^{(t)})$, the resulting procedure is gradient descent (GD), which achieves sublinear convergence for a general smooth convex objective function and linear convergence for a smooth, strongly convex objective function. When $n$ is large, the full gradient method is inefficient because its per-iteration cost scales linearly in $n$. Consequently, stochastic gradient descent (SGD) has become a typical alternative (Robbins & Monro, 1951; Li et al., 2014; Cotter et al., 2011). To reduce the cost of each iteration, such methods construct an approximate gradient from a small mini-batch of data. However, the convergence rate can be significantly slower than that of full gradient methods (Nemirovski et al., 2009). Thus, a great deal of effort has been devoted to modifications that attain the convergence rate of the full gradient method while keeping the iteration cost low (Johnson & Zhang, 2013; Roux et al., 2012; Schmidt et al., 2013; Zhang et al., 2013).

If $Q_t$ is a $d \times d$ positive definite matrix containing curvature information, the iteration above yields second order methods. It is well known that second order methods enjoy superior convergence rates, both in theory and in practice, compared with first-order methods, which use only gradient information. The standard Newton method, with $Q_t = [\nabla^2 F(x^{(t)})]^{-1}$, $g(x^{(t)}) = \nabla F(x^{(t)})$, and $\eta_t = 1$, achieves a quadratic convergence rate for smooth, strongly convex objective functions.
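To make this iteration concrete, here is a minimal runnable sketch (Python/NumPy; the toy quadratic objective and all names are our own, for illustration only) contrasting the two extreme choices of $Q_t$: the identity matrix (gradient descent) and the inverse Hessian (Newton).

```python
import numpy as np

def iterate(grad, Q_of, x0, eta=1.0, steps=100):
    """Generic second order template: x <- x - eta * Q(x) @ grad(x)."""
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * Q_of(x) @ grad(x)
    return x

# Toy strongly convex quadratic F(x) = 0.5 * x^T A x - b^T x, grad F(x) = A x - b.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)            # positive definite Hessian
b = rng.standard_normal(5)
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)

gd = iterate(grad, lambda x: np.eye(5), np.zeros(5), eta=1.0 / np.linalg.norm(A, 2))
newton = iterate(grad, lambda x: np.linalg.inv(A), np.zeros(5), steps=1)
print(np.linalg.norm(gd - x_star), np.linalg.norm(newton - x_star))
```

On a quadratic, a single Newton step is exact, while gradient descent with step size $1/L$ only contracts the error linearly; this gap is what second order methods exploit.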
However, the Newton method costs $O(nd^2 + d^3)$ per iteration, which becomes prohibitively expensive when $n$ or $d$ is very large. One therefore seeks an approximation of the Hessian that makes the update computationally feasible while retaining sufficient second order information. One class of such methods is the quasi-Newton family, which generalizes the secant method for finding a root of the first derivative to multidimensional problems. The celebrated Broyden-Fletcher-Goldfarb-Shanno (BFGS) method and its limited-memory version (L-BFGS) are the most popular and widely used (Nocedal & Wright, 2006); they cost $O(nd + d^2)$ per iteration.

Recently, for the regime $n \gg d$, so-called subsampled Newton methods have been proposed, which define an approximate Hessian using a small subset of the samples. The most naive approach samples a subset of the functions $f_i$ randomly (Roosta-Khorasani & Mahoney, 2016; Byrd et al., 2011; Xu et al., 2016) to construct a subsampled Hessian. Erdogdu & Montanari (2015) proposed a regularized subsampled Newton method called NewSamp. When the Hessian can be written as $\nabla^2 F(x) = B(x)^T B(x)$, where $B(x)$ is an explicitly available $n \times d$ matrix, Pilanci & Wainwright (2015) used sketching techniques to approximate the Hessian, yielding the sketch Newton method. Similarly, Xu et al. (2016) proposed sampling rows of $B(x)$ with a non-uniform probability distribution. Agarwal et al. (2016) proposed an algorithm called LiSSA that approximates the inverse of the Hessian directly.

Although the convergence performance of stochastic second order methods has been analyzed, their convergence properties are still not well understood, and several important gaps lie between the convergence theory and real applications.

The first gap is the assumed necessity of Lipschitz continuity of the Hessian. In previous work, to achieve a linear-quadratic convergence rate, stochastic second order methods all assume that $\nabla^2 F(x)$ is Lipschitz continuous. In real applications, however, these methods may converge to the optimum even without this assumption. For example, Erdogdu & Montanari (2015) successfully used NewSamp to train a smoothed SVM, whose Hessian is not Lipschitz continuous.

The second gap concerns the sketch size of sketch Newton methods. To obtain linear convergence, the sketch size is $O(d\kappa^2)$ in (Pilanci & Wainwright, 2015), later improved to $O(d\kappa)$ in (Xu et al., 2016) using Gaussian sketching matrices, where $\kappa$ is the condition number of the Hessian matrix in question. However, sketch Newton empirically performs well even when the Hessian matrix is ill-conditioned: a sketch size of several times $d$, or at most several tens of times $d$, suffices for a linear convergence rate in unconstrained optimization. In contrast, the theoretical results of Pilanci & Wainwright (2015) and Xu et al. (2016) imply that the sketch size may exceed $n$ in ill-conditioned cases.

The third gap concerns the sample size in regularized subsampled Newton methods. The theoretical analyses in both (Erdogdu & Montanari, 2015) and (Roosta-Khorasani & Mahoney, 2016) require the sample size of a regularized subsampled Newton method to be the same as that of the conventional subsampled Newton method. In practice, however, adding a large regularizer clearly allows a much smaller sample size while preserving convergence, which contradicts the extant theoretical analysis (Erdogdu & Montanari, 2015; Roosta-Khorasani & Mahoney, 2016).
In this paper, we aim to fill these gaps between the current theory and empirical performance. More specifically, we first cast these second order methods into an algorithmic framework that we call approximate Newton. We then propose a general result for analyzing the local convergence properties of second order methods. Based on this framework, we give a detailed theoretical analysis that matches the empirical performance very well. We summarize our contributions as follows:

- We propose a unifying framework (Theorem 3) for analyzing the local convergence properties of second order methods, both stochastic and deterministic. Within this framework, the convergence performance of second order methods can be analyzed easily and systematically.
- We prove that Lipschitz continuity of the Hessian is not necessary for achieving linear and superlinear convergence in variants of subsampled Newton, but it is needed to obtain quadratic convergence. This explains why NewSamp (Erdogdu & Montanari, 2015) can be used to train a smoothed SVM, whose Hessian is not Lipschitz continuous. It also reveals why previous stochastic second order methods, such as subsampled Newton, sketch Newton, LiSSA, etc., all report a linear-quadratic convergence rate.
- We prove that the sketch size is independent of the condition number of the Hessian matrix, which explains why sketch Newton performs well even when the Hessian is ill-conditioned.
- We provide a theoretical guarantee that adding a regularizer is an effective way to reduce the sample size in subsampled Newton methods while preserving convergence. Our analysis also shows that adding a regularizer leads to poorer convergence behavior as the sample size decreases.

1.1. Organization

The remainder of the paper is organized as follows. In Section 2 we present notation and preliminaries. In Section 3 we present a unifying framework for the local convergence analysis of second order methods. In Section 4 we analyze the local convergence properties of sketch Newton methods and prove that the sketch size is independent of the condition number of the Hessian. In Section 5 we give the local convergence behavior of several variants of the subsampled Newton method; in particular, we reveal the relationship among the sample size, the regularizer, and the convergence rate. In Section 6 we derive the local convergence properties of inexact Newton methods from our framework. In Section 7 we validate our theoretical results experimentally. Finally, we conclude in Section 8. All proofs are given in the supplementary material.

2. Notation and Preliminaries

In this section, we introduce the notation and preliminaries used throughout the paper.

2.1. Notation

Given a matrix $A = [a_{ij}] \in \mathbb{R}^{m \times n}$ of rank $\ell$ and a positive integer $k \le \ell$, its condensed SVD is given as $A = U\Sigma V^T = U_k \Sigma_k V_k^T + U_{\setminus k}\Sigma_{\setminus k}V_{\setminus k}^T$, where $U_k$ and $U_{\setminus k}$ contain the left singular vectors of $A$, $V_k$ and $V_{\setminus k}$ contain the right singular vectors of $A$, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_\ell)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_\ell > 0$ holds the nonzero singular values of $A$. We use $\sigma_{\max}(A)$ to denote the largest singular value and $\sigma_{\min}(A)$ to denote the smallest nonzero singular value, so the condition number of $A$ is $\kappa(A) \triangleq \sigma_{\max}(A)/\sigma_{\min}(A)$. If $A$ is positive semidefinite, then $U = V$, and the square root of $A$ can be defined as $A^{1/2} = U\Sigma^{1/2}U^T$.
In that case it also holds that $\lambda_i(A) = \sigma_i(A)$, where $\lambda_i(A)$ is the $i$-th largest eigenvalue of $A$, $\lambda_{\max}(A) = \sigma_{\max}(A)$, and $\lambda_{\min}(A) = \sigma_{\min}(A)$. Additionally, $\|A\| = \sigma_1$ denotes the spectral norm. Given a positive definite matrix $M$, $\|x\|_M \triangleq \|M^{1/2}x\|$ is called the $M$-norm of $x$. Given square matrices $A$ and $B$ of the same size, we write $A \preceq B$ if $B - A$ is positive semidefinite.

2.2. Randomized Sketching Matrices

We first give the $\epsilon$-subspace embedding property, which will be used to sketch Hessian matrices, and then list the two most popular types of randomized sketching matrices.

Definition 1. $S \in \mathbb{R}^{s \times m}$ is said to be an $\epsilon$-subspace embedding matrix w.r.t. a fixed matrix $A \in \mathbb{R}^{m \times d}$, where $d < m$, if $\|SAx\|^2 = (1 \pm \epsilon)\|Ax\|^2$ (i.e., $(1-\epsilon)\|Ax\|^2 \le \|SAx\|^2 \le (1+\epsilon)\|Ax\|^2$) for all $x \in \mathbb{R}^d$.

From the definition of the $\epsilon$-subspace embedding matrix, we can directly derive the following property.

Lemma 2. $S \in \mathbb{R}^{s \times m}$ is an $\epsilon$-subspace embedding matrix w.r.t. the matrix $A \in \mathbb{R}^{m \times d}$ if and only if $(1-\epsilon)A^T A \preceq A^T S^T S A \preceq (1+\epsilon)A^T A$.

Leverage score sketching matrix. A leverage score sketching matrix $S = D\Omega \in \mathbb{R}^{s \times m}$ w.r.t. $A \in \mathbb{R}^{m \times d}$ is defined by sampling probabilities $p_i$, a sampling matrix $\Omega \in \mathbb{R}^{s \times m}$, and a diagonal rescaling matrix $D \in \mathbb{R}^{s \times s}$. Specifically, $S$ is constructed as follows: for every $j = 1, \ldots, s$, independently and with replacement, pick an index $i$ from the set $\{1, 2, \ldots, m\}$ with probability $p_i$, and set $\Omega_{ji} = 1$ and $\Omega_{jk} = 0$ for $k \ne i$, as well as $D_{jj} = 1/\sqrt{p_i s}$. The sampling probabilities $p_i$ are the leverage scores of $A$, defined as follows: let $V \in \mathbb{R}^{m \times d}$ be a column orthonormal basis of $A$, and let $v_{i,*}$ denote the $i$-th row of $V$; then $\ell_i \triangleq \|v_{i,*}\|^2/d$ for $i = 1, \ldots, m$ are the leverage scores of $A$. To achieve an $\epsilon$-subspace embedding property w.r.t. $A$, $s = O(d\log d/\epsilon^2)$ is sufficient.

Sparse embedding matrix. A sparse embedding matrix $S \in \mathbb{R}^{s \times m}$ has exactly one nonzero entry per column, uniformly sampled from $\{1, -1\}$ (Clarkson & Woodruff, 2013). Hence it is very efficient to compute $SA$, especially when $A$ is sparse. To achieve an $\epsilon$-subspace embedding property w.r.t. $A \in \mathbb{R}^{m \times d}$, $s = O(d^2/\epsilon^2)$ is sufficient (Meng & Mahoney, 2013; Woodruff, 2014).

Other sketching matrices, such as the Gaussian random projection and the subsampled randomized Hadamard transform, as well as their properties, can be found in the survey (Woodruff, 2014).
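As a quick illustration of Definition 1 and Lemma 2, the following sketch (Python/NumPy; the helper name and problem sizes are our own choices) builds a sparse embedding matrix and measures its empirical distortion on a random $A$: by Lemma 2, the generalized eigenvalues of $A^T S^T S A$ with respect to $A^T A$ should lie in $[1-\epsilon, 1+\epsilon]$.

```python
import numpy as np

def sparse_embedding(s, m, rng):
    """Sparse embedding matrix: one nonzero per column, placed in a
    uniformly random row with a uniformly random sign in {-1, +1}."""
    S = np.zeros((s, m))
    rows = rng.integers(0, s, size=m)
    signs = rng.choice([-1.0, 1.0], size=m)
    S[rows, np.arange(m)] = signs
    return S

rng = np.random.default_rng(0)
m, d, s = 5000, 10, 500
A = rng.standard_normal((m, d))
S = sparse_embedding(s, m, rng)

# Distortion check: whiten A^T S^T S A by A^T A and inspect the spectrum.
L = np.linalg.cholesky(A.T @ A)
Linv = np.linalg.inv(L)
ratios = np.linalg.eigvalsh(Linv @ (S @ A).T @ (S @ A) @ Linv.T)
print(ratios.min(), ratios.max())  # should lie within [1 - eps, 1 + eps]
```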
2.3. Assumptions and Notions

In this paper, we focus on the problem described in Eqn. (1). Moreover, we make the following two assumptions.

Assumption 1. The objective function $F$ is $\mu$-strongly convex, that is, $F(y) \ge F(x) + [\nabla F(x)]^T(y - x) + \frac{\mu}{2}\|y - x\|^2$ for some $\mu > 0$.

Assumption 2. $\nabla F(x)$ is $L$-Lipschitz continuous, that is, $\|\nabla F(x) - \nabla F(y)\| \le L\|x - y\|$ for some $L > 0$.

Assumptions 1 and 2 imply that $\mu I \preceq \nabla^2 F(x) \preceq LI$ for any $x \in \mathbb{R}^d$, where $I$ is the identity matrix of appropriate size. With a slight abuse of notation, we define $\kappa \triangleq L/\mu$; in fact, $\kappa$ is an upper bound on the condition number of the Hessian $\nabla^2 F(x)$ for any $x$. Besides, if $\nabla^2 F(x)$ is Lipschitz continuous, then $\|\nabla^2 F(x) - \nabla^2 F(y)\| \le \hat L\|x - y\|$, where $\hat L > 0$ is the Lipschitz constant of $\nabla^2 F(x)$.

Throughout this paper, we use the notions of linear, superlinear, and quadratic convergence rates, defined w.r.t. $\|\cdot\|_M$, where $M = [\nabla^2 F(x^\star)]^{-1}$ and $x^\star$ is the optimal solution to Problem (1). A sequence of vectors $\{x^{(t)}\}$ is said to converge linearly to a limit point $x^\star$ if, for some $0 < \rho < 1$,

$$\limsup_{t\to\infty} \frac{\|\nabla F(x^{(t+1)})\|_M}{\|\nabla F(x^{(t)})\|_M} = \rho.$$

Similarly, superlinear and quadratic convergence are respectively defined as

$$\limsup_{t\to\infty} \frac{\|\nabla F(x^{(t+1)})\|_M}{\|\nabla F(x^{(t)})\|_M} = 0, \qquad \limsup_{t\to\infty} \frac{\|\nabla F(x^{(t+1)})\|_M}{\|\nabla F(x^{(t)})\|_M^2} = \rho.$$

We call the convergence rate linear-quadratic if the following condition holds:

$$\|\nabla F(x^{(t+1)})\|_M \le \rho_1\|\nabla F(x^{(t)})\|_M + \rho_2\|\nabla F(x^{(t)})\|_M^2,$$

where $0 < \rho_1 < 1$.

3. Approximate Newton Methods and Local Convergence Analysis

The existing variants of stochastic second order methods share some important attributes. First, methods such as NewSamp (Erdogdu & Montanari, 2015), LiSSA (Agarwal et al., 2016), subsampled Newton with conjugate gradient (Byrd et al., 2011), and subsampled Newton with non-uniform sampling (Xu et al., 2016) all have the same convergence property: a linear-quadratic convergence rate. Second, they follow the same algorithmic procedure. In each iteration, they first construct an approximate Hessian $H^{(t)}$ such that

$$(1-\epsilon_0)H^{(t)} \preceq \nabla^2 F(x^{(t)}) \preceq (1+\epsilon_0)H^{(t)}, \qquad (2)$$

where $0 \le \epsilon_0 < 1$. Then they solve the optimization problem

$$\min_p\ \frac{1}{2}p^T H^{(t)} p - p^T \nabla F(x^{(t)}) \qquad (3)$$

approximately or exactly to obtain the direction vector $p^{(t)}$. Finally, the update is $x^{(t+1)} = x^{(t)} - p^{(t)}$. In view of this shared procedure, we refer to these stochastic second order methods as approximate Newton methods.

In the following theorem, we propose a unifying framework that describes the convergence properties of the second order optimization procedure depicted above.

Theorem 3. Let Assumptions 1 and 2 hold. Suppose that $\nabla^2 F(x)$ exists and is continuous in a neighborhood of a minimizer $x^\star$. Let $H^{(t)}$ be a positive definite matrix satisfying Eqn. (2) with $0 \le \epsilon_0 < 1$, and let $p^{(t)}$ be an approximate solution of Problem (3) such that

$$\|\nabla F(x^{(t)}) - H^{(t)} p^{(t)}\| \le \frac{\epsilon_1}{\sqrt{\kappa}}\,\|\nabla F(x^{(t)})\|, \qquad (4)$$

where $0 < \epsilon_1 < 1$. Define the iteration $x^{(t+1)} = x^{(t)} - p^{(t)}$.

(a) There exist a sufficiently small value $\gamma$, $0 < \nu^{(t)} < 1$, and $0 < \eta^{(t)} < 1$ such that when $\|x^{(t)} - x^\star\| \le \gamma$, we have

$$\|\nabla F(x^{(t+1)})\|_M \le \left(\frac{\epsilon_0 + \epsilon_1}{1-\epsilon_0} + 2\eta^{(t)}\right)\frac{1+\nu^{(t)}}{1-\nu^{(t)}}\,\|\nabla F(x^{(t)})\|_M. \qquad (5)$$

Moreover, $\nu^{(t)}$ and $\eta^{(t)}$ go to $0$ as $x^{(t)}$ goes to $x^\star$.

(b) Furthermore, if $\nabla^2 F(x)$ is Lipschitz continuous with parameter $\hat L$ and $x^{(t)}$ satisfies

$$\frac{\hat L\,\kappa}{\mu}\,\|x^{(t)} - x^\star\| \le \nu^{(t)}, \qquad (6)$$

where $0 < \nu^{(t)} < 1$, then it holds that

$$\|\nabla F(x^{(t+1)})\|_M \le \frac{\epsilon_0+\epsilon_1}{1-\epsilon_0}\cdot\frac{1+\nu^{(t)}}{1-\nu^{(t)}}\,\|\nabla F(x^{(t)})\|_M + \frac{2}{(1-\epsilon_0)^2}\cdot\frac{\hat L \kappa}{\mu\sqrt{\mu}}\cdot\frac{(1+\nu^{(t)})^2}{1-\nu^{(t)}}\,\|\nabla F(x^{(t)})\|_M^2. \qquad (7)$$

Theorem 3 yields several important insights. First, it provides sufficient conditions for obtaining different convergence rates, including superlinear and quadratic rates. If $\frac{\epsilon_0+\epsilon_1}{1-\epsilon_0}$ is a constant, the sequence $\{x^{(t)}\}$ converges linearly, because $\nu^{(t)}$ and $\eta^{(t)}$ go to $0$ as $t$ goes to infinity. If we set $\epsilon_0 = \epsilon_0(t)$ and $\epsilon_1 = \epsilon_1(t)$ such that $\epsilon_0(t)$ and $\epsilon_1(t)$ decrease to $0$ as $t$ increases, then the sequence $\{x^{(t)}\}$ converges superlinearly. Similarly, if $\epsilon_0(t) = O(\|\nabla F(x^{(t)})\|_M)$, $\epsilon_1(t) = O(\|\nabla F(x^{(t)})\|_M)$, and $\nabla^2 F(x)$ is Lipschitz continuous, then the sequence converges quadratically.

Second, Theorem 3 makes it clear that Lipschitz continuity of $\nabla^2 F(x)$ is not necessary for linear or superlinear convergence of stochastic second order methods, including subsampled Newton, sketch Newton, NewSamp, etc. This reveals why NewSamp can be used to train a smoothed SVM, whose Hessian is not Lipschitz continuous. The Lipschitz continuity condition is needed only for quadratic or linear-quadratic convergence. This also explains why LiSSA (Agarwal et al., 2016), NewSamp (Erdogdu & Montanari, 2015), subsampled Newton with non-uniform sampling (Xu et al., 2016), and sketch Newton (Pilanci & Wainwright, 2015) all report a linear-quadratic convergence rate: they all assume the Hessian is Lipschitz continuous. In fact, it is well known that Lipschitz continuity of $\nabla^2 F(x)$ is not necessary for inexact Newton methods to achieve a linear or superlinear convergence rate.

Third, the unifying framework of Theorem 3 covers not only stochastic second order methods but also their deterministic counterparts. For example, letting $H^{(t)} = \nabla^2 F(x^{(t)})$ and computing $p^{(t)}$ with conjugate gradient yields the well-known Newton-CG method. In fact, different choices of $H^{(t)}$ and different ways of computing $p^{(t)}$ lead to different second order methods. In the following sections, we use this framework to analyze the local convergence performance of these methods in detail.
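The framework is compact enough to state in code. The following skeleton (Python/NumPy; our own sketch, with an exact inner solve so that $\epsilon_1 = 0$) isolates the two degrees of freedom that distinguish the methods analyzed below: how $H^{(t)}$ is constructed and how accurately Problem (3) is solved.

```python
import numpy as np

def approximate_newton(grad, approx_hessian, x0, steps=20):
    """Approximate Newton template: build H(t) satisfying Eqn. (2),
    obtain the direction p(t) from Problem (3), and update x."""
    x = x0.copy()
    for _ in range(steps):
        g = grad(x)
        H = approx_hessian(x)       # any construction satisfying Eqn. (2)
        p = np.linalg.solve(H, g)   # exact minimizer of (3); an inexact
                                    # solver only needs to satisfy Eqn. (4)
        x = x - p
    return x
```

Each method in the following sections simply supplies a different `approx_hessian`.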
4. Sketch Newton Method

In this section, we use Theorem 3 to analyze the local convergence properties of sketch Newton (Algorithm 1). We focus mainly on the case where the Hessian matrix has the form

$$\nabla^2 F(x) = B(x)^T B(x), \qquad (8)$$

where $B(x)$ is an explicitly available $n \times d$ matrix. Our result extends easily to the case $\nabla^2 F(x) = B(x)^T B(x) + Q(x)$, where $Q(x)$ is a positive semidefinite matrix related to the Hessian of a regularizer.

Algorithm 1 Sketch Newton.
1: Input: $x^{(0)}$, $0 < \delta < 1$, $0 < \epsilon_0 < 1$;
2: for $t = 0, 1, \ldots$ until termination do
3:   Construct an $\epsilon_0$-subspace embedding matrix $S$ for $B(x^{(t)})$, where $\nabla^2 F(x^{(t)}) = B(x^{(t)})^T B(x^{(t)})$, and calculate $H^{(t)} = B(x^{(t)})^T S^T S\, B(x^{(t)})$;
4:   Calculate $p^{(t)} \approx \mathrm{argmin}_p\ \frac{1}{2}p^T H^{(t)} p - p^T\nabla F(x^{(t)})$;
5:   Update $x^{(t+1)} = x^{(t)} - p^{(t)}$;
6: end for

Theorem 4. Let $F(x)$ satisfy the conditions described in Theorem 3, and assume the Hessian matrix is given by Eqn. (8). Let $0 < \delta < 1$, $0 < \epsilon_0 < 1/2$, and $0 \le \epsilon_1 < 1$ be given. Let $S \in \mathbb{R}^{\ell \times n}$ be an $\epsilon_0$-subspace embedding matrix w.r.t. $B(x)$ with probability at least $1-\delta$, and let the direction vector $p^{(t)}$ satisfy Eqn. (4). Then Algorithm 1 has the following convergence properties:
(a) There exist a sufficiently small value $\gamma$, $0 < \nu^{(t)} < 1$, and $0 < \eta^{(t)} < 1$ such that when $\|x^{(t)} - x^\star\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1-\delta$.
(b) If $\nabla^2 F(x)$ is also Lipschitz continuous and $\{x^{(t)}\}$ satisfies Eqn. (6), then each iteration satisfies Eqn. (7) with probability at least $1-\delta$.

Table 1. Comparison with previous work.

| Reference | Sketch size |
| --- | --- |
| Pilanci & Wainwright (2015) | $O(d\kappa^2 \log d)$ |
| Xu et al. (2016) | $O(d\kappa \log d)$ |
| Our result (Theorem 4) | $O(d \log d)$ |

Theorem 4 directly provides a bound on the sketch size. Taking the leverage score sketching matrix as an example, a sketch size of $\ell = O(d\log d/\epsilon_0^2)$ is sufficient. We compare our theoretical bound with those of Pilanci & Wainwright (2015) and Xu et al. (2016) in Table 1. As the table shows, our sketch size is much smaller than the other two, especially when the Hessian matrix is ill-conditioned.

Theorem 4 shows that the sketch size $\ell$ is independent of the condition number of the Hessian matrix $\nabla^2 F(x)$, as reflected in Table 1. This explains the phenomenon that, even when the Hessian is ill-conditioned, sketch Newton performs well with a sketch size of only several times $d$. For a large condition number, the theoretical bounds of both Xu et al. (2016) and Pilanci & Wainwright (2015) may exceed the number of samples $n$.
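As a concrete instance, here is a minimal sketch (Python/NumPy; our own) of Algorithm 1 for the least squares objective $F(x) = \frac{1}{2}\|Ax - b\|^2$, where $B(x) = A$ is constant; a Gaussian sketch is used for simplicity, the gradient is kept exact, and only the Hessian is sketched.

```python
import numpy as np

def sketch_newton_lstsq(A, b, ell, steps=30, seed=0):
    """Algorithm 1 on F(x) = 0.5 * ||Ax - b||^2, with B(x) = A and
    H(t) = (S A)^T (S A) for a fresh Gaussian sketch S of ell rows."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        S = rng.standard_normal((ell, n)) / np.sqrt(ell)
        SA = S @ A
        H = SA.T @ SA                   # sketched Hessian, d x d
        g = A.T @ (A @ x - b)           # exact gradient
        x = x - np.linalg.solve(H, g)   # p(t) solves Problem (3) exactly
    return x

# Example call with a sketch size of a few times d:
# x = sketch_newton_lstsq(A, b, ell=5 * A.shape[1])
```

With a Gaussian sketch, forming $SA$ costs $O(\ell n d)$; in practice one would use, e.g., the sparse embedding of Section 2.2, for which $SA$ can be formed in time roughly proportional to the number of nonzeros of $A$, which is where the computational savings come from.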
Note that the theoretical results of (Xu et al., 2016) and (Pilanci & Wainwright, 2015) still hold for constrained optimization problems, whereas our result establishes the effectiveness of sketch Newton for unconstrained optimization in the ill-conditioned case.

5. The Subsampled Newton Method and Variants

In this section, we apply Theorem 3 to analyze the subsampled Newton method and regularized subsampled Newton methods. We first assume that each $f_i(x)$ and $F(x)$ have the following properties:

$$\max_{1\le i\le n}\ \|\nabla^2 f_i(x)\| \le K < \infty, \qquad (9)$$

$$\lambda_{\min}(\nabla^2 F(x)) \ge \sigma > 0. \qquad (10)$$

Accordingly, if $\nabla^2 F(x)$ is ill-conditioned, the ratio $K/\sigma$ is large.

5.1. The Subsampled Newton Method

The subsampled Newton method is depicted in Algorithm 2, and its local convergence properties are given in the following theorem.

Algorithm 2 Subsampled Newton.
1: Input: $x^{(0)}$, $0 < \delta < 1$, $0 < \epsilon_0 < 1$;
2: Set the sample size $|S| \ge \frac{16K^2\log(2d/\delta)}{\sigma^2\epsilon_0^2}$;
3: for $t = 0, 1, \ldots$ until termination do
4:   Select a sample set $S$ of size $|S|$ and construct $H^{(t)} = \frac{1}{|S|}\sum_{j\in S}\nabla^2 f_j(x^{(t)})$;
5:   Calculate $p^{(t)} \approx \mathrm{argmin}_p\ \frac{1}{2}p^T H^{(t)} p - p^T\nabla F(x^{(t)})$;
6:   Update $x^{(t+1)} = x^{(t)} - p^{(t)}$;
7: end for

Theorem 5. Let $F(x)$ satisfy the properties described in Theorem 3. Assume Eqns. (9) and (10) hold, and let $0 < \delta < 1$, $0 < \epsilon_0 < 1/2$, and $0 \le \epsilon_1 < 1$ be given. Let $|S|$ and $H^{(t)}$ be set as in Algorithm 2, and let the direction vector $p^{(t)}$ satisfy Eqn. (4). Then Algorithm 2 has the following convergence properties:
(a) There exist a sufficiently small value $\gamma$, $0 < \nu^{(t)} < 1$, and $0 < \eta^{(t)} < 1$ such that when $\|x^{(t)} - x^\star\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1-\delta$.
(b) If $\nabla^2 F(x)$ is also Lipschitz continuous and $\{x^{(t)}\}$ satisfies Eqn. (6), then each iteration satisfies Eqn. (7) with probability at least $1-\delta$.

As we can see, Algorithm 2 has almost the same convergence properties as Algorithm 1, up to several minor differences. The main difference lies in how $H^{(t)}$, which must satisfy Eqn. (2), is constructed: Algorithm 2 relies on the assumption that each $\nabla^2 f_i(x)$ is upper bounded (i.e., Eqn. (9) holds), whereas Algorithm 1 builds on the Hessian having the form of Eqn. (8).
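The subsampled Hessian in step 4 of Algorithm 2 is straightforward to form. Below is a minimal sketch (Python/NumPy; `hess_i` is a hypothetical callback returning $\nabla^2 f_j(x)$, and the uniform sampling with replacement mirrors the naive scheme).

```python
import numpy as np

def subsampled_hessian(hess_i, n, x, sample_size, rng):
    """H(t) = (1/|S|) * sum_{j in S} hess_i(j, x), where the index set S
    is drawn uniformly at random with replacement from {0, ..., n-1}."""
    idx = rng.integers(0, n, size=sample_size)
    return sum(hess_i(j, x) for j in idx) / sample_size
```

Choosing `sample_size` of order $16K^2\log(2d/\delta)/(\sigma^2\epsilon_0^2)$, as in Algorithm 2, makes the resulting $H^{(t)}$ satisfy Eqn. (2) with probability at least $1-\delta$.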
5.2. Regularized Subsampled Newton

In ill-conditioned cases (i.e., when $K/\sigma$ is large), the subsampled Newton method of Algorithm 2 must take many samples, because the sample size $|S|$ depends quadratically on $K/\sigma$. To overcome this problem, one resorts to a regularized subsampled Newton method, whose key idea is to add $\alpha I$ to the original subsampled Hessian, as described in Algorithm 3. Erdogdu & Montanari (2015) proposed NewSamp, another regularized subsampled Newton method, depicted in Algorithm 4. In the following analysis, we prove that adding a regularizer is an effective way to reduce the sample size while preserving convergence. We first give the theoretical analysis of the local convergence properties of Algorithm 3.

Algorithm 3 Regularized Subsampled Newton.
1: Input: $x^{(0)}$, $0 < \delta < 1$, regularization parameter $\alpha$, sample size $|S|$;
2: for $t = 0, 1, \ldots$ until termination do
3:   Select a sample set $S$ of size $|S|$ and construct $H^{(t)} = \frac{1}{|S|}\sum_{j\in S}\nabla^2 f_j(x^{(t)}) + \alpha I$;
4:   Calculate $p^{(t)} \approx \mathrm{argmin}_p\ \frac{1}{2}p^T H^{(t)} p - p^T\nabla F(x^{(t)})$;
5:   Update $x^{(t+1)} = x^{(t)} - p^{(t)}$;
6: end for

Theorem 6. Let $F(x)$ satisfy the properties described in Theorem 3. Assume Eqns. (9) and (10) hold, and let $0 < \delta < 1$, $0 \le \epsilon_1 < 1$, and $\alpha > 0$ be given. Let $\beta$ be a constant such that $0 < \beta < \alpha + \frac{\sigma}{2}$, let the sample size satisfy $|S| \ge \frac{16K^2\log(2d/\delta)}{\beta^2}$, and let $H^{(t)}$ be constructed as in Algorithm 3. Define

$$\epsilon_0 = \max\left\{\frac{\beta-\alpha}{\sigma+\alpha-\beta},\ \frac{\alpha+\beta}{\sigma+\alpha+\beta}\right\}, \qquad (11)$$

which implies $0 < \epsilon_0 < 1$. Moreover, let the direction vector $p^{(t)}$ satisfy Eqn. (4). Then Algorithm 3 has the following convergence properties:
(a) There exist a sufficiently small value $\gamma$, $0 < \nu^{(t)} < 1$, and $0 < \eta^{(t)} < 1$ such that when $\|x^{(t)} - x^\star\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1-\delta$.
(b) If $\nabla^2 F(x)$ is also Lipschitz continuous and $\{x^{(t)}\}$ satisfies Eqn. (6), then each iteration satisfies Eqn. (7) with probability at least $1-\delta$.

In Theorem 6, the parameter $\epsilon_0$ chiefly determines the convergence properties of Algorithm 3. It is governed by the two terms in Eqn. (11), which together depict the relationship among the sample size, the regularizer $\alpha I$, and the convergence rate. The first term describes the relationship between the regularizer $\alpha I$ and the sample size. Without loss of generality, we may set $\beta = \alpha$, which satisfies $0 < \beta < \alpha + \sigma/2$; then the sample size $|S| = \frac{16K^2\log(2d/\delta)}{\alpha^2}$ decreases as $\alpha$ increases. Hence Theorem 6 provides a theoretical guarantee that adding the regularizer $\alpha I$ is an effective way to reduce the sample size when $K/\sigma$ is large. Conversely, if we wish to sample only a small portion of the $f_i$'s, we should choose a large $\alpha$; otherwise $\beta$ approaches $\alpha + \sigma/2$, which means $\epsilon_0 = 1$, i.e., the sequence $\{x^{(t)}\}$ does not converge.

Although a large $\alpha$ reduces the sample size, it comes at the expense of a slower convergence rate, as the second term shows: $\frac{\alpha+\beta}{\sigma+\alpha+\beta}$ goes to $1$ as $\alpha$ increases. Moreover, $\epsilon_1$ must then also decrease; otherwise $\frac{\epsilon_0+\epsilon_1}{1-\epsilon_0}$ may exceed $1$, and Algorithm 3 will not converge. In fact, the slower convergence rate incurred by adding a regularizer arises because the sample size becomes small, so less curvature information is captured. On the other hand, a small sample size means a low computational cost per iteration. Therefore, a proper regularizer, balancing per-iteration cost against convergence rate, is the key to the regularized subsampled Newton algorithm.
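For concreteness, here are minimal sketches (Python/NumPy; our own) of the two regularized constructions: Algorithm 3 adds $\alpha I$ to a subsampled Hessian $H_S$, while NewSamp (Algorithm 4, analyzed next) instead floors the spectrum of $H_S$ at its $(r+1)$-th largest eigenvalue.

```python
import numpy as np

def regularized_hessian(H_S, alpha):
    """Algorithm 3: subsampled Hessian plus the regularizer alpha * I."""
    return H_S + alpha * np.eye(H_S.shape[0])

def newsamp_hessian(H_S, r):
    """NewSamp: keep the top-r eigenpairs of the subsampled Hessian H_S
    and lift every remaining eigenvalue up to the (r+1)-th largest one."""
    lam, U = np.linalg.eigh(H_S)        # eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]      # reorder to descending
    floored = np.maximum(lam, lam[r])   # lam[r] is the (r+1)-th largest
    return (U * floored) @ U.T          # U @ diag(floored) @ U.T
```

Both constructions inflate the bottom of the spectrum, trading curvature fidelity for a smaller admissible sample size; this is exactly the trade-off quantified by the two terms of Eqn. (11).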
Next, we give the theoretical analysis of the local convergence properties of NewSamp (Algorithm 4).

Algorithm 4 NewSamp.
1: Input: $x^{(0)}$, $0 < \delta < 1$, target rank $r$, sample size $|S|$;
2: for $t = 0, 1, \ldots$ until termination do
3:   Select a sample set $S$ of size $|S|$ and compute $H^{(t)}_{|S|} = \frac{1}{|S|}\sum_{j\in S}\nabla^2 f_j(x^{(t)})$;
4:   Compute the rank-$(r+1)$ truncated SVD decomposition of $H^{(t)}_{|S|}$ to get $U_{r+1}$ and $\hat\Lambda_{r+1}$, and construct $H^{(t)} = H^{(t)}_{|S|} + U_{\setminus r}(\hat\lambda^{(t)}_{r+1}I - \hat\Lambda_{\setminus r})U_{\setminus r}^T$;
5:   Calculate $p^{(t)} \approx \mathrm{argmin}_p\ \frac{1}{2}p^T H^{(t)} p - p^T\nabla F(x^{(t)})$;
6:   Update $x^{(t+1)} = x^{(t)} - p^{(t)}$;
7: end for

Theorem 7. Let $F(x)$ satisfy the properties described in Theorem 3. Assume Eqns. (9) and (10) hold, and let $0 < \delta < 1$ and a target rank $r$ be given. Let $\beta$ be a constant such that $0 < \beta < \frac{\lambda^{(t)}_{r+1}}{2}$, where $\lambda^{(t)}_{r+1}$ is the $(r+1)$-th largest eigenvalue of $\nabla^2 F(x^{(t)})$. Set the sample size such that $|S| \ge \frac{16K^2\log(2d/\delta)}{\beta^2}$, and define

$$\epsilon_0 = \max\left\{\frac{\beta}{\lambda^{(t)}_{r+1}-\beta},\ \frac{2\beta+\lambda^{(t)}_{r+1}}{\sigma+2\beta+\lambda^{(t)}_{r+1}}\right\}, \qquad (12)$$

which implies $0 < \epsilon_0 < 1$. Assume the direction vector $p^{(t)}$ satisfies Eqn. (4). Then Algorithm 4 has the following convergence properties:
(a) There exist a sufficiently small value $\gamma$, $0 < \nu^{(t)} < 1$, and $0 < \eta^{(t)} < 1$ such that when $\|x^{(t)} - x^\star\| \le \gamma$, each iteration satisfies Eqn. (5) with probability at least $1-\delta$.
(b) If $\nabla^2 F(x)$ is also Lipschitz continuous and $\{x^{(t)}\}$ satisfies Eqn. (6), then each iteration satisfies Eqn. (7) with probability at least $1-\delta$.

Similar to Theorem 6, the parameter $\epsilon_0$ in NewSamp is determined by two terms. The first term reveals the relationship between the target rank $r$ and the sample size. Without loss of generality, we may set $\beta = \lambda^{(t)}_{r+1}/4$; then the sample size is linear in $1/[\lambda^{(t)}_{r+1}]^2$, so a small $r$ means that a small sample size is sufficient. Conversely, if we wish to sample only a small portion of the $f_i$'s, we should choose a small $r$; otherwise $\beta$ approaches $\lambda^{(t)}_{r+1}/2$, which means $\epsilon_0 = 1$, i.e., the sequence $\{x^{(t)}\}$ does not converge. The second term shows that a small sample size leads to a poor convergence rate: if we set $r = 0$ and $\beta = \lambda_1/2$, then $\epsilon_0$ becomes $1 - \frac{1}{1+2\lambda_1/\sigma}$, so the convergence rate of NewSamp is almost the same as that of gradient descent. Similar to Algorithm 3, a small $r$ also calls for a precise solution to Problem (3) and an initial point $x^{(0)}$ close to the optimal point $x^\star$. It is worth pointing out that Theorem 7 explains the empirical observation that NewSamp is applicable to training an SVM, where the Lipschitz continuity condition on $\nabla^2 F(x)$ is not satisfied (Erdogdu & Montanari, 2015).

We now compare Theorem 6 with Theorem 7, focusing on the parameter $\epsilon_0$, which chiefly determines the convergence properties of Algorithms 3 and 4. Specifically, if we set $\alpha = \beta + \lambda^{(t)}_{r+1}$ in Eqn. (11), then $\epsilon_0 = \frac{2\beta+\lambda^{(t)}_{r+1}}{\sigma+2\beta+\lambda^{(t)}_{r+1}}$, which equals the second term on the right-hand side of Eqn. (12). Hence we can regard NewSamp as a special case of Algorithm 3. NewSamp does provide a way to choose $\alpha$ automatically, but recall that it introduces another parameter, the target rank $r$; thus NewSamp and Algorithm 3 have the same number of free parameters, and NewSamp still performs poorly if $r$ is not chosen properly. Therefore, Algorithm 3 is theoretically preferable, since NewSamp incurs the extra cost of performing SVDs.

6. Inexact Newton Methods

Let $H^{(t)} = \nabla^2 F(x^{(t)})$, that is, $\epsilon_0 = 0$. Then Theorem 3 describes the convergence properties of inexact Newton methods.

Theorem 8. Let $F(x)$ satisfy the properties described in Theorem 3, and let $p^{(t)}$ be a direction vector such that

$$\|\nabla F(x^{(t)}) - \nabla^2 F(x^{(t)})\,p^{(t)}\| \le \frac{\epsilon_1}{\sqrt{\kappa}}\,\|\nabla F(x^{(t)})\|,$$

where $0 < \epsilon_1 < 1$. Consider the iteration $x^{(t+1)} = x^{(t)} - p^{(t)}$.
(a) There exist a sufficiently small value $\gamma$, $0 < \nu^{(t)} < 1$, and $0 < \eta^{(t)} < 1$ such that when $\|x^{(t)} - x^\star\| \le \gamma$, it holds that

$$\|\nabla F(x^{(t+1)})\|_M \le (\epsilon_1 + 2\eta^{(t)})\,\frac{1+\nu^{(t)}}{1-\nu^{(t)}}\,\|\nabla F(x^{(t)})\|_M.$$

(b) If $\nabla^2 F(x)$ is also Lipschitz continuous with parameter $\hat L$ and $\{x^{(t)}\}$ satisfies Eqn. (6), then it holds that

$$\|\nabla F(x^{(t+1)})\|_M \le \epsilon_1\,\frac{1+\nu^{(t)}}{1-\nu^{(t)}}\,\|\nabla F(x^{(t)})\|_M + \frac{2\hat L\kappa}{\mu\sqrt{\mu}}\cdot\frac{(1+\nu^{(t)})^2}{1-\nu^{(t)}}\,\|\nabla F(x^{(t)})\|_M^2.$$
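A standard way to produce a direction satisfying the condition of Theorem 8 is to run the conjugate gradient method on $\nabla^2 F(x^{(t)})\,p = \nabla F(x^{(t)})$ and stop once the relative residual falls below $\epsilon_1/\sqrt{\kappa}$. A minimal sketch (Python/NumPy; our own) follows.

```python
import numpy as np

def cg_direction(H, g, eps1, kappa):
    """Conjugate gradient on H p = g, stopped as soon as
    ||g - H p|| <= (eps1 / sqrt(kappa)) * ||g||, i.e. Eqn. (4) holds."""
    tol = eps1 / np.sqrt(kappa) * np.linalg.norm(g)
    p = np.zeros_like(g)
    r = g.copy()            # residual g - H p for p = 0
    q = r.copy()            # search direction
    while np.linalg.norm(r) > tol:
        Hq = H @ q
        step = (r @ r) / (q @ Hq)
        p = p + step * q
        r_new = r - step * Hq
        q = r_new + ((r_new @ r_new) / (r @ r)) * q
        r = r_new
    return p
```

The same inner solver can serve any of the preceding algorithms, since the inexactness condition (4) is stated for a generic $H^{(t)}$.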
7. Empirical Study

In this section, we experimentally validate our theoretical results on the sketch size of sketch Newton and the sample size of regularized subsampled Newton. Experiments validating that the Lipschitz continuity condition on $\nabla^2 F(x)$ is unnecessary are given in the supplementary material.

7.1. Sketch Size of Sketch Newton

We first validate our theoretical result that the sketch size is independent of the condition number of the Hessian in sketch Newton. To control the condition number of the Hessian conveniently, we conduct the experiment on least squares regression,

$$\min_x\ \frac{1}{2}\|Ax - b\|^2, \qquad (13)$$

whose Hessian in each iteration is $A^T A$. In our experiment, $A$ is a $10000 \times 54$ matrix whose singular values are set as $\sigma_i = 1.2^i$, so the condition number of $A$ is $\kappa(A) = 1.2^{54} \approx 1.8741\times 10^4$. We run sketch Newton (Algorithm 1) with different sketching matrices and different values of the sketch size $\ell$, and report the results in Figure 1.

[Figure 1. Convergence of sketch Newton for different sketch sizes: (a) leverage score sampling with $\ell = 8d$, $10d$, $15d$; (b) sparse sketching with $\ell = 5d$, $8d$, $10d$.]

From Figure 1, we see that sketch Newton performs well whenever the sketch size $\ell$ is several times $d$, for all the sketching matrices considered, and the corresponding algorithms converge linearly. This matches our theory that a sketch size independent of the condition number of the Hessian suffices for a linear convergence rate. In contrast, the theoretical result of (Xu et al., 2016) requires a sketch size of $\ell = d\,\kappa(A) = 1.02\times 10^6$, which is larger than $n = 10^4$.

7.2. Sample Size of Regularized Subsampled Newton

We again use the least squares regression problem of Eqn. (13) to validate the theory that adding a regularizer is an effective way to reduce the sample size in subsampled Newton while preserving convergence. We let $A \in \mathbb{R}^{n\times d}$ with $n = 8000$ and $d = 5000$; sketch Newton is not applicable here because $n$ and $d$ are close to each other. In our experiment, we vary the sample size $|S|$, and for each $|S|$ we try different regularization parameters $\alpha$ and different target ranks $r$. We report the results in Figure 2.

[Figure 2. Convergence of regularized subsampled Newton (varying $\alpha$) and NewSamp (varying $r$): (a) sample size $|S| = 100$, with $\alpha \in \{19000, 20000, 30000\}$ and $r \in \{20, 40, 55\}$; (b) sample size $|S| = 600$, with $\alpha \in \{2500, 3000, 4000\}$ and $r \in \{20, 40, 55\}$.]

As we can see, if the sample size $|S|$ is small, we should choose a large $\alpha$; otherwise the algorithm diverges. However, if the regularization parameter $\alpha$ is too large, the algorithm converges slowly. Increasing the sample size and choosing a proper regularizer improves convergence markedly: with $|S| = 600$, only about 1200 iterations are needed to obtain a precise solution, versus about 8000 iterations with $|S| = 100$. Similarly, if the sample size $|S|$ is small, we should choose a small target rank, otherwise NewSamp may diverge; and if the target rank is not chosen properly, NewSamp converges poorly. Furthermore, Figure 2 shows that the two algorithms have similar convergence behavior, which validates the theoretical result that NewSamp amounts to choosing $\alpha$ automatically. Our empirical results match the theoretical analysis of Section 5.2 very well.

8. Conclusion

In this paper, we have proposed a framework for analyzing the local convergence properties of second order methods, both stochastic and deterministic. This framework reveals convergence properties of the subsampled Newton and sketch Newton methods that were previously unknown. Most importantly, our analysis lays a theoretical foundation for several important stochastic second order methods. We believe this framework may also provide useful insights for developing new subsampled Newton-type algorithms, which we leave for future work.
Acknowledgements

Ye has been supported by the National Natural Science Foundation of China (Grant Nos. 11426026, 61632017, 61173011) and a Project 985 grant of Shanghai Jiao Tong University. Luo and Zhang have been supported by the National Natural Science Foundation of China (No. 61572017), the Natural Science Foundation of Shanghai City (No. 15ZR1424200), and a Microsoft Research Asia Collaborative Research Award.

References

Agarwal, Naman, Bullins, Brian, and Hazan, Elad. Second order stochastic optimization in linear time. arXiv preprint arXiv:1602.03943, 2016.

Byrd, Richard H, Chin, Gillian M, Neveitt, Will, and Nocedal, Jorge. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977-995, 2011.

Clarkson, Kenneth L and Woodruff, David P. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 81-90. ACM, 2013.

Cotter, Andrew, Shamir, Ohad, Srebro, Nati, and Sridharan, Karthik. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems, pp. 1647-1655, 2011.

Erdogdu, Murat A and Montanari, Andrea. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems, pp. 3034-3042, 2015.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315-323, 2013.

Li, Mu, Zhang, Tong, Chen, Yuqiang, and Smola, Alexander J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 661-670. ACM, 2014.

Meng, Xiangrui and Mahoney, Michael W. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 91-100. ACM, 2013.

Nemirovski, Arkadi, Juditsky, Anatoli, Lan, Guanghui, and Shapiro, Alexander. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

Nocedal, Jorge and Wright, Stephen. Numerical Optimization. Springer Science & Business Media, 2006.

Pilanci, Mert and Wainwright, Martin J. Newton sketch: A linear-time optimization algorithm with linear-quadratic convergence. arXiv preprint arXiv:1505.02250, 2015.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Roosta-Khorasani, Farbod and Mahoney, Michael W. Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016.

Roux, Nicolas L, Schmidt, Mark, and Bach, Francis R. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663-2671, 2012.

Schmidt, Mark, Roux, Nicolas Le, and Bach, Francis. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013.

Woodruff, David P. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.

Xu, Peng, Yang, Jiyan, Roosta-Khorasani, Farbod, Ré, Christopher, and Mahoney, Michael W. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, pp. 3000-3008, 2016.
Zhang, Lijun, Mahdavi, Mehrdad, and Jin, Rong. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26 (NIPS), pp. 980-988, 2013.