Journal of Machine Learning Research 22 (2021) 1-41. Submitted 10/19; Revised 1/21; Published 3/21.

Approximate Newton Methods

Haishan Ye YEHAISHAN@XJTU.EDU.CN
Center for Intelligent Decision-Making and Machine Learning, School of Management, Xi'an Jiaotong University, Xi'an, China

Luo Luo LUOLUO@UST.HK
Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong

Zhihua Zhang ZHZHANG@MATH.PKU.EDU.CN
School of Mathematical Sciences, Peking University, 5 Yiheyuan Road, Beijing, China

Editor: Qiang Liu

Abstract

Many machine learning models involve solving optimization problems. Thus, it is important to address large-scale optimization problems in big data applications. Recently, subsampled Newton methods have attracted much attention: they rectify the main weakness of the ordinary Newton method, its high cost per iteration, while retaining a fast convergence rate. Other efficient stochastic second-order methods have also been proposed. However, the convergence properties of these methods are still not well understood. There are also several important gaps between the current convergence theory and the empirical performance in real applications. In this paper, we aim to fill these gaps. We propose a unifying framework to analyze both local and global convergence properties of second-order methods. Accordingly, we present theoretical results which match the empirical performance in real applications well.

Keywords: Approximate Newton, Stochastic Second-order, Hessian Approximation

1. Introduction

Mathematical optimization is an important pillar of machine learning. We consider the following optimization problem:

    min_{x ∈ R^d} F(x) = (1/n) Σ_{i=1}^n f_i(x),    (1)

where the f_i(x) are smooth functions. Many machine learning models can be expressed as (1), where each f_i is the loss with respect to (w.r.t.) the i-th training sample.
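To make the finite-sum form (1) concrete, the following is a minimal sketch of building F(x) = (1/n) Σ_i f_i(x) for ℓ2-regularization-free logistic regression, one of the examples mentioned below. The helper name `make_logistic_finite_sum` and the random data are ours, for illustration only.

```python
import numpy as np

def make_logistic_finite_sum(A, b):
    """Build F(x) = (1/n) * sum_i f_i(x) with the logistic loss
    f_i(x) = log(1 + exp(-b_i * a_i^T x)); each f_i is the loss
    w.r.t. the i-th training sample (a_i, b_i)."""
    n = A.shape[0]

    def f_i(x, i):
        return np.log1p(np.exp(-b[i] * (A[i] @ x)))

    def F(x):
        return sum(f_i(x, i) for i in range(n)) / n

    return F

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))       # 50 samples, d = 3
b = np.sign(rng.standard_normal(50))   # labels in {-1, +1}
F = make_logistic_finite_sum(A, b)
```

At x = 0 every summand equals log 2, so F(0) = log 2 regardless of the data, which gives a quick sanity check on the construction.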
There are many examples, such as logistic regression, smoothed support vector machines, neural networks, and graphical models.

©2021 Haishan Ye, Luo Luo, and Zhihua Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/19-870.html.

Many optimization algorithms for solving the problem in (1) are based on the following iteration:

    x^{(t+1)} = x^{(t)} − s_t Q_t g(x^{(t)}),  t = 0, 1, 2, . . . ,

where s_t > 0 is the step length. If Q_t is the identity matrix and g(x^{(t)}) = ∇F(x^{(t)}), the resulting procedure is called gradient descent (GD), which achieves sublinear convergence for a general smooth convex objective function and linear convergence for a smooth strongly convex objective function. When n is large, the full gradient method is inefficient because its iteration cost scales linearly in n. Consequently, stochastic gradient descent (SGD) has been a typical alternative (Robbins and Monro, 1951; Li et al., 2014; Cotter et al., 2011). To achieve a cheaper cost in each iteration, such a method constructs an approximate gradient on a small mini-batch of data. However, its convergence rate can be significantly slower than that of the full gradient methods (Nemirovski et al., 2009). Thus, many efforts have been made to devise modifications that achieve the convergence rate of the full gradient method while keeping a low iteration cost (Johnson and Zhang, 2013; Roux et al., 2012; Schmidt et al., 2017; Zhang et al., 2013). If Q_t is a d × d positive definite matrix containing curvature information, this formulation leads us to second-order methods. It is well known that second-order methods enjoy a superior convergence rate in both theory and practice in contrast to first-order methods, which only make use of the gradient information.
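The template iteration above can be exercised directly. Below is a toy instance with Q_t = I (plain gradient descent) on a strongly convex quadratic; the problem data are illustrative and not from the paper.

```python
import numpy as np

# The template x^{(t+1)} = x^{(t)} - s_t * Q_t * g(x^{(t)}) with Q_t = I,
# applied to F(x) = 0.5 * x^T H x, a strongly convex quadratic.
H = np.diag([1.0, 10.0])          # Hessian; condition number kappa = 10
grad = lambda x: H @ x            # g(x) = full gradient of F
s = 1.0 / 10.0                    # constant step length 1/L, L = lambda_max(H)

x = np.array([1.0, 1.0])
for t in range(200):
    x = x - s * grad(x)           # Q_t = I: gradient descent step
# the iterates contract linearly toward the minimizer x* = 0
```

Each coordinate error shrinks by the factor (1 − s·λ_i) per step, which is the linear rate that GD achieves on smooth strongly convex objectives.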
The standard Newton method, where Q_t = [∇²F(x^{(t)})]^{−1}, g(x^{(t)}) = ∇F(x^{(t)}), and s_t = 1, achieves a quadratic convergence rate for smooth strongly convex objective functions. However, the Newton method takes O(nd² + d³) cost per iteration, so it becomes extremely expensive when n or d is very large. As a result, one tries to construct an approximation of the Hessian such that the update is computationally feasible while keeping sufficient second-order information. One class of such methods is quasi-Newton methods, which generalize the secant method for finding a root of the first derivative to multidimensional problems. The celebrated Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm and its limited-memory version (L-BFGS) are the most popular and widely used (Nocedal and Wright, 2006). They take O(nd + d²) cost per iteration. Recently, for the case n ≫ d, a class of methods called subsampled Newton methods has been proposed, which define an approximate Hessian matrix with a small subset of samples. The most naive approach is to sample a subset of the functions f_i randomly (Roosta-Khorasani and Mahoney, 2019; Byrd et al., 2011; Xu et al., 2016) to construct a subsampled Hessian. Erdogdu and Montanari (2015) proposed a regularized subsampled Newton method called NewSamp. When the Hessian can be written as ∇²F(x) = [B(x)]^T B(x), where B(x) is an available n × d matrix, Pilanci and Wainwright (2017) used sketching techniques to approximate the Hessian and proposed the sketch Newton method. Similarly, Xu et al. (2016) proposed to sample rows of B(x) with a non-uniform probability distribution. Agarwal et al. (2017) proposed an algorithm called LiSSA to approximate the inverse of the Hessian directly. Although the convergence performance of stochastic second-order methods has been analyzed, their convergence properties are still not well understood. Several important gaps lie between the convergence theory and the practical performance of these algorithms in real applications.
First, there is the question of the necessity of Lipschitz continuity of the Hessian. In previous work, to achieve a linear-quadratic convergence rate, stochastic second-order methods all assume that ∇²F(x) is Lipschitz continuous. However, in some scenarios without this assumption, they may still converge to an optimal point. For example, Erdogdu and Montanari (2015) used NewSamp to successfully train the smoothed SVM with the ℓ2-hinge loss, whose Hessian is not Lipschitz continuous.

Second, there is the question of the sketching size of sketch Newton methods. To obtain linear convergence, the sketching size is O(dκ²) in Pilanci and Wainwright (2017), later improved to O(dκ) in Xu et al. (2016), where κ is the condition number of the Hessian matrix in question. Recently, Wang et al. (2018) extended the sketch Newton method to distributed optimization and achieved a fast convergence rate and low communication cost. However, sketch Newton empirically performs well even when the Hessian matrix is ill-conditioned: a sketching size of several times d, or several tens of times d, suffices for a linear convergence rate in unconstrained optimization. In contrast, the theoretical results of Pilanci and Wainwright (2017); Xu et al. (2016); Wang et al. (2018) imply that the sketching size may exceed n in ill-conditioned cases.

Third, there is the question of the sample size in regularized subsampled Newton methods. In both Erdogdu and Montanari (2015) and Roosta-Khorasani and Mahoney (2016), the theoretical analysis shows that the sample size of regularized subsampled Newton methods should be set the same as in the conventional subsampled Newton method. In practice, however, adding a large regularizer can substantially reduce the sample size while keeping convergence. Thus, practice does not agree with the extant theoretical analysis (Erdogdu and Montanari, 2015; Roosta-Khorasani and Mahoney, 2016).
In this paper, we aim to fill these gaps between the current theory and empirical performance. More specifically, we first cast these second-order methods into an algorithmic framework that we call approximate Newton. Accordingly, we propose a general result for the analysis of both local and global convergence properties of second-order methods. Based on this framework, we then give a detailed theoretical analysis that matches the empirical performance. We summarize our contributions as follows:

- We propose a unifying framework (Theorem 3 and Theorem 5) to analyze local and global convergence properties of second-order methods, including stochastic and deterministic versions. The convergence performance of second-order methods can be analyzed easily and systematically in this framework.

- We prove that the Lipschitz continuity condition on the Hessian is not necessary for achieving linear and superlinear convergence in variants of subsampled Newton, but it is needed to obtain quadratic convergence. This explains the phenomenon that NewSamp (Erdogdu and Montanari, 2015) can be used to train the smoothed SVM, for which the Lipschitz continuity condition on the Hessian is not satisfied. It also reveals the reason why previous stochastic second-order methods, such as subsampled Newton, sketch Newton, LiSSA, etc., all achieve a linear-quadratic convergence rate.

- We prove that the sketching size is independent of the condition number of the Hessian matrix, which explains why sketch Newton performs well even when the Hessian matrix is ill-conditioned.

- Based on our analysis framework, we provide a much tighter bound on the sample size of subsampled Newton methods. To the best of our knowledge, it is the tightest bound on subsampled Newton methods among extant results.

- We provide a theoretical guarantee that adding a regularizer is an effective way to reduce the sample size in subsampled Newton methods while keeping convergence.
Our theoretical analysis also shows that adding a regularizer leads to poorer convergence behavior as the sample size decreases.

The remainder of the paper is organized as follows. In Section 2 we present notation and preliminaries. In Section 3 we present a unifying framework for the local and global convergence analysis of second-order methods. In Section 4 we analyze the convergence properties of sketch Newton methods and prove that the sketching size is independent of the condition number of the Hessian matrix. In Section 5 we give the convergence behavior of several variants of subsampled Newton methods; in particular, we reveal the relationship among sample size, regularizer, and convergence rate. In Section 6, we validate our theoretical results experimentally. Finally, we conclude our work in Section 7. The proofs of the theorems are given in the appendices.

2. Notation and Preliminaries

Section 2.1 defines the notation used in this paper. Section 2.2 introduces matrix sketching techniques and their properties. Section 2.3 describes some important assumptions about objective functions.

2.1 Notation

Given a matrix A = [a_ij] ∈ R^{m×n} of rank ℓ and a positive integer k ≤ ℓ, its condensed SVD is given as

    A = UΣV^T = U_k Σ_k V_k^T + U_{\k} Σ_{\k} V_{\k}^T,

where U_k and U_{\k} contain the left singular vectors of A, V_k and V_{\k} contain the right singular vectors of A, and Σ = diag(σ_1, . . . , σ_ℓ) with σ_1 ≥ σ_2 ≥ · · · ≥ σ_ℓ > 0 contains the nonzero singular values of A. We let σ_max(A) denote the largest singular value and σ_min(A) the smallest nonzero singular value of A. Thus, the condition number of A is defined by κ(A) ≜ σ_max(A)/σ_min(A). If A is symmetric positive semi-definite, then U = V and the square root of A can be defined as A^{1/2} = UΣ^{1/2}U^T. It also holds that λ_i(A) = σ_i(A), where λ_i(A) is the i-th largest eigenvalue of A; in particular, λ_max(A) = σ_max(A) and λ_min(A) = σ_min(A). Additionally, ‖A‖_F ≜ (Σ_{i,j} a_ij²)^{1/2} = (Σ_i σ_i²)^{1/2} is the Frobenius norm of A, and ‖A‖ ≜ σ_1 is the spectral norm.
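The condition number κ(A) = σ_max(A)/σ_min(A) defined above can be computed directly from the singular values. A small numpy illustration on a toy matrix (chosen by us, not from the paper):

```python
import numpy as np

# kappa(A) = sigma_max(A) / sigma_min(A), using the nonzero singular
# values from the (condensed) SVD of A.
A = np.array([[3.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending
nonzero = sigma[sigma > 1e-12]               # keep only nonzero sigma_i
kappa = nonzero.max() / nonzero.min()        # here: 3 / 1 = 3
```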
Given a symmetric positive definite matrix M, ‖x‖_M ≜ ‖M^{1/2}x‖ is called the M-norm of x. Given square matrices A and B of the same size, we write A ⪯ B if B − A is positive semi-definite.

2.2 Randomized Sketching Matrices

We first give the ϵ_0-subspace embedding property, which will be used to sketch Hessian matrices. Then we list some useful types of randomized sketching matrices, including Gaussian projection (Halko et al., 2011; Johnson and Lindenstrauss, 1984), leverage score sampling (Drineas et al., 2006), and count sketch (Clarkson and Woodruff, 2013; Nelson and Nguyễn, 2013; Meng and Mahoney, 2013).

Definition 1 S ∈ R^{ℓ×n} is said to be an ϵ_0-subspace embedding matrix w.r.t. a fixed matrix A ∈ R^{n×d}, where d < n, if ‖SAx‖² = (1 ± ϵ_0)‖Ax‖² (i.e., (1 − ϵ_0)‖Ax‖² ≤ ‖SAx‖² ≤ (1 + ϵ_0)‖Ax‖²) for all x ∈ R^d.

From the definition of the ϵ_0-subspace embedding matrix, we can derive the following property directly.

Lemma 2 S ∈ R^{ℓ×n} is an ϵ_0-subspace embedding matrix w.r.t. the matrix A ∈ R^{n×d} if and only if

    (1 − ϵ_0)A^T A ⪯ A^T S^T S A ⪯ (1 + ϵ_0)A^T A.

Gaussian sketching matrix. The most classical sketching matrix is the Gaussian sketching matrix S ∈ R^{ℓ×n}, whose entries are i.i.d. normal with mean 0 and variance 1/ℓ. Owing to their well-known concentration properties (Woodruff, 2014), Gaussian random matrices are very attractive. Note that ℓ = O(d/ϵ_0²) is enough to guarantee the ϵ_0-subspace embedding property for any fixed matrix A ∈ R^{n×d}. Moreover, ℓ = O(d/ϵ_0²) is the tightest bound among the known types of sketching matrices. However, the Gaussian random matrix is usually dense, so it is costly to compute SA.

Leverage score sampling matrix. A leverage score sampling matrix S = DΩ ∈ R^{ℓ×n} w.r.t. A ∈ R^{n×d} is defined by sampling probabilities p_i, a sampling matrix Ω ∈ R^{ℓ×n}, and a diagonal rescaling matrix D ∈ R^{ℓ×ℓ}. Specifically, we construct S as follows. For every j = 1, . . . , ℓ, independently and with replacement, pick an index i from the set {1, 2, . . .
, n} with probability p_i, and set Ω_{ji} = 1 and Ω_{jk} = 0 for k ≠ i, as well as D_{jj} = 1/√(p_i ℓ). The sampling probabilities p_i are based on the leverage scores of A, defined as follows. Let V ∈ R^{n×d} be a column orthonormal basis of A, and let v_{i,*} denote the i-th row of V. Then q_i ≜ ‖v_{i,*}‖²/d, for i = 1, . . . , n, are the (normalized) leverage scores of A, and we set p_i = q_i. To achieve the ϵ_0-subspace embedding property w.r.t. A, ℓ = O(d log d/ϵ_0²) is sufficient.

Sparse embedding matrix. A sparse embedding matrix S ∈ R^{ℓ×n} is a matrix in which each column contains exactly one nonzero entry, uniformly sampled from {1, −1} (Clarkson and Woodruff, 2013). Hence, it is very efficient to compute SA, especially when A is sparse. To achieve the ϵ_0-subspace embedding property w.r.t. A ∈ R^{n×d}, ℓ = O(d²/ϵ_0²) is sufficient (Meng and Mahoney, 2013; Woodruff, 2014). Other sketching matrices, such as the subsampled randomized Hadamard transform (Drineas et al., 2012; Halko et al., 2011), as well as their properties, can be found in the survey (Woodruff, 2014).

2.3 Assumptions and Notions

In this paper we focus on the problem described in Eqn. (1). Moreover, we will make the following two assumptions.

Assumption 1 The objective function F is µ-strongly convex, that is,

    F(y) ≥ F(x) + [∇F(x)]^T (y − x) + (µ/2)‖y − x‖²,  for µ > 0.

Assumption 2 ∇F(x) is L-Lipschitz continuous, that is,

    ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖,  for L > 0.

Assumptions 1 and 2 imply that for any x ∈ R^d, we have µI ⪯ ∇²F(x) ⪯ LI, where I is the identity matrix of appropriate size. We define the condition number of the objective function F(x) as κ ≜ L/µ. Note that κ is an upper bound on the condition number of the Hessian matrix ∇²F(x) for any x. Furthermore, if ∇²F(x) is Lipschitz continuous, then we have

    ‖∇²F(x) − ∇²F(y)‖ ≤ ˆL‖x − y‖,

Algorithm 1 Approximate Newton.
1: Input: x^{(0)}, 0 < δ < 1, 0 < ϵ_0 < 1;
2: for t = 0, 1, . . .
until termination do
3: Construct an approximate Hessian H^{(t)} satisfying Condition (2);
4: Calculate p^{(t)} ≈ argmin_p { (1/2)p^T H^{(t)} p − p^T ∇F(x^{(t)}) };
5: Update x^{(t+1)} = x^{(t)} − p^{(t)};
6: end for

where ˆL > 0 is the Lipschitz constant of ∇²F(x). Throughout this paper, we use the notions of linear, superlinear, and quadratic convergence rates. In our paper, the convergence rates we use are defined w.r.t. ‖·‖_M, where M = ∇²F(x*) and x* is the optimal solution to Problem (1). A sequence of vectors {x^{(t)}} is said to converge linearly to a limit point x* if, for some 0 < ρ < 1,

    limsup_{t→∞} ‖x^{(t+1)} − x*‖_M / ‖x^{(t)} − x*‖_M = ρ.

Similarly, superlinear convergence and quadratic convergence are respectively defined as

    limsup_{t→∞} ‖x^{(t+1)} − x*‖_M / ‖x^{(t)} − x*‖_M = 0  and  limsup_{t→∞} ‖x^{(t+1)} − x*‖_M / ‖x^{(t)} − x*‖²_M = ρ.

We call the convergence rate linear-quadratic if the following condition holds:

    ‖x^{(t+1)} − x*‖_M ≤ ρ_1 ‖x^{(t)} − x*‖_M + ρ_2 ‖x^{(t)} − x*‖²_M,

where 0 < ρ_1 < 1 and 0 ≤ ρ_2.

3. Main Results

The existing variants of stochastic second-order methods share some important attributes. First, these methods, such as NewSamp (Erdogdu and Montanari, 2015), LiSSA (Agarwal et al., 2017), subsampled Newton with conjugate gradient (Byrd et al., 2011), and subsampled Newton with non-uniform sampling (Xu et al., 2016), all have the same convergence properties; that is, they have a linear-quadratic convergence rate. Second, they also share the same algorithmic procedure, summarized as follows. In each iteration, they first construct an approximate Hessian matrix H^{(t)} such that

    (1 − ϵ_0)H^{(t)} ⪯ ∇²F(x^{(t)}) ⪯ (1 + ϵ_0)H^{(t)},    (2)

where 0 ≤ ϵ_0 < 1. Then they solve the following optimization problem

    min_p (1/2)p^T H^{(t)} p − p^T ∇F(x^{(t)})    (3)

approximately or exactly to obtain the direction vector p^{(t)}. Finally, the update is given as x^{(t+1)} = x^{(t)} − p^{(t)}. Because of this shared procedure, we regard these stochastic second-order methods as approximate Newton methods. The detailed algorithmic description is listed in Algorithm 1.
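The procedure above (Algorithm 1) can be sketched in a few lines. This is a minimal illustration, with function names of our own choosing and an exactly solved quadratic model (step 4); the "approximate" Hessian here is a toy perturbation of the true one, standing in for a sketched or subsampled H^{(t)} satisfying Condition (2).

```python
import numpy as np

def approximate_newton(grad, approx_hessian, x0, iters=20):
    """Sketch of Algorithm 1: build H_t, solve the quadratic model
    min_p 0.5 p^T H_t p - p^T grad(x_t)  (i.e. H_t p = grad(x_t)),
    then update x_{t+1} = x_t - p."""
    x = x0
    for _ in range(iters):
        H = approx_hessian(x)               # should satisfy Condition (2)
        p = np.linalg.solve(H, grad(x))     # exact model minimizer
        x = x - p
    return x

# Toy strongly convex quadratic F(x) = 0.5 x^T M x - c^T x.
M = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([1.0, 2.0])
x_star = np.linalg.solve(M, c)              # true minimizer
x = approximate_newton(lambda z: M @ z - c,
                       lambda z: M + 0.1 * np.eye(2),  # inexact Hessian
                       np.zeros(2))
```

Even with the perturbed Hessian, the error contracts by a constant factor per iteration, matching the linear rate that Condition (2) with a constant ϵ_0 predicts.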
3.1 Local Convergence Analysis

In the following theorem, we propose a unifying framework that describes the convergence properties of the second-order optimization procedure depicted above.

Theorem 3 Let Assumptions 1 and 2 hold. Suppose that ∇²F(x) exists and is continuous in a neighborhood of a minimizer x*. Let H^{(t)} be a positive definite matrix that satisfies Eqn. (2) with 0 ≤ ϵ_0 < 1, and let p^{(t)} be an approximate solution of Problem (3) such that

    ‖∇F(x^{(t)}) − H^{(t)} p^{(t)}‖ ≤ (ϵ_1/κ^{3/2}) ‖∇F(x^{(t)})‖,    (4)

where 0 < ϵ_1 < 1. Then Algorithm 1 has the following convergence properties.

(a) There exist a sufficiently small value γ and ν = o(1) such that when ‖x^{(t)} − x*‖_M ≤ γ, we have

    ‖x^{(t+1)} − x*‖_M ≤ (ϵ_0 + ϵ_1 + 2νµ^{−1} + 2κµ^{−1}ν(1 + νµ^{−1})) ‖x^{(t)} − x*‖_M.    (5)

Moreover, ν goes to 0 as x^{(t)} goes to x*.

(b) Furthermore, if ∇²F(x) is ˆL-Lipschitz continuous and x^{(t)} satisfies

    2µ^{−3/2} κ ˆL ‖x^{(t)} − x*‖_M ≤ 1,    (6)

then it holds that

    ‖x^{(t+1)} − x*‖_M ≤ (ϵ_0 + ϵ_1) ‖x^{(t)} − x*‖_M + 4(κ + 1)µ^{−3/2} ˆL ‖x^{(t)} − x*‖²_M.    (7)

Remark 4 We give the convergence properties in terms of the sequence ‖x^{(t)} − x*‖_M, in contrast to ‖x^{(t)} − x*‖ used in previous works (Erdogdu and Montanari, 2015; Pilanci and Wainwright, 2017; Roosta-Khorasani and Mahoney, 2019). Because of this difference, our unifying framework provides the foundation of many stronger theoretical results on approximate Newton methods, which we will discuss in the next sections. For example, consider the case that F(x) is quadratic, which implies ˆL = 0, with Hessian M = ∇²F(x) = A^T A for A ∈ R^{n×d}. Setting ϵ_1 = 0, Eqn. (7) reduces to

    ‖x^{(t+1)} − x*‖_M ≤ ϵ_0 ‖x^{(t)} − x*‖_M.

To achieve a linear convergence rate with 0 < ϵ_0 < 1, we can construct an approximate Hessian H^{(t)} = A^T S^T S A with S ∈ R^{ℓ×n} and ℓ = O(d/ϵ_0²) for a Gaussian sketching matrix. In contrast, Eqn. (14) of Pilanci and Wainwright (2017) reduces to

    ‖x^{(t+1)} − x*‖ ≤ ϵ'_0 κ ‖x^{(t)} − x*‖.

To achieve a linear convergence rate 0 < ϵ_0 < 1, that is, ϵ'_0 κ ≤ ϵ_0, it requires ϵ'_0 ≤ ϵ_0 κ^{−1}.
Combining this with the properties of the Gaussian sketching matrix, the sketching size ℓ is required to be ℓ = O(dκ²/ϵ_0²). We observe that the sketching size obtained by our theory is a factor κ² smaller than the one derived in Pilanci and Wainwright (2017).

From Theorem 3, we can draw some important insights. First, Theorem 3 provides sufficient conditions for different convergence rates, including linear, superlinear, and quadratic rates. If (ϵ_0 + ϵ_1) is a constant less than 1, then the sequence {x^{(t)}} converges linearly with rate (ϵ_0 + ϵ_1). Furthermore, if we set ϵ_0 = ϵ_0(t) and ϵ_1 = ϵ_1(t) such that ϵ_0(t) and ϵ_1(t) decrease to 0 as t increases, then the sequence {x^{(t)}} converges superlinearly. If ∇²F(x) is ˆL-Lipschitz and ϵ_0 + ϵ_1 = 0, then

Algorithm 2 Approximate Newton with backtracking line search.
1: Input: x^{(0)}, 0 < α < 0.5, 0 < β < 1;
2: for t = 0, 1, . . . until termination do
3: Construct an approximate Hessian H^{(t)} satisfying Condition (2);
4: Calculate p^{(t)} ≈ argmin_p { (1/2)p^T H^{(t)} p − p^T ∇F(x^{(t)}) };
5: Line search: s = 1;
6: while F(x^{(t)} − sp^{(t)}) > F(x^{(t)}) − αs[∇F(x^{(t)})]^T p^{(t)} do
7:   s = βs;
8: end while
9: Update x^{(t+1)} = x^{(t)} − sp^{(t)};
10: end for

{x^{(t)}} converges quadratically. In this case, approximate Newton reduces to the exact Newton method. Second, Theorem 3 makes it clear that Lipschitz continuity of the Hessian is not necessary for the linear and superlinear convergence of stochastic second-order methods, including the subsampled Newton method, sketch Newton, NewSamp, etc. This reveals the reason why NewSamp can be used to train the smoothed SVM, where Lipschitz continuity of the Hessian matrix is not satisfied (Erdogdu and Montanari, 2015). The Lipschitz continuity condition is only needed to obtain quadratic or linear-quadratic convergence.
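One step of the backtracking line search in Algorithm 2 can be sketched as follows. The function name and the toy quadratic are ours; the Armijo test and the shrink factor β follow the algorithm above.

```python
import numpy as np

def backtracking_newton_step(F, grad, H, x, alpha=0.3, beta=0.5):
    """One iteration of Algorithm 2: solve H p = grad(x), then shrink
    the step s until the Armijo condition
        F(x - s*p) <= F(x) - alpha * s * grad(x)^T p
    holds, and return x - s*p."""
    g = grad(x)
    p = np.linalg.solve(H, g)      # model minimizer, a descent direction
    s = 1.0
    while F(x - s * p) > F(x) - alpha * s * (g @ p):
        s = beta * s               # backtrack
    return x - s * p

# Toy quadratic; with the exact Hessian the unit step is accepted.
M = np.array([[2.0, 0.0], [0.0, 5.0]])
F = lambda z: 0.5 * z @ M @ z
grad = lambda z: M @ z
x = backtracking_newton_step(F, grad, M, np.array([1.0, 1.0]))
```

With an exact Hessian on a quadratic, p points straight at the minimizer and s = 1 passes the Armijo test immediately, which illustrates the claim that the unit step suffices once the damped phase is over.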
This explains the phenomenon that LiSSA (Agarwal et al., 2017), NewSamp (Erdogdu and Montanari, 2015), subsampled Newton with non-uniform sampling (Xu et al., 2016), and sketch Newton (Pilanci and Wainwright, 2017) all have linear-quadratic convergence rates: they all assume that the Hessian is Lipschitz continuous. In fact, it is well known that Lipschitz continuity of ∇²F(x) is not necessary to achieve a linear or superlinear convergence rate for inexact Newton methods. However, our work is the first to prove that this result still holds with an approximate Hessian. Third, the unifying framework of Theorem 3 covers not only stochastic second-order methods but also deterministic ones. For example, letting H^{(t)} = ∇²F(x^{(t)}) and using conjugate gradient to compute p^{(t)}, we obtain the famous Newton-CG method. We observe that, with different choices of H^{(t)} and different ways to calculate p^{(t)}, one obtains different kinds of second-order methods.

3.2 Global Convergence Analysis

The preceding theory is local: approximate Newton achieves a fast convergence rate once the iterates enter a suitable basin around the optimal point. In this section, we obtain global convergence results for self-concordant functions. The self-concordance assumption is widely studied in the global convergence analysis of Newton methods (Pilanci and Wainwright, 2017; Boyd and Vandenberghe, 2004). Note that a closed, convex function F: R^d → R is called self-concordant if

    −2‖v‖_x ∇²F(x) ⪯ (d/dα) ∇²F(x + αv)|_{α=0} ⪯ 2‖v‖_x ∇²F(x)

for all x in the domain of F(x) and all v ∈ R^d. Here ‖v‖_x = (v^T ∇²F(x) v)^{1/2} is the local norm. To achieve global convergence, the approximate Newton method should be combined with a line search. In the damped phase, where [∇F(x^{(t)})]^T p^{(t)} is large, a line search is applied to guarantee the

Algorithm 3 Sketch Newton.
1: Input: x^{(0)}, 0 < δ < 1, 0 < ϵ_0 < 1;
2: for t = 0, 1, . . .
until termination do
3: Construct an ϵ_0-subspace embedding matrix S w.r.t. B(x^{(t)}), where ∇²F(x^{(t)}) = [B(x^{(t)})]^T B(x^{(t)}), and calculate H^{(t)} = [B(x^{(t)})]^T S^T S B(x^{(t)});
4: Calculate p^{(t)} ≈ argmin_p { (1/2)p^T H^{(t)} p − p^T ∇F(x^{(t)}) };
5: Update x^{(t+1)} = x^{(t)} − p^{(t)};
6: end for

convergence of approximate Newton methods. Once [∇F(x^{(t)})]^T p^{(t)} is sufficiently small, the step size s = 1 keeps approximate Newton converging at a linear rate. The detailed algorithmic description of approximate Newton with backtracking line search is presented in Algorithm 2. In the following theorem, we provide the iteration complexity of Algorithm 2 for achieving ϵ-suboptimality.

Theorem 5 Assume the objective function F(x) is self-concordant, and H^{(t)} is a positive definite matrix satisfying Eqn. (2) with 0 ≤ ϵ_0 < 1. Let p^{(t)} be a descent direction satisfying Eqn. (4). To achieve F(x^{(T)}) − F(x*) ≤ ϵ, the iteration complexity of the approximate Newton method with backtracking line search (Algorithm 2) is at most

    T = (F(x^{(0)}) − F(x*))/η + 2/(1 − ϵ_0 − 2ϵ_1κ^{−1}) · log_{1/(ϵ_0 + 2ϵ_1κ^{−1})}(1/ϵ),    (8)

where η is defined as

    η = αβ · ρ²(1 − ϵ_0)(1 − ϵ_0 − 2ϵ_1κ^{−1})² / (144 + 12ρ(1 − ϵ_0)^{1/2}(1 − ϵ_0 − 2ϵ_1κ^{−1})),

with

    ρ = (1 − ϵ_1κ^{−1}(1 + ϵ_0)/(1 − ϵ_0)) / ((1 + ϵ_0)^{1/2}(1 + ϵ_1κ^{−1}(1 + ϵ_0)/(1 − ϵ_0))).

Remark 6 In the above theorem, the iteration complexity of approximate Newton with line search still depends on the condition number of the objective function even though the objective is self-concordant. This dependence on the condition number is caused by the approximation of [H^{(t)}]^{−1}∇F(x). If ϵ_1 = 0 in Eqn. (4), we obtain η = αβ(1 − ϵ_0)³ / (144(1 + ϵ_0) + 12(1 + ϵ_0)^{1/2}(1 − ϵ_0)^{3/2}), which is independent of the condition number. Thus, the total complexity is independent of the condition number.

4. The Sketch Newton Method

In this section, we use Theorem 3 to analyze the convergence properties of sketch Newton, which utilizes the sketching technique to approximate a Hessian of the form

    ∇²F(x) = [B(x)]^T B(x),    (9)

where B(x) is an explicitly available n × d matrix.
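When the Hessian has the form (9), the sketched approximation H = (SB)^T(SB) can be built and checked numerically. The sketch below uses a Gaussian sketching matrix (entries N(0, 1/ℓ), as in Section 2.2); the sizes and the empirical error measure are our illustrative choices.

```python
import numpy as np

# Sketch the Hessian B^T B as H = (S B)^T (S B) with a Gaussian
# sketching matrix S of size l x n, entries i.i.d. N(0, 1/l).
rng = np.random.default_rng(42)
n, d, ell = 2000, 5, 400
B = rng.standard_normal((n, d))
S = rng.standard_normal((ell, n)) / np.sqrt(ell)

exact = B.T @ B
H = (S @ B).T @ (S @ B)

# Empirical spectral-approximation error: eigenvalues of H^{-1} * exact
# lie in [1/(1+eps), 1/(1-eps)] when Condition (2) holds with eps.
w = np.linalg.eigvals(np.linalg.solve(H, exact)).real
eps_hat = np.abs(w - 1.0).max()
```

With ℓ on the order of d/ϵ_0² the measured error is small and, notably, does not depend on the condition number of B^T B, which is the point of Theorem 7 below.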
Our result can easily be extended to the case that ∇²F(x) = [B(x)]^T B(x) + Q(x), where Q(x) is a positive semi-definite matrix arising from the Hessian of a regularizer.

Table 1: Comparison with previous work using the leverage score sampling matrix.

    Reference                     | Sketching size | Condition number free?
    Pilanci and Wainwright (2017) | O(dκ² log d)   | No
    Xu et al. (2016)              | O(dκ log d)    | No
    Our result (Theorem 7)        | O(d log d)     | Yes

The sketch Newton method constructs the approximate Hessian matrix as follows:

    H^{(t)} = [S^{(t)}B(x^{(t)})]^T S^{(t)}B(x^{(t)}),    (10)

where S^{(t)} ∈ R^{ℓ×n} is a randomized sketching matrix. The approximate Newton method with such a Hessian approximation is referred to as the sketch Newton method. The detailed algorithmic description is given in Algorithm 3.

Theorem 7 Let F(x) satisfy the conditions described in Theorem 3, and assume the Hessian matrix is given as in Eqn. (9). Let 0 < δ < 1, 0 < ϵ_0 < 1/2, and 0 ≤ ϵ_1 < 1 be given, and let S^{(t)} ∈ R^{ℓ×n} be an ϵ_0-subspace embedding matrix w.r.t. B(x^{(t)}) with probability at least 1 − δ. Then sketch Newton (Algorithm 3) has the following convergence properties:

(a) There exist a sufficiently small value γ and ν = o(1) such that when ‖x^{(t)} − x*‖_M ≤ γ, each iteration satisfies Eqn. (5) with probability at least 1 − δ.

(b) If ∇²F(x^{(t)}) is also Lipschitz continuous and {x^{(t)}} satisfies Eqn. (6), then each iteration satisfies Eqn. (7) with probability at least 1 − δ.

(c) If F(x) is self-concordant, the iteration complexity of sketch Newton with backtracking line search (Algorithm 2 with H^{(t)} constructed as in Eqn. (10)) is upper bounded as in Eqn. (8).

Theorem 7 directly provides a bound on the sketching size. Taking the leverage score sampling matrix as an example, the sketching size ℓ = O(d log d/ϵ_0²) is sufficient. We compare our theoretical bound on the sketching size with those of Pilanci and Wainwright (2017) and Xu et al. (2016). We present Eqn.
(14) of Pilanci and Wainwright (2017) as follows:

    ‖x^{(t+1)} − x*‖ ≤ ϵ'_0 κ ‖x^{(t)} − x*‖ + 4ˆL ‖x^{(t)} − x*‖².

To achieve linear convergence with rate ϵ_0, it requires ϵ'_0 κ ≤ ϵ_0, that is, ϵ'_0 ≤ ϵ_0 κ^{−1}. Combining this with the properties of leverage score sampling, the required sketching size is ℓ = O(d log d/(ϵ'_0)²) = O(dκ² log d/ϵ_0²). Similarly, we rewrite Eqn. (5) in Lemma 2 of Xu et al. (2016) as follows:

    ‖x^{(t+1)} − x*‖ ≤ (3ϵ'_0 √κ / (1 − ϵ'_0)) ‖x^{(t)} − x*‖ + (2ˆL / ((1 − ϵ'_0)µ)) ‖x^{(t)} − x*‖².

To achieve linear convergence with rate ϵ_0, it requires 3ϵ'_0 √κ / (1 − ϵ'_0) ≤ ϵ_0, that is, ϵ'_0 ≤ ϵ_0/(3√κ + ϵ_0). Combining this with the properties of leverage score sampling, the required sketching size is ℓ = O(d log d/(ϵ'_0)²) = O(dκ log d/ϵ_0²). We list the detailed comparison in Table 1. As we can see, our sketching size is much smaller than the other two, especially when the Hessian matrix is ill-conditioned. Theorem 7 shows that the sketching size ℓ is independent of the condition number of the Hessian matrix ∇²F(x), just as shown in Table 1. This explains the phenomenon that, when the Hessian matrix is ill-conditioned, sketch Newton performs well even when the sketching size is only several times d. Furthermore, the iteration complexity of sketch Newton with backtracking line search is similar to that of Pilanci and Wainwright (2017). In particular, when ϵ_1 = 0, Eqn. (8) reduces to

    T = (F(x^{(0)}) − F(x*))/η + 4 log(1/ϵ),  with  η = αβ(1 − ϵ_0)³ / (12(12(1 + ϵ_0) + (1 + ϵ_0)^{1/2}(1 − ϵ_0)^{3/2})).

We observe that T is independent of the condition number of the objective function. A similar result can be found in Theorem 2 of Pilanci and Wainwright (2017). Theorem 7 also opens the possibility of achieving an asymptotically superlinear rate by using an iteration-dependent sketching accuracy ϵ_0 = ϵ_0(t). In particular, we present the following corollary.

Corollary 8 Assume F(x) satisfies the properties described in Theorem 3. Consider the approximate Hessian H^{(t)} constructed in Eqn.
(10) with the iteration-dependent sketching accuracy ϵ_0(t) = 1/log(1 + t), and p^{(t)} = [H^{(t)}]^{−1}∇F(x^{(t)}). If the initial point x^{(0)} is close enough to the optimal point x*, then the sequence {x^{(t)}} of sketch Newton (Algorithm 1 with H^{(t)} constructed as in Eqn. (10)) converges superlinearly.

5. The Subsampled Newton Method and Variants

In this section we apply Theorem 3 to analyze subsampled Newton methods. Dropping the assumption that the Hessian can be represented as in Eqn. (9), for subsampled Newton methods we assume that the Hessian is the average of individual Hessians:

    ∇²F(x) = (1/n) Σ_{i=1}^n ∇²f_i(x),  with ∇²f_i(x) ∈ R^{d×d}.    (11)

We assume that the f_i(x) and F(x) have the following properties:

    max_{1≤i≤n} ‖∇²f_i(x)‖ ≤ K < ∞,    (12)
    λ_min(∇²F(x)) ≥ µ > 0.    (13)

Accordingly, we can define a new type of condition number ˆκ = K/µ. In our earlier version (Ye et al., 2017), we applied Theorem 3 to analyze the convergence properties of subsampled Newton methods. However, the sample sizes obtained in that earlier version are no better than the ones obtained in (Erdogdu and Montanari, 2015; Roosta-Khorasani and Mahoney, 2019). In this version, we give a much tighter bound on the sample sizes of the subsampled Newton method and its variants. This improvement comes from using the matrix Chernoff inequality, rather than the matrix Bernstein inequality, to bound the sample size.

Algorithm 4 Subsampled Newton (SSN).
1: Input: x^{(0)}, 0 < δ < 1, 0 < ϵ_0 < 1;
2: Set the sample size |S| ≥ 3(K/µ) log(2d/δ)/ϵ_0²;
3: for t = 0, 1, . . . until termination do
4: Select a sample set S of size |S|, and set H^{(t)} = (1/|S|) Σ_{j∈S} ∇²f_j(x^{(t)});
5: Calculate p^{(t)} ≈ argmin_p { (1/2)p^T H^{(t)} p − p^T ∇F(x^{(t)}) };
6: Update x^{(t+1)} = x^{(t)} − p^{(t)};
7: end for

5.1 The Subsampled Newton Method

The subsampled Newton method is depicted in Algorithm 4, and the approximate Hessian is constructed by sampling:

    H^{(t)} = (1/|S|) Σ_{j∈S} ∇²f_j(x^{(t)}),  with j uniformly sampled from {1, . . . , n}.    (14)

First, we bound the required sample size via the properties of uniform sampling and the matrix Chernoff inequality.
The approximate Hessian in Eqn. (14) has the following property.

Lemma 9 Assume Eqns. (12) and (13) hold, and let 0 < δ < 1 and 0 < ϵ_0 < 1/2 be given. Suppose the sample size |S| satisfies

    |S| ≥ 3(K/µ) log(2d/δ)/ϵ_0².

Then, with probability at least 1 − δ, the approximate Hessian H^{(t)} constructed in Eqn. (14) satisfies

    (1 − ϵ_0)H^{(t)} ⪯ ∇²F(x^{(t)}) ⪯ (1 + ϵ_0)H^{(t)}.

The above lemma gives a bound on the sample size that guarantees the approximate Hessian in Eqn. (14) satisfies Eqn. (2). Once Eqn. (2) is satisfied, the local and global convergence properties of subsampled Newton can be derived from Theorem 3 and Theorem 5. We give its local and global convergence properties in the following theorem.

Theorem 10 Let F(x) satisfy the properties described in Theorem 3. Assume Eqns. (12) and (13) hold, and let 0 < δ < 1, 0 < ϵ_0 < 1/2, and 0 ≤ ϵ_1 < 1 be given. Suppose the sample size |S| satisfies |S| ≥ 3(K/µ) log(2d/δ)/ϵ_0², the approximate Hessian H^{(t)} is constructed as in Eqn. (14), and the direction vector p^{(t)} satisfies Eqn. (4). Then, for t = 1, . . . , T, Algorithm 4 has the following convergence properties:

(a) There exist a sufficiently small value γ and ν = o(1) such that when ‖x^{(t)} − x*‖_M ≤ γ, each iteration satisfies Eqn. (5) with probability at least 1 − δ.

(b) If ∇²F(x^{(t)}) is also Lipschitz continuous and {x^{(t)}} satisfies Eqn. (6), then each iteration satisfies Eqn. (7) with probability at least 1 − δ.

(c) If F(x) is self-concordant, the iteration complexity of subsampled Newton with backtracking line search (Algorithm 2 with H^{(t)} constructed as in Eqn. (14)) is upper bounded by Eqn. (8).

Algorithm 5 Regularized Subsampled Newton (RegSSN).
1: Input: x^{(0)}, 0 < δ < 1, regularization parameter ξ, sample size |S|;
2: for t = 0, 1, . . .
until termination do
3:   Select a sample set S of size |S|, and set H(t) = (1/|S|) Σ_{j∈S} ∇²f_j(x(t)) + ξI;
4:   Calculate p(t) ≈ argmin_p (1/2) pᵀH(t)p − pᵀ∇F(x(t));
5:   Update x(t+1) = x(t) − p(t);
6: end for

Remark 11 We see that subsampled Newton with sample size O((K/µ) ϵ0⁻²) can achieve a linear convergence rate ϵ0. In contrast, previous works require the sample size to be at least O((K²/µ²) ϵ0⁻²) (Erdogdu and Montanari, 2015; Roosta-Khorasani and Mahoney, 2019). This difference comes from the way the precision of the approximate Hessian is derived in Lemma 9. Previous works have to bound the approximation error as ∥∇²F(x(t)) − H(t)∥ ≤ ϵ0µ by the matrix Bernstein inequality, which leads to the sample size bound O((K²/µ²) ϵ0⁻²) (Lemma 2 of Roosta-Khorasani and Mahoney (2019)). In contrast, our method only requires Condition (2). Thus, we can use the matrix Chernoff inequality (Lemma 17), which depends on K/µ only linearly. A detailed proof of Theorem 10 can be found in Appendix E. Similarly, we can obtain tighter sample size bounds for other variants of subsampled Newton methods.

As we can see, Algorithm 4 has almost the same convergence properties as Algorithm 3, except for several minor differences. The main difference is the manner of constructing H(t), which should satisfy Eqn. (2). Algorithm 4 relies on the assumption that each ∇²f_i(x) is upper bounded (i.e., Eqn. (12) holds), while Algorithm 3 is built on the setting of the Hessian matrix as in Eqn. (9).

5.2 Regularized Subsampled Newton

In the ill-conditioned case (i.e., when κ̂ = K/µ is large), the subsampled Newton method in Algorithm 4 has to take a lot of samples because the sample size |S| depends linearly on K/µ. To overcome this problem, one resorts to a regularized subsampled Newton method which adds a regularizer to the original subsampled Hessian:

H(t) = (1/|S|) Σ_{j∈S} ∇²f_j(x(t)) + ξI, (15)

where ξ > 0 is the regularization parameter. The detailed algorithmic procedure of the regularized subsampled Newton method is described in Algorithm 5.
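To see the size of the gap discussed in Remark 11, one can evaluate both bounds for a hypothetical problem. The constants K, µ, d, δ, ϵ0 below are invented for illustration, and the Bernstein-style function keeps only the (K/µ)² leading-order dependence rather than the exact constants of the cited lemmas.

```python
import math

def chernoff_sample_size(K, mu, d, delta, eps0):
    # Lemma 9 (matrix Chernoff): |S| >= 3*(K/mu)*log(2d/delta)/eps0^2,
    # linear in the condition number K/mu.
    return math.ceil(3 * (K / mu) * math.log(2 * d / delta) / eps0 ** 2)

def bernstein_sample_size(K, mu, d, delta, eps0):
    # Bernstein-style bounds scale as (K/mu)^2; constants are omitted on
    # purpose since only the leading-order dependence matters here.
    return math.ceil((K / mu) ** 2 * math.log(2 * d / delta) / eps0 ** 2)

K, mu, d, delta, eps0 = 100.0, 1.0, 1000, 0.01, 0.25
s_chernoff = chernoff_sample_size(K, mu, d, delta, eps0)
s_bernstein = bernstein_sample_size(K, mu, d, delta, eps0)
```

For this (illustrative) setting the Chernoff-based bound is smaller by roughly a factor of (K/µ)/3, and the gap grows linearly with the condition number.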
In the following analysis, we prove that adding a regularizer is an effective way to reduce the sample size while keeping convergence in theory. First, we prove that if the sample size |S| is properly chosen, the approximate Hessian in Eqn. (15) satisfies Condition (2).

Lemma 12 Assume Eqns. (12) and (13) hold, and let 0 < δ < 1 and ξ > 0 be given. Assume the sample size |S| satisfies |S| = 18(K + ξ) log(2d/δ) / (µ + ξ), and H(t) is constructed as in Algorithm 5. Then, with probability at least 1 − δ, the approximate Hessian H(t) constructed in Eqn. (15) satisfies

(1 − ϵ0)H(t) ⪯ ∇²F(x(t)) ⪯ (1 + ϵ0)H(t), with ϵ0 = max{(3ξ + µ)/(3ξ + 3µ), L/(2ξ)}.

Combining the above lemma with Theorem 3 and Theorem 5, we can obtain the local and global convergence properties of Algorithm 5 in the following theorem.

Theorem 13 Let F(x) satisfy the properties described in Theorem 3. Assume Eqns. (12) and (13) hold, and let 0 < δ < 1, 0 ≤ ϵ1 < 1 and ξ > 0 be given. Assume the sample size |S| satisfies |S| = 18(K + ξ) log(2d/δ) / (µ + ξ), and H(t) is constructed as in Algorithm 5. Define

ϵ0 = max{(3ξ + µ)/(3ξ + 3µ), L/(2ξ)}, (16)

which implies that 0 < ϵ0 < 1. Moreover, assume the direction vector p(t) satisfies Eqn. (4). Then Algorithm 5 has the following convergence properties:
(a) There exist a sufficiently small value γ and ν = o(1) such that when ∥x(t) − x*∥_M ≤ γ, each iteration satisfies Eqn. (5) with probability at least 1 − δ.
(b) If ∇²F(x(t)) is also Lipschitz continuous and {x(t)} satisfies Eqn. (6), then each iteration satisfies Eqn. (7) with probability at least 1 − δ.
(c) If F(x) is self-concordant, the iteration complexity of the regularized subsampled Newton method with backtracking line search (Algorithm 2 with H(t) constructed in Eqn. (15)) is upper bounded by Eqn. (8).

In Theorem 13, the parameter ϵ0 mainly decides the convergence properties of Algorithm 5. It is determined by two terms, as shown in Eqn. (16). These two terms depict the relationship among the sample size, the regularizer ξI, and the convergence rate.
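The tradeoff in Theorem 13 can be checked numerically: with the sample-size formula of Lemma 12 and the first term of Eqn. (16), increasing ξ shrinks |S| but pushes the rate term toward 1. The constants below are illustrative, and only the first term of the max in Eqn. (16) is evaluated.

```python
import math

K, mu, d, delta = 100.0, 1.0, 1000, 0.01   # illustrative constants

def sample_size(xi):
    # Lemma 12: |S| = 18*(K + xi)*log(2d/delta) / (mu + xi),
    # decreasing in xi whenever K > mu.
    return 18 * (K + xi) * math.log(2 * d / delta) / (mu + xi)

def rate_term(xi):
    # First term of Eqn. (16): (3*xi + mu)/(3*xi + 3*mu), which tends to 1
    # as xi grows, i.e. convergence slows down.
    return (3 * xi + mu) / (3 * xi + 3 * mu)

xis = [0.1, 1.0, 10.0, 100.0]
sizes = [sample_size(xi) for xi in xis]
rates = [rate_term(xi) for xi in xis]
```

A larger ξ buys a smaller sample at the cost of a rate term closer to 1, which is exactly the balance discussed below.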
We can observe that the sample size |S| = 18(K + ξ) log(2d/δ) / (µ + ξ) decreases as ξ increases. Hence Theorem 13 gives a theoretical guarantee that adding the regularizer ξI is an effective approach to reducing the sample size when K/µ is large. Conversely, if we want to sample only a small fraction of the f_i's, we should choose a large ξ. Although a large ξ can reduce the sample size, this comes at the expense of a slower convergence rate: as we can see, (3ξ + µ)/(3ξ + 3µ) goes to 1 as ξ increases. At the same time, ϵ1 also has to decrease; otherwise ϵ0 + ϵ1 may exceed 1, which means that Algorithm 5 will not converge. In fact, the slower convergence rate of the regularized subsampled Newton method arises because the sample size becomes small, which implies that less curvature information is obtained. However, a small sample size also implies a low computational cost per iteration. Therefore, a proper regularizer that balances the cost of each iteration against the convergence rate is the key to the regularized subsampled Newton algorithm.

5.3 NewSamp

Erdogdu and Montanari (2015) proposed NewSamp, which is another regularized subsampled Newton method. NewSamp constructs its approximate Hessian as follows:

H(t) = H(t)_{|S|} + U_{\r}(λ̂(t)_{r+1} I − Λ̂_{\r}) Uᵀ_{\r}, (17)

where H(t)_{|S|} = (1/|S|) Σ_{j∈S} ∇²f_j(x(t)), and its eigenvalue decomposition is

H(t)_{|S|} = U Λ̂ Uᵀ = U_r Λ̂_r Uᵀ_r + U_{\r} Λ̂_{\r} Uᵀ_{\r}.

Algorithm 6 NewSamp.
1: Input: x(0), 0 < δ < 1, target rank r, sample size |S|;
2: for t = 0, 1, . . . until termination do
3:   Select a sample set S of size |S|, and compute H(t)_{|S|} = (1/|S|) Σ_{j∈S} ∇²f_j(x(t));
4:   Compute the rank-(r+1) truncated SVD of H(t)_{|S|} to get U_{r+1} and Λ̂_{r+1}, and construct H(t) = H(t)_{|S|} + U_{\r}(λ̂(t)_{r+1} I − Λ̂_{\r}) Uᵀ_{\r};
5:   Calculate p(t) ≈ argmin_p (1/2) pᵀH(t)p − pᵀ∇F(x(t));
6:   Update x(t+1) = x(t) − p(t);
7: end for

The detailed algorithm is depicted in Algorithm 6. The H(t) constructed above sets λ_i(H(t)) with i ≥ r + 1 to λ_{r+1}(H(t)_{|S|}), which aims to achieve a small condition number and hence a smaller sample size.
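A minimal sketch of the construction in Eqn. (17): diagonalize the subsampled Hessian, keep its top-r eigenpairs, and lift every remaining eigenvalue to λ̂_{r+1}. The helper name and the toy matrix are ours; a practical implementation would use a truncated rank-(r+1) decomposition instead of a full one.

```python
import numpy as np

def newsamp_hessian(H_S, r):
    # Eqn. (17): H = H_S + U_{\r}(lambda_{r+1} I - Lambda_{\r}) U_{\r}^T,
    # i.e. every eigenvalue beyond the r-th is replaced by lambda_{r+1}.
    vals, vecs = np.linalg.eigh(H_S)            # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]      # reorder to descending
    lam_r1 = vals[r]                            # the (r+1)-th largest eigenvalue
    new_vals = np.concatenate([vals[:r], np.full(len(vals) - r, lam_r1)])
    return (vecs * new_vals) @ vecs.T           # vecs @ diag(new_vals) @ vecs.T

rng = np.random.default_rng(0)
M = rng.normal(size=(200, 8))
H_S = M.T @ M / 200                             # a subsampled-Hessian-like SPD matrix
H = newsamp_hessian(H_S, r=3)

orig = np.linalg.eigvalsh(H_S)[::-1]
new = np.linalg.eigvalsh(H)[::-1]
```

The top-r spectrum is untouched while the tail is flattened, so the condition number of H never exceeds that of H_S.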
Note that the above approach for constructing the approximate Hessian is only for convenience of the convergence analysis; Erdogdu and Montanari (2015) provided a computationally efficient implementation of NewSamp. We now give a novel analysis of the convergence properties of NewSamp (Algorithm 6) based on our framework in Theorem 3 and Theorem 5. Our convergence analysis provides a tighter bound on the sample size than the one of Erdogdu and Montanari (2015). We first show how well the approximate Hessian in Eqn. (17) approximates the Hessian.

Lemma 14 Assume Eqns. (12) and (13) hold, and let 0 < δ < 1 and the target rank r be given. Let λ_{r+1} be the (r+1)-th eigenvalue of ∇²F(x(t)). Set the sample size |S| ≥ 18K log(2d/δ) / λ_{r+1}. Then, with probability at least 1 − δ, the approximate Hessian H(t) constructed in Eqn. (17) satisfies

(1 − ϵ0)H(t) ⪯ ∇²F(x(t)) ⪯ (1 + ϵ0)H(t), with ϵ0 = max{(5λ_{r+1} + µ)/(5λ_{r+1} + 3µ), 1 − 2µ/(5λ_{r+1} + 3µ)}.

Combining the above lemma with Theorem 3 and Theorem 5, we can obtain the local and global convergence properties of Algorithm 6 in the following theorem.

Theorem 15 Let F(x) satisfy the properties described in Theorem 3. Assume Eqns. (12) and (13) hold, and let 0 < δ < 1 and the target rank r be given. Let λ_{r+1} be the (r+1)-th eigenvalue of ∇²F(x(t)). Set the sample size |S| ≥ 18K log(2d/δ) / λ_{r+1}, and define

ϵ0 = max{(5λ_{r+1} + µ)/(5λ_{r+1} + 3µ), 1 − 2µ/(5λ_{r+1} + 3µ)}, (18)

which implies 0 < ϵ0 < 1. Assume the direction vector p(t) satisfies Eqn. (4). Then for t = 1, . . . , T, Algorithm 6 has the following convergence properties:
(a) There exist a sufficiently small value γ and ν = o(1) such that when ∥x(t) − x*∥_M ≤ γ, each iteration satisfies Eqn. (5) with probability at least 1 − δ.

Table 2: Comparison with previous work. We use (Reg) SSN to denote the (regularized) subsampled Newton method. For all algorithms, we set ϵ1 = 0. The notation O(·) hides polynomial factors of log(d/δ). We obtain the iteration complexities of the algorithms from their local convergence rates.
Method    Reference                                           Sample Size     Iterations
SSN       Theorem 5 of Roosta-Khorasani and Mahoney (2019)    O(K²/µ²)        O(log(1/ϵ))
SSN       Theorem 10                                          O(K/µ)          O(log(1/ϵ))
Reg SSN   Theorem 13                                          O(K/ξ)          O((ξ/µ) log(1/ϵ))
NewSamp   Theorem 3.2 of Erdogdu and Montanari (2015)         O(K²/µ²)        O(log(1/ϵ))
NewSamp   Theorem 15                                          O(K/λ_{r+1})    O((λ_{r+1}/µ) log(1/ϵ))

(b) If ∇²F(x(t)) is also Lipschitz continuous and {x(t)} satisfies Eqn. (6), then each iteration satisfies Eqn. (7) with probability at least 1 − δ.
(c) If F(x) is self-concordant, the iteration complexity of NewSamp with backtracking line search (Algorithm 2 with H(t) constructed in Eqn. (17)) is upper bounded by Eqn. (8).

The first term on the right-hand side of Eqn. (18) reveals the relationship between the target rank r and the sample size. We can observe that the sample size is linear in 1/λ_{r+1}. Hence, a small r (which gives a large λ_{r+1}) means that a small sample size is sufficient. Conversely, if we want to sample only a small portion of the f_i's, we should choose a small r. However, Eqn. (18) shows that a small sample size will lead to a poor convergence rate. If we set r = 0, then ϵ0 becomes 1 − 2µ/(5λ₁ + 3µ); consequently, the convergence rate of NewSamp is almost the same as that of gradient descent. It is worth pointing out that Theorem 15 explains the empirical observation that NewSamp is applicable to training SVMs, in which the Lipschitz continuity condition on ∇²F(x) is not satisfied (Erdogdu and Montanari, 2015).

5.4 Comparison with Previous Work

We now compare our results in this section with previous work. Although many variants of subsampled Newton methods have been proposed recently, they share a similar proof procedure. Thus, these algorithms have almost the same sample size and convergence rate. For example, the subsampled Newton method (Roosta-Khorasani and Mahoney, 2019) and NewSamp (Erdogdu and Montanari, 2015) have the same order of sample size and convergence rate (see Table 2).
Thus, we only compare our results with the recent work of Roosta-Khorasani and Mahoney (2019) and with NewSamp (Erdogdu and Montanari, 2015). The detailed comparison is listed in Table 2. First, comparing our analysis of subsampled Newton with that of Roosta-Khorasani and Mahoney (2019), we can see that to achieve the same convergence rate, our result only needs O(K/µ) samples, in contrast to the O(K²/µ²) of Roosta-Khorasani and Mahoney (2019). Hence, our result is substantially tighter than previous work.

Next, we compare our theoretical analysis of NewSamp with that of Erdogdu and Montanari (2015). We can observe that although NewSamp is a kind of regularized subsampled Newton method, their analysis still requires O(K²/µ²) samples, the same as for the plain subsampled Newton method. In contrast, our analysis (Theorem 15) describes how regularization reduces the sample size and the convergence speed. This theory matches the empirical observation that a small r (implying a large λ_{r+1}) will reduce both the sample size and the convergence speed (Erdogdu and Montanari, 2015).

Finally, we compare NewSamp with the regularized subsampled Newton method (Algorithm 5). We mainly focus on the parameter ϵ0 in Theorems 13 and 15, which largely determines the convergence properties of Algorithms 5 and 6. Specifically, if we set ξ = λ_{r+1} in Eqn. (16), then ϵ0 = (3λ_{r+1} + µ)/(3λ_{r+1} + 3µ), which is of the same order as the first term of the right-hand side of Eqn. (18). Hence, we can regard NewSamp as a special case of Algorithm 5. However, NewSamp provides an approach for choosing ξ automatically. Recall that NewSamp includes another parameter, the target rank r; thus, NewSamp and Algorithm 5 have the same number of free parameters. If r is not chosen properly, NewSamp will still have poor performance. Therefore, Algorithm 5 is preferred, because NewSamp needs extra cost to perform the SVD.

Table 3: Description of the datasets.

Dataset     n        d    Source
mushrooms   8,124    112  UCI
a9a         32,561   123  UCI
Covertype   581,012  54   UCI

6.
Empirical Analysis

In this section, we experimentally validate our theoretical results: the unnecessity of the Lipschitz continuity condition on ∇²F(x), the sketching size of the Sketch Newton method, and how regularization affects the sample size and convergence rate of the regularized subsampled Newton method. In the following experiments, we choose x(0) = 0 as the initial point. Furthermore, we use p(t) = [H(t)]⁻¹∇F(x(t)) as the descent direction, which implies ϵ1 = 0. We obtain different convergence rates ϵ0 by choosing different sketching or sample sizes.

6.1 Unnecessity of Lipschitz Continuity of the Hessian

We conduct this experiment on the primal problem of the linear SVM:

min_x F(x) = (1/2)∥x∥² + (C/n) Σ_{i=1}^n ℓ(b_i, ⟨x, a_i⟩),

where C > 0, the (a_i, b_i) denote the training data, x is the separating hyperplane, and ℓ(·, ·) is the loss function. In our experiment, we use the Hinge-2 loss, that is, ℓ(b, ⟨x, a⟩) = max(0, 1 − b⟨x, a⟩)². Let SV(t) denote the set of indices of all the support vectors at iteration t, namely SV(t) = {i : b_i⟨x(t), a_i⟩ < 1}. Then the Hessian matrix of F(x(t)) can be written as

∇²F(x(t)) = I + (2C/n) Σ_{i∈SV(t)} a_i a_iᵀ.

Figure 1: Convergence properties on different datasets: (a) mushrooms, (b) a9a, (c) covtype.

From the above equation, we can see that ∇²F(x) is not Lipschitz continuous, because the support-vector set SV(t) changes discretely with x. Without loss of generality, we use the Subsampled Newton method (Algorithm 4) in our experiment and sample 5% of the support vectors in each iteration. Our experiments are implemented on three datasets, whose detailed description is given in Table 3, and the results are reported in Figure 1. From Figure 1, we see that Subsampled Newton converges linearly while the Newton method converges superlinearly. This matches our theory that the Lipschitz continuity of ∇²F(x) is not necessary to achieve a linear or superlinear convergence rate.
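The Hessian computation used in this experiment can be sketched as follows. The C/n scaling is our reading of the extracted objective; the point being illustrated is that ∇²F depends on x only through the support-vector set, so it is piecewise constant in x and therefore not Lipschitz continuous.

```python
import numpy as np

def hinge2_hessian(x, A, b, C):
    # F(x) = 0.5*||x||^2 + (C/n) * sum_i max(0, 1 - b_i * <x, a_i>)^2
    # (the C/n scaling is our assumption about the extracted objective).
    # SV(t) = {i : b_i * <x, a_i> < 1}; outside SV(t) the loss term vanishes.
    n = A.shape[0]
    sv = b * (A @ x) < 1                        # support-vector indicator
    A_sv = A[sv]
    return np.eye(A.shape[1]) + (2 * C / n) * A_sv.T @ A_sv

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
b = np.sign(rng.normal(size=100))
H = hinge2_hessian(np.zeros(5), A, b, C=1.0)
```

The identity term keeps every such Hessian positive definite, so the subsampled Newton step remains well defined even though SV(t) jumps between iterations.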
6.2 Sketching Size of Sketch Newton

We now validate our theoretical result that the sketching size is independent of the condition number of the Hessian matrix in Sketch Newton. To control the condition number of the Hessian conveniently, we conduct the experiment on the least squares regression problem:

min_x (1/2)∥Ax − b∥². (19)

In each iteration, the Hessian matrix is AᵀA. In our experiment, A is a 10000 × 54 matrix of the form A = UΣVᵀ, where U and V are orthonormal matrices and Σ is a diagonal matrix whose diagonal entries are the singular values of A. The vector b is an n-dimensional Gaussian vector. We set the singular values σ_i of A as σ_i = 1.2^{−i} or σ_i = 1.1^{−i}; the condition numbers of A are then κ(A) = 1.2⁵⁴ ≈ 1.8741 × 10⁴ and κ(A) = 1.1⁵⁴ ≈ 172, respectively. We use different sketching matrices in Sketch Newton (Algorithm 3) and control the value of ϵ0 by choosing different sketching sizes ℓ. We report our empirical results in Figure 2.

Figure 2 shows that Sketch Newton performs well when the sketch size ℓ is several times d, for all the different sketching matrices; moreover, the corresponding algorithms converge linearly. Furthermore, we see that the convergence rate of Sketch Newton is independent of the condition number of the function, depending instead on the sketching size and the sketching matrix. This matches our theory that the sketching size needed to achieve a linear convergence rate is independent of the condition number of the Hessian matrix.

Figure 2: Convergence properties of different sketching sizes: (a, d) Gaussian sketching, (b, e) sparse sketching, (c, f) leverage score sampling. In the top row, the condition number of the Hessian is κ = κ²(A) = 172².
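The Sketch Newton iteration used in this experiment is easy to reproduce in NumPy with a Gaussian sketch. The problem sizes and sketch dimension below are illustrative stand-ins, not the paper's exact settings; the sketched Hessian (SA)ᵀ(SA) replaces AᵀA while the gradient stays exact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

x = np.zeros(d)
ell = 30 * d                                       # sketch size: a modest multiple of d
for _ in range(30):
    S = rng.normal(size=(ell, n)) / np.sqrt(ell)   # fresh Gaussian sketching matrix
    SA = S @ A
    H = SA.T @ SA                                  # sketched Hessian, approximates A^T A
    g = A.T @ (A @ x - b)                          # exact gradient of 0.5*||Ax - b||^2
    x = x - np.linalg.solve(H, g)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)     # reference least-squares solution
```

Each iteration factors an ℓ × d matrix instead of the full n × d one, which is the source of the per-iteration savings when n ≫ d.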
In the bottom row, the condition number of the Hessian is κ = κ²(A) = (1.8741 × 10⁴)² ≈ 3.5 × 10⁸.

6.3 Sample Size of Regularized Subsampled Newton

We also consider the least squares regression problem defined in Eqn. (19) to validate the theory that adding a regularizer is an effective approach to reducing the sample size while keeping convergence in Subsampled Newton. Let A ∈ R^{n×d} be a random matrix whose entries are drawn uniformly from [0, 1], with n = 6,000 and d = 5,000. In this case, Sketch Newton cannot be used because n and d are close to each other. Thus, we conduct experiments with the regularized subsampled Newton method (Algorithm 5). By Theorem 13, we know that ϵ0 and the sample size |S| depend on the value of the regularizer ξ. In our experiment, we study how |S| and ξ affect the convergence rate ϵ0. We set different sample sizes |S| and, for each |S|, different regularization parameters ξ. We report our results in Figure 3.

Figure 3: Convergence properties of Regularized Subsampled Newton: (a) sample size |S| = 50, (b) sample size |S| = 150, (c) sample size |S| = 300.

As we can see, if the sample size |S| is small, we have to choose a large ξ in Algorithm 5; otherwise the algorithm will diverge. However, if the regularization parameter ξ is too large, it leads to a slow convergence rate. Furthermore, increasing the sample size and choosing a proper regularizer improve the convergence rate markedly. When |S| = 300, we can obtain a solution with precision 10⁻⁴ with a proper ξ, in contrast to 10⁻¹·⁵ for |S| = 50. Thus, our empirical study matches the theoretical analysis in Subsection 5.2 very well.

7.
Conclusion

In this paper we have proposed a framework to analyze both the local and global convergence properties of second-order methods, including stochastic and deterministic versions. This framework reveals some important convergence properties of the subsampled Newton method and the sketch Newton method which were previously unknown. Most importantly, our analysis lays a theoretical foundation for several important stochastic second-order methods. We believe that this framework may also provide useful insights for developing new subsampled Newton-type algorithms, and we would like to address this direction in the future.

Acknowledgments

We thank the anonymous reviewers for their helpful suggestions. Zhihua Zhang has been supported by the National Natural Science Foundation of China (No. 11771002), Beijing Natural Science Foundation (Z190001), and Beijing Academy of Artificial Intelligence (BAAI).

Appendix A. Some Important Lemmas

In this section, we give several important lemmas which will be used in the proofs of the theorems in this paper.

Lemma 16 If A and B are d × d symmetric positive definite matrices satisfying (1 − ϵ0)B ⪯ A ⪯ (1 + ϵ0)B with 0 < ϵ0 < 1, then we have

(1 − ϵ0) I ⪯ A^{1/2} B⁻¹ A^{1/2} ⪯ (1 + ϵ0) I,

where I is the identity matrix.

Proof Because A ⪯ (1 + ϵ0)B, we have zᵀ[(1 + ϵ0)B − A]z ≥ 0 for any nonzero z ∈ R^d. This implies zᵀAz / zᵀBz ≤ 1 + ϵ0 for any z ≠ 0. Subsequently,

λmax(B⁻¹A) = λmax(B^{−1/2}AB^{−1/2}) = max_{u≠0} uᵀB^{−1/2}AB^{−1/2}u / ∥u∥² = max_{z≠0} zᵀAz / zᵀBz ≤ 1 + ϵ0,

where the last equality is obtained by setting z = B^{−1/2}u. Similarly, we have λmin(B⁻¹A) ≥ 1 − ϵ0. Since B⁻¹A and A^{1/2}B⁻¹A^{1/2} are similar, the eigenvalues of A^{1/2}B⁻¹A^{1/2} all lie between 1 − ϵ0 and 1 + ϵ0.

Lemma 17 (Matrix Chernoff Inequality, Tropp et al. (2015)) Let X₁, X₂, . . . , X_k be independent, random, symmetric, real d × d matrices with 0 ⪯ X_i ⪯ L·I, where I is the d × d identity matrix. Let Y = Σ_{i=1}^k X_i, µmin = λmin(E[Y]) and µmax = λmax(E[Y]).
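Lemma 16 is easy to sanity-check numerically: build a pair (A, B) satisfying (1 − ϵ0)B ⪯ A ⪯ (1 + ϵ0)B by construction and verify that the eigenvalues of A^{1/2}B⁻¹A^{1/2} land in [1 − ϵ0, 1 + ϵ0]. The construction A = B^{1/2}DB^{1/2} with a diagonal D is our own device for generating such pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps0 = 6, 0.2

def spd_sqrt(X):
    # Symmetric square root of an SPD matrix via its eigendecomposition.
    w, V = np.linalg.eigh(X)
    return (V * np.sqrt(w)) @ V.T

M = rng.normal(size=(d, d))
B = M @ M.T + np.eye(d)                 # a generic SPD matrix
B_half = spd_sqrt(B)

# A = B^{1/2} D B^{1/2} with diag(D) in [1-eps0, 1+eps0] guarantees
# (1 - eps0) B <= A <= (1 + eps0) B in the semidefinite order.
D = np.diag(rng.uniform(1 - eps0, 1 + eps0, size=d))
A = B_half @ D @ B_half

A_half = spd_sqrt(A)
eigs = np.linalg.eigvalsh(A_half @ np.linalg.inv(B) @ A_half)
```

Here A^{1/2}B⁻¹A^{1/2} is similar to B^{−1/2}AB^{−1/2} = D, so its eigenvalues are exactly the entries of D, all inside [1 − ϵ0, 1 + ϵ0] as the lemma asserts.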
Then, we have

P(λmin(Y) ≤ (1 − ϵ)µmin) ≤ d · e^{−ϵ²µmin/(2L)}, and P(λmax(Y) ≥ (1 + ϵ)µmax) ≤ d · e^{−ϵ²µmax/(3L)}.

Appendix B. Proof of Theorem 3

The proof of Theorem 3 consists of the following lemmas. First, by Lemma 18, we upper bound ∥x(t+1) − x*∥_M by three terms. The first term dominates the convergence behavior; the second term depicts how the approximate descent direction affects convergence; the third term is a higher-order term. In Lemma 19, we prove that the first term on the right-hand side of Eqn. (20) is upper bounded by ϵ0∥x(t) − x*∥ plus a higher-order term. Lemma 20 shows that the second term affects the convergence rate by at most ϵ1. In Lemma 21, we complete the convergence analysis when the Hessian is continuous near the optimal point but not Lipschitz continuous. If the Hessian is L̂-Lipschitz continuous, Lemma 22 provides the detailed convergence analysis.

Lemma 18 Letting the sequence {x(t)} be updated as in Algorithm 1, it satisfies

∥x(t+1) − x*∥_M ≤ ∥I − M^{1/2}[H(t)]⁻¹M^{1/2}∥ ∥x(t) − x*∥_M + ∥[H(t)]⁻¹∇F(x(t)) − p(t)∥_M + ∥M^{1/2}[H(t)]⁻¹(∇F(x(t)) − ∇²F(x*)(x(t) − x*))∥, (20)

where M = ∇²F(x*).

Proof By the update procedure of x(t), we have

x(t+1) − x* = x(t) − x* − p(t)
= x(t) − x* − [H(t)]⁻¹∇F(x(t)) + [H(t)]⁻¹∇F(x(t)) − p(t)
= x(t) − x* + ([H(t)]⁻¹∇F(x(t)) − p(t)) − [H(t)]⁻¹(∇²F(x*)(x(t) − x*) + (∇F(x(t)) − ∇²F(x*)(x(t) − x*))).

Denoting M = ∇²F(x*) and multiplying both sides of the above equality by M^{1/2}, we obtain

M^{1/2}(x(t+1) − x*) = (I − M^{1/2}[H(t)]⁻¹M^{1/2}) M^{1/2}(x(t) − x*) + M^{1/2}([H(t)]⁻¹∇F(x(t)) − p(t)) − M^{1/2}[H(t)]⁻¹(∇F(x(t)) − ∇²F(x*)(x(t) − x*)).

Thus, we obtain

∥x(t+1) − x*∥_M ≤ ∥I − M^{1/2}[H(t)]⁻¹M^{1/2}∥ ∥x(t) − x*∥_M + ∥[H(t)]⁻¹∇F(x(t)) − p(t)∥_M + ∥M^{1/2}[H(t)]⁻¹(∇F(x(t)) − ∇²F(x*)(x(t) − x*))∥.

Lemma 19 Assume that the objective function F(x) satisfies Assumptions 1 and 2. Let M denote ∇²F(x*), and let the approximate Hessian H(t) satisfy Condition (2). Then, if Δ defined as

Δ = ∥∇²F(x*) − ∇²F(x(t))∥ (21)

is sufficiently small, we have

∥I − M^{1/2}[H(t)]⁻¹M^{1/2}∥ ≤ ϵ0 + 2κµ⁻¹Δ / (1 − κµ⁻¹Δ).
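The decomposition in Lemma 18 can be checked directly on a quadratic objective, for which the third term of Eqn. (20) vanishes exactly (∇F(x) = M(x − x*)) and the bound reduces to a triangle inequality. All matrices below are random illustrative instances of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
R = rng.normal(size=(d, d))
M = R @ R.T + np.eye(d)                    # Hessian of a quadratic objective
x_star = rng.normal(size=d)
grad = lambda x: M @ (x - x_star)          # for quadratic F, grad F(x) = M (x - x*)

w, V = np.linalg.eigh(M)
M_half = (V * np.sqrt(w)) @ V.T            # symmetric square root of M
norm_M = lambda v: np.linalg.norm(M_half @ v)

Q = rng.normal(size=(d, d))
H = Q @ Q.T + np.eye(d)                    # an arbitrary SPD approximate Hessian
x = rng.normal(size=d)
p = np.linalg.solve(H, grad(x)) + 0.01 * rng.normal(size=d)   # inexact direction

lhs = norm_M(x - p - x_star)               # ||x^(t+1) - x*||_M
term1 = np.linalg.norm(np.eye(d) - M_half @ np.linalg.inv(H) @ M_half) * norm_M(x - x_star)
term2 = norm_M(np.linalg.solve(H, grad(x)) - p)
term3 = np.linalg.norm(M_half @ np.linalg.inv(H) @ (grad(x) - M @ (x - x_star)))
```

For non-quadratic objectives term3 is the higher-order remainder that Lemmas 21 and 22 control.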
Proof If is sufficient small (which implies that 2F(x ) and 2F(x(t)) are close enough), then we have λmax [ 2F(x )] 1 2 [H(t)] 1[ 2F(x )] 1 2 =1 + ϵ 0 λmin [ 2F(x )] 1 2 [H(t)] 1[ 2F(x )] 1 2 =1 ϵ 0, APPROXIMATE NEWTON METHODS with 0 < ϵ 0 < 1, 0 < ϵ 0 < 1. This will imply that I M1/2[H(t)] 1M1/2 max{ϵ 0, ϵ 0}. Now we begin to bound the value of ϵ 0. First, by the definition of λmax( ) for the positive definite matrix, we have λmax [ 2F(x )] 1 2 [H(t)] 1[ 2F(x )] 1 2 = max x =0 x T M 1 2 [H(t)] 1M 1 2 x x 2 x T [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x x 2 + x T M 1 2 [H(t)] 1M 1 2 [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x 1 + ϵ0 + max x =0 x T M 1 2 [H(t)] 1M 1 2 [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x 1 + ϵ0 + max x =0 x T M 1 2 [H(t)] 1M 1 2 x x 2 + max x =0 x T [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x x 2 , (22) where the first inequality is because of Condition (2) and Lemma 16. Furthermore, by setting y = M 1 2 x, we have max x =0 x T M 1 2 [H(t)] 1M 1 2 x x 2 = max y =0 y T [H(t)] 1y y T M 1y . (23) We also have M 1 [ 2F(x(t))] 1 M 1 [ 2F(x(t))] 1 , (24) which implies that for any u = 0, it holds that M 1 [ 2F(x(t))] 1 u 2 u T M 1 [ 2F(x(t))] 1 u 1 M 1 [ 2F(x(t))] 1 λmin([ 2F(x(t))] 1) u T [ 2F(x(t))] 1u u T M 1u 1 M 1 [ 2F(x(t))] 1 λmin([ 2F(x(t))] 1) [ 2F(x(t))] 1 M 1. Note that it holds that [ 2F(x(t))] 1 λmin([ 2F(x(t))] 1) κ, we can obtain that 1 κ M 1 [ 2F(x(t))] 1 M 1. Combining with Eqn. (23), we have max y =0 y T [H(t)] 1y y T M 1y 1 1 κ M 1 max y =0 y T [H(t)] 1y y T [ 2F(x(t))] 1y YE, LUO, AND ZHANG = 1 1 κµ 1 max x =0 x T [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x x 2 , where the last equality is using x = [ 2F(x(t))] 1 2 y and M 1 µ 1. Combining with Eqn. (22) and (23), we can obtain λmax [ 2F(x )] 1 2 [H(t)] 1[ 2F(x )] 1 2 1 + ϵ0 + 1 1 κµ 1 1 max x =0 x T [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x x 2 1 + ϵ0 + κµ 1 1 κµ 1 (1 + ϵ0), where the last inequality is due to Condition (2) and Lemma 16. 
By ϵ0 < 1, we can obtain that ϵ 0 ϵ0 + 2κµ 1 1 κµ 1 . Now we are going to bound the value of ϵ 0. By the definition of λmin( ), we have λmin [ 2F(x )] 1 2 [H(t)] 1[ 2F(x )] 1 2 = min x =0 x T M 1 2 [H(t)] 1M 1 2 x x 2 x T [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x x 2 + x T M 1 2 [H(t)] 1M 1 2 [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x 1 ϵ0 + min x =0 x T M 1 2 [H(t)] 1M 1 2 [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x 1 ϵ0 + min x =0 x T M 1 2 [H(t)] 1M 1 2 x x 2 + min x =0 x T [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x x 2 , (25) where the first inequality is because of Condition (2). By Eqn. (24), we have u T M 1 [ 2F(x(t))] 1 u M 1 [ 2F(x(t))] 1 u 2 u T M 1u 1 + M 1 [ 2F(x(t))] 1 λmin([ 2F(x(t))] 1) u T [ 2F(x(t))] 1u M 1 1 + κ M 1 [ 2F(x(t))] 1. Thus, we can obtain that min x =0 x T M 1 2 [H(t)] 1M 1 2 x x 2 = min y =0 y T [H(t)] 1y APPROXIMATE NEWTON METHODS 1 1 + κ M 1 min y =0 y T [H(t)] 1y y T [ 2F(x(t))] 1y 1 1 + κµ 1 min x =0 x T [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x x 2 , where the first equality is because of y = M 1 2 x and the last equality is due to x = [ 2F(x(t))] 1 2 y. Combining with Eqn. (25), we can obtain λmin [ 2F(x )] 1 2 [H(t)] 1[ 2F(x )] 1 2 1 ϵ0 + min x =0 1 1 + κµ 1 1 x T [ 2F(x(t))] 1 2 [H(t)] 1[ 2F(x(t))] 1 2 x x 2 1 ϵ0 2κµ 1 1 + κµ 1 . Thus, we can obtain that ϵ 0 ϵ0 + 2κµ 1 1 + κµ 1 . Therefore, we have I M1/2[H(t)] 1M1/2 max{ϵ 0, ϵ 0} ϵ0 + 2κµ 1 1 κµ 1 . Lemma 20 Let p(t) satisfy Condition (4) and F(x) satisfy Assumption 1 and 2, then we have [H(t)] 1 F(x(t)) p(t) M ϵ1 x(t) x M . (26) Proof First, we have [H(t)] 1 F(x(t)) p(t) M = M1/2[H(t)] 1 F(x(t)) H(t)p(t) (4) ϵ1(1 + ϵ0) 1κ 3/2 M1/2 [H(t)] 1 F(x(t)) (2) ϵ1κ 3/2 M1/2 [ 2F(x(t))] 1 F(x(t)) ϵ1κ 1/2 M1/2 x(t) x ϵ1 x(t) x M , where the last two inequalities follow from the assumptions that F(x) is L-smooth and µ-strongly convex. In following lemma, we will prove the result of Theorem 3 (a). 
YE, LUO, AND ZHANG Lemma 21 There exists a sufficient small value γ, ν = o(1), such that when x(t) x M γ, the sequence {x(t)} of Algorithm 1 satisfies x(t+1) x M ϵ0 + ϵ1 + 2νµ 1 + 2κµ 1ν(1 + νµ 1) Proof Because 2F(x) is continuous around x , then existing a sufficient small value γ such that if x(t) x M γ, then it holds that (Ortega and Rheinboldt, 1970) 2F(x ) 2F(x(t)) ν, (27) and F(x(t)) F(x ) 2F(x )(x(t) x ) M ν x(t) x M . (28) By Lemma 19, we have M1/2[H(t)] 1M1/2 1 + ϵ0 + 2κµ 1 1 κµ 1 (27) 1 + ϵ0 + 2κµ 1ν 1 κµ 1ν . Combining with Lemma 18, 19 and 20, we can obtain that x(t+1) x M ϵ0 + ϵ1 + 2κµ 1ν 1 κµ 1ν + M1/2[H(t)] 1 F(x(t)) 2F(x )(x(t) x ) (28) ϵ0 + ϵ1 + 2κµ 1ν 1 κµ 1ν + ν M 1 M1/2[H(t)] 1M1/2 x(t) x M ϵ0 + ϵ1 + 2νµ 1 + 2κµ 1ν(1 + νµ 1) From above equation, we can observe that if ϵ0 + ϵ1 < 1 and ν is sufficiently small which can be guaranteed by choosing proper γ, then we have x(t+1) x M x(t) x M γ. In following lemma, we will prove the result of Theorem 3 (b). Lemma 22 Let the Hessian of F(x) be ˆL-Lipschitz continuous and x(t) satisfy x(t) x M µ3/2 ˆL 1. Then the sequence {x(t)} of Algorithm 1 satisfies x(t+1) x M (ϵ0 + ϵ1) x(t) x M + 4(κ + 1)µ 3/2 ˆL x(t) x 2 Proof By Taylor s expansion at x , we have F(x(t)) 2F(x )(x(t) x ) = Z 1 0 2F x + s(x(t) x ) 2F(x ) ds (x(t) x ). APPROXIMATE NEWTON METHODS Thus, we can obtain that M1/2[H(t)] 1 F(x(t)) 2F(x )(x(t) x ) = M1/2[H(t)] 1M1/2 Z 1 0 M 1/2 2F x + s(x(t) x ) 2F(x ) M 1/2 ds M1/2(x(t) x ) M1/2[H(t)] 1M1/2 | {z } T1 M 1/2 2F(x + s(x(t) x )) M 1/2 I ds | {z } T2 Next, we will bound the value of T1 and T2. By Lemma 19, we have T1 1 + ϵ0 + 2κµ 1 1 κµ 1 2 + 2κµ 1 ˆL x(t) x 1 κµ 1 ˆL x(t) x 4, where the last inequality is because of the condition x(t) x M 1 2µ 3 2 κ 1 ˆL 1 and x(t) x µ 1/2 x(t) x M. Letting us represent that 2F(x + s(x(t) x )) = M + , then we have M 1/2(M + )M 1/2 I ds M 1/2 M 1/2 ds s(x(t) x ) ds Therefore, we have M1/2[H(t)] 1 F(x(t)) 2F(x )(x(t) x ) T1 T2 x(t) x M 2µ 3/2 ˆL x(t) x M . 
Combining with Lemma 18, 19 and 20, we can obtain that x(t+1) x M ϵ0 + ϵ1 + 2κµ 1 1 κµ 1 YE, LUO, AND ZHANG + 2µ 3/2 ˆL x(t) x M (ϵ0 + ϵ1) x(t) x M + 4κµ 3/2 ˆL x(t) x 2 + 4µ 3/2 ˆL x(t) x 2 = (ϵ0 + ϵ1) x(t) x M + 4(κ + 1)µ 3/2 ˆL x(t) x 3/2 where the last inequality follows from the conditions x(t) x M 1 2µ 3 2 κ 1 ˆL 1 and x(t) x µ 1/2 x(t) x M. Appendix C. Proof of Theorem 5 For a self-concordant function F(x), if two points x, y satisfy x y x < 1, where v x = [ 2F(x)]1/2v , we have some useful inequalities: 1. Hessian bound: (1 x y x)2 2F(y) 2F(x) 1 (1 x y x)2 2F(y). (29) 2. Function value bound: ζ( y x x) F(y) F(x) F(x)T (y x) ζ ( y x x), (30) where ζ(α) = α log(1 + α) and ζ (α) = α log(1 α). This section, we will prove the convergence rate of the damped approximate Newton method. First, we will show the case that V (x) is smaller than a threshold which is mainly determined by how well the Hessian is approximated. In this case, the step size s = 1 will satisfy the exit condition of the line search. Then, we will provide the convergence analysis when V (x) is larger than the threshold where the step size s should be chosen by the line search. Before proving the convergence analysis, we first define some new notation and clarify their relationship. Let us denote V (x(t)) = [ 2F(x(t))] 1/2 F(x(t)) , (31) V (x(t)) = [H(t)] 1/2 F(x(t)) , (32) and ˆV (x(t)) = T F(x(t))p(t) 1/2 . (33) Lemma 23 Let the approximate Hessian satisfy Eqn. (2) and the descent direction p(t) satisfy Eqn. (4). Then it holds that 1 ϵ1κ 1 1 + ϵ0 APPROXIMATE NEWTON METHODS x(t) (1 + ϵ0) 1 + ϵ1κ 1 1 + ϵ0 1/2!2 V 2(x(t)). Proof First, we have T F(x(t))p(t) = T F(x(t))[H(t)] 1 F(x(t)) + T F(x(t))[H(t)] 1 [H(t)]p(t) F(x(t)) (4) T F(x(t))[H(t)] 1 F(x(t)) T F(x(t))[H(t)] 1 F(x(t)) 1/2 H 1/2 κ 3/2ϵ1 F(x(t)) T F(x(t))[H(t)] 1 F(x(t)) κ 3/2ϵ1 T F(x(t))[H(t)] 1 F(x(t)) H 1/2 H1/2 1 ϵ1κ 1 1 + ϵ0 Similarly, we can obtain that T F(x(t))p(t) 1 + ϵ1κ 1 1 + ϵ0 V 2(x(t)). 
(34) By the condition (2), we can obtain that p(t) 2 x(t) (1 + ϵ0)[p(t)]T [H(t)]p(t). (35) Furthermore, we have [p(t)]T [H(t)]p(t) =[p(t)]T F(x(t)) + [H(t)]p(t) F(x(t)) [p(t)]T F(x(t)) + p(t) [H(t)]p(t) F(x(t)) (4) [p(t)]T F(x(t)) + ϵ1κ 3/2 p(t) F(x(t)) . Furthermore, we have p(t) p(t) [H(t)] 1 F(x(t)) + [H(t)] 1 F(x(t)) (4) ϵ1κ 3/2 [H(t)] 1 F(x(t)) + [H(t)] 1/2 [H(t)] 1/2 F(x(t)) ϵ1κ 3/2 [H(t)] 1 [H(t)]1/2 + [H(t)] 1/2 [H(t)] 1/2 F(x(t)) 1 + ϵ1κ 1 1 + ϵ0 1/2! [H(t)] 1/2 [H(t)] 1/2 F(x(t)) , and F(x(t)) [H(t)]1/2 [H(t)] 1/2 F(x(t)) . YE, LUO, AND ZHANG Thus, we can obtain that p(t) F(x(t)) 1 + ϵ1κ 1 1 + ϵ0 1/2! [H(t)] 1/2 [H(t)]1/2 [H(t)] 1/2 F(x(t)) 2 κ1/2 1 + ϵ0 1 + ϵ1κ 1 1 + ϵ0 1/2! V 2(x(t)). Therefore, we can obtain that [p(t)]T [H(t)]p(t) [p(t)]T F(x(t)) + ϵ1κ 3/2 p(t) F(x(t)) [p(t)]T F(x(t)) + ϵ1κ 1 1 + ϵ0 1 + ϵ1κ 1 1 + ϵ0 1/2! V 2(x(t)) 1 + ϵ1κ 1 1 + ϵ0 1/2!2 V 2(x(t)). Combining Eqn. (35), we can obtain x(t) (1 + ϵ0) 1 + ϵ1κ 1 1 + ϵ0 1/2!2 V 2(x(t)). Now, we begin to prove the case that V (x(t)) 1 ϵ0 2ϵ1κ 1 12 and the step size s = 1 is sufficient. Lemma 24 Let the descent direction p(t) satisfy Eqn. (4) and V (x(t)) satisfy V (x(t)) 1 ϵ0 2ϵ1κ 1 Then the approximate Newton with backtrack line search (Algorithm 2) has the following convergence property V (x(t+1)) 1 + ϵ0 + 2ϵ1κ 1 2 V (x(t)). Proof Then we have V (x(t+1)) = [ 2F(x(t+1))] 1/2 F(x(t+1)) (29) 1 1 p(t) x [ 2F(x(t))] 1/2 F(x(t+1)) . By Taylor s expansion of F(x(t+1)) at point x(t), we have [ 2F(x(t))] 1/2 F(x(t+1)) = [ 2F(x(t))] 1/2 F(x(t)) + 2F(x(t))( p(t)) + Z 1 0 [ 2F(x(t) + sp(t)) 2F(x(t))]( p(t))ds APPROXIMATE NEWTON METHODS I [ 2F(x(t))]1/2[H(t)] 1[ 2F(x(t))]1/2 [ 2F(x(t))] 1/2 F(x(t)) | {z } T1 + [ 2F(x(t))]1/2[H(t)] 1[ 2F(x(t))]1/2 [ 2F(x(t))] 1/2 F(x(t)) H(t)p(t) | {z } T2 [ 2F(x(t))] 1/2 2F(x(t) sp(t))[ 2F(x(t))] 1/2 I ds [ 2F(x(t))]1/2p(t) | {z } T3 We are going to bound the above terms. First, by the assumption (2), we have I [ 2F(x(t))]1/2[H(t)] 1[ 2F(x(t))]1/2 ϵ0. 
Combining the definition of V (x), we can obtain T1 I [ 2F(x(t))]1/2[H(t)] 1[ 2F(x(t))]1/2 [ 2F(x(t))] 1/2 F(x(t)) ϵ0V (x(t)). Also by the condition (2), we have [ 2F(x(t))]1/2[H(t)] 1[ 2F(x(t))]1/2 (1 + ϵ0). Combining the condition (4) and the definition of V (t), we can obtain that T2 (1 + ϵ0)µ 1/2 ϵ1 F(x(t)) (1 + ϵ0)ϵ1 κ V (x(t)) 2ϵ1 κ V (x(t)). We also have [ 2F(x(t))] 1/2 2F(x(t) sp(t))[ 2F(x(t))] 1/2 I ds p(t) x 1 (1 s p(t) x)2 1 p(t) x 1 p(t) x p(t) x . Next, we will bound the value of p(t) x. We have p(t) x = [ 2F(x(t))]1/2p(t) = [ 2F(x(t))]1/2[H(t)] 1 F(x(t)) [ 2F(x(t))]1/2[H(t)] 1 F(x(t)) H(t)p(t) [ 2F(x(t))]1/2[H(t)] 1[ 2F(x(t))]1/2 [ 2F(x(t))] 1/2 F(x(t)) + [ 2F(x(t))]1/2[H(t)] 1[ 2F(x(t))]1/2 [ 2F(x(t))] 1/2 F(x(t)) H(t)p(t) YE, LUO, AND ZHANG = [ 2F(x(t))]1/2[H(t)] 1[ 2F(x(t))]1/2 [ 2F(x(t))] 1/2 F(x(t)) + T2 (1 + ϵ0)V (x(t)) + 2ϵ1 κ V (x(t)). Combining above results, we can obtain that V (x(t+1)) 1 1 p(t) x (T1 + T2 + T3) (ϵ0 + 2ϵ1κ 1)V (x(t)) 1 (1 + ϵ0 + 2ϵ1κ 1)V (x(t)) + (1 + ϵ0 + 2ϵ1κ 1)2V 2(x(t)) (1 (1 + ϵ0 + 2ϵ1κ 1)V (x(t)))2 . If V (x(t)) satisfies that V (x(t)) 1 (ϵ0 + 2ϵ1κ 1)2 (1 + ϵ0 + 2ϵ1κ 1)2 2 + ϵ0 + 2ϵ1κ 1 + p (2 + ϵ0 + 2ϵ1κ 1)2 1 + (ϵ0 + 2ϵ1κ 1)2 1 ϵ0 2ϵ1κ 1 V (x(t+1)) 1 + ϵ0 + 2ϵ1κ 1 2 V (x(t)). Now we begin to analyze the phase that line search should be applied to find a step size s < 1. This phase is commonly commonly referred as damped phase. Lemma 25 Let the approximate Hessian satisfy Eqn. (2) and the descent direction p(t) satisfy Eqn. (4). If it holds that (1 ϵ0)(1 ϵ0 2ϵ1κ 1) then Algorithm 2 has the following convergence property F(x(t+1)) F(x(t)) αβ ρ2 V 2(x(t)) 1 + ρ V (x(t)) , where ρ is defined as ρ = (1 ϕ)1/2 (1 + ϵ0)1/2(1 + ϕ), with ϕ = ϵ1κ 1 1 + ϵ0 Proof By the update rule, we can obtain that F(x(t+1)) (30) F(x(t)) s F(x(t))T p(t) + ζ s p(t) x(t) =F(x(t)) s ˆV 2(x(t)) s p(t) x(t) log 1 s p(t) x(t)) , APPROXIMATE NEWTON METHODS with 0 s < 1/ V (x(t)). 
Let us define $\hat s$ as

$$\hat s = \frac{\hat V^2(x^{(t)})}{\big(\hat V^2(x^{(t)}) + \|p^{(t)}\|_{x^{(t)}}\big)\,\|p^{(t)}\|_{x^{(t)}}}.$$

We can use this bound to show that the backtracking line search always results in a step size $s \ge \beta\hat s$. Furthermore, plugging $s = \hat s$ into the bound above, we can obtain that

$$F(x^{(t+1)}) \le F(x^{(t)}) - \frac{\hat V^2(x^{(t)})}{\|p^{(t)}\|_{x^{(t)}}} - \log\frac{\|p^{(t)}\|_{x^{(t)}}}{\hat V^2(x^{(t)}) + \|p^{(t)}\|_{x^{(t)}}} = F(x^{(t)}) - \frac{\hat V^2(x^{(t)})}{\|p^{(t)}\|_{x^{(t)}}} + \log\Big(1 + \frac{\hat V^2(x^{(t)})}{\|p^{(t)}\|_{x^{(t)}}}\Big) \le F(x^{(t)}) - \frac{\big(\hat V^2(x^{(t)})/\|p^{(t)}\|_{x^{(t)}}\big)^2}{2\big(1 + \hat V^2(x^{(t)})/\|p^{(t)}\|_{x^{(t)}}\big)} = F(x^{(t)}) - \frac{1}{2}\,\hat s\,\hat V^2(x^{(t)}) \le F(x^{(t)}) - \alpha\,\hat s\,\hat V^2(x^{(t)}),$$

where the second inequality follows from the fact that for $a > 0$ it holds that $-a + \log(1+a) + \frac{a^2}{2(1+a)} \le 0$. The last inequality is because $\alpha < 1/2$. Since we obtain that $F(x^{(t+1)}) \le F(x^{(t)}) - \alpha\hat s\,\hat V^2(x^{(t)})$, the exit condition of the line search is satisfied at the step size $\hat s$. Furthermore, the exit condition holds whenever the step size satisfies $s \ge \beta\hat s$. Thus, we can obtain that

$$F(x^{(t+1)}) \le F(x^{(t)}) - \alpha\beta\,\hat s\,\hat V^2(x^{(t)}).$$

Next, we will bound the value of $\hat s\,\hat V^2(x^{(t)})$. By the definition of $\hat s$, we can obtain that

$$\hat s\,\hat V^2(x^{(t)}) = \frac{\big(\hat V^2(x^{(t)})/\|p^{(t)}\|_{x^{(t)}}\big)^2}{1 + \hat V^2(x^{(t)})/\|p^{(t)}\|_{x^{(t)}}}.$$

By Lemma 23, we have

$$\frac{\hat V(x^{(t)})}{\|p^{(t)}\|_{x^{(t)}}} \ge \frac{(1-\phi)^{1/2}\,V(x^{(t)})}{(1+\epsilon_0)^{1/2}(1+\phi)\,V(x^{(t)})} = \frac{(1-\phi)^{1/2}}{(1+\epsilon_0)^{1/2}(1+\phi)},$$

where $\phi = \epsilon_1\kappa^{-1}\big(\frac{1+\epsilon_0}{1-\epsilon_0}\big)^{1/2}$. Furthermore, we have

$$\hat s\,\hat V^2(x^{(t)}) = \frac{\big(\hat V^2(x^{(t)})/\|p^{(t)}\|_{x^{(t)}}\big)^2}{1 + \hat V^2(x^{(t)})/\|p^{(t)}\|_{x^{(t)}}} \ge \frac{\Big(\frac{(1-\phi)^{1/2}}{(1+\epsilon_0)^{1/2}(1+\phi)}\,\hat V(x^{(t)})\Big)^2}{1 + \frac{(1-\phi)^{1/2}}{(1+\epsilon_0)^{1/2}(1+\phi)}\,\hat V(x^{(t)})} \ge \frac{\Big(\frac{1-\phi}{(1+\epsilon_0)^{1/2}(1+\phi)}\,V(x^{(t)})\Big)^2}{1 + \frac{1-\phi}{(1+\epsilon_0)^{1/2}(1+\phi)}\,V(x^{(t)})},$$

where the last inequality follows from Lemma 23. Letting us denote $\rho = \frac{(1-\phi)^{1/2}}{(1+\epsilon_0)^{1/2}(1+\phi)}$, we then have

$$F(x^{(t+1)}) \le F(x^{(t)}) - \alpha\beta\,\hat s\,\hat V^2(x^{(t)}) \le F(x^{(t)}) - \alpha\beta\,\frac{\rho^2\,V^2(x^{(t)})}{1+\rho\,V(x^{(t)})}.$$

By the Condition (2), we have

$$\frac{1}{1-\epsilon_0}\,\hat V^2(x^{(t)}) \ge V^2(x^{(t)}). \tag{37}$$

Thus, we can obtain that if $V(x) \ge \frac{(1-\epsilon_0)^{1/2}(1-\epsilon_0-2\epsilon_1\kappa^{-1})}{12}$, then it holds that $\hat V(x) \ge \frac{(1-\epsilon_0)(1-\epsilon_0-2\epsilon_1\kappa^{-1})}{12}$. Therefore, we can obtain that when $V(x) \ge \frac{(1-\epsilon_0)^{1/2}(1-\epsilon_0-2\epsilon_1\kappa^{-1})}{12}$, it holds that

$$F(x^{(t+1)}) \le F(x^{(t)}) - \alpha\beta\,\frac{\rho^2\,V^2(x^{(t)})}{1+\rho\,V(x^{(t)})}.$$

Combining Lemma 24 and 25, we can obtain the global convergence rate of approximate Newton with backtracking line search.
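As a concrete illustration of the backtracking scheme analyzed in Lemmas 24 and 25, the following is a minimal sketch of Armijo backtracking: shrink the step by a factor beta until the exit condition F(x + sp) <= F(x) + alpha * s * grad F(x)^T p holds. The function names, the quadratic test objective, and the parameter values alpha = 0.25, beta = 0.5 are our own illustration, not the paper's Algorithm 2; the descent step is written as x + sp with p = -H^{-1} grad F(x).

```python
import numpy as np

def backtracking_line_search(F, grad, x, p, alpha=0.25, beta=0.5):
    """Armijo backtracking: shrink s by beta until the sufficient-decrease
    (exit) condition F(x + s*p) <= F(x) + alpha * s * grad(x)^T p holds."""
    s = 1.0
    g_dot_p = grad(x) @ p          # directional derivative along p (negative for descent)
    while F(x + s * p) > F(x) + alpha * s * g_dot_p:
        s *= beta
    return s

# Toy strongly convex quadratic F(x) = 0.5 x^T A x (our own example).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
F = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
p = -np.linalg.solve(A, grad(x))   # exact Newton direction
s = backtracking_line_search(F, grad, x, p)
```

On this quadratic with the exact Newton direction, the exit condition already holds at s = 1, corresponding to the full-step phase of Lemma 24; with a badly scaled direction the loop shrinks s below one, corresponding to the damped phase of Lemma 25.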
Proof of Theorem 5 Let us denote

$$\eta = \alpha\beta\,\frac{\rho^2\big((1-\epsilon_0)^{1/2}(1-\epsilon_0-2\epsilon_1\kappa^{-1})/12\big)^2}{1+\rho\,(1-\epsilon_0)^{1/2}(1-\epsilon_0-2\epsilon_1\kappa^{-1})/12} = \alpha\beta\,\frac{(1-\epsilon_0)\,\rho^2\,(1-\epsilon_0-2\epsilon_1\kappa^{-1})^2}{144 + 12\rho\sqrt{1-\epsilon_0}\,(1-\epsilon_0-2\epsilon_1\kappa^{-1})}.$$

By Lemma 25, we can obtain that it takes at most

$$\frac{F(x^{(0)}) - F(x^*)}{\eta}$$

steps in the damped phase, because $F(x^{(t)}) - F(x^{(t+1)}) \ge \eta$ when $V(x) \ge \frac{(1-\epsilon_0)^{1/2}(1-\epsilon_0-2\epsilon_1\kappa^{-1})}{12}$.

If it holds that $V(x) \le \frac{(1-\epsilon_0)^{1/2}(1-\epsilon_0-2\epsilon_1\kappa^{-1})}{12}$, then we have $V(x^{(t)}) \le \frac{1-\epsilon_0-2\epsilon_1\kappa^{-1}}{12}$. By Lemma 24, we have

$$V(x^{(t+k)}) \le \Big(\frac{1+\epsilon_0+2\epsilon_1\kappa^{-1}}{2}\Big)^k\,\frac{1-\epsilon_0-2\epsilon_1\kappa^{-1}}{12}.$$

Furthermore, the self-concordance of $F(x)$ implies that

$$F(x^{(t+k)}) - F(x^*) \le V^2(x^{(t+k)}) \le \Big(\frac{1+\epsilon_0+2\epsilon_1\kappa^{-1}}{2}\Big)^k\,\frac{1-\epsilon_0-2\epsilon_1\kappa^{-1}}{12},$$

since the right-hand factors are all smaller than one. To make the right-hand side of the above equation less than $\epsilon$, it will take no more than

$$k = \frac{2}{1-\epsilon_0-2\epsilon_1\kappa^{-1}}\,\log\frac{1-\epsilon_0-2\epsilon_1\kappa^{-1}}{12\epsilon}$$

iterations. Therefore, the total complexity of the approximate Newton method with backtracking line search to achieve an $\epsilon$-suboptimality is at most

$$\frac{F(x^{(0)}) - F(x^*)}{\eta} + \frac{2}{1-\epsilon_0-2\epsilon_1\kappa^{-1}}\,\log\frac{1-\epsilon_0-2\epsilon_1\kappa^{-1}}{12\epsilon}.$$

Appendix D. Proofs of Section 4

Proof of Theorem 7 If $S$ is an $\epsilon_0$-subspace embedding matrix w.r.t. $B(x^{(t)})$, then we have

$$(1-\epsilon_0)\,\nabla^2 F(x^{(t)}) \preceq [B(x^{(t)})]^T S^T S B(x^{(t)}) \preceq (1+\epsilon_0)\,\nabla^2 F(x^{(t)}). \tag{38}$$

By a simple transformation and omitting $\epsilon_0^2$ terms, Eqn. (38) can be transformed into

$$(1-\epsilon_0)\,[B(x^{(t)})]^T S^T S B(x^{(t)}) \preceq \nabla^2 F(x^{(t)}) \preceq (1+\epsilon_0)\,[B(x^{(t)})]^T S^T S B(x^{(t)}).$$

The convergence rate can then be derived directly from Theorems 3 and 5.

Proof of Corollary 8 If $\nabla^2 F(x)$ is not Lipschitz continuous, then we have

$$\limsup_{t\to\infty} \frac{\|x^{(t+1)}-x^*\|_M}{\|x^{(t)}-x^*\|_M} \le \limsup_{t\to\infty}\Big(\epsilon_0(t) + 2\nu(t)\mu^{-1} + 2\kappa\mu^{-1}\nu(t)\big(1+\nu(t)\mu^{-1}\big)\Big) = \limsup_{t\to\infty}\Big(\frac{1}{\log(1+t)} + 2\nu(t)\mu^{-1} + 2\kappa\mu^{-1}\nu(t)\big(1+\nu(t)\mu^{-1}\big)\Big) = 0,$$

where $\nu(t)\to 0$ because $\|\nabla^2 F(x^{(t)}) - \nabla^2 F(x^*)\| \to 0$ as $x^{(t)}$ approaches $x^*$. If $\nabla^2 F(x)$ is Lipschitz continuous, then we have

$$\limsup_{t\to\infty} \frac{\|x^{(t+1)}-x^*\|_M}{\|x^{(t)}-x^*\|_M} \le \limsup_{t\to\infty}\Big(\epsilon_0(t) + 4(\kappa+1)\mu^{-3/2}\hat L\,\|x^{(t)}-x^*\|_M\Big) = \limsup_{t\to\infty}\Big(\frac{1}{\log(1+t)} + 4(\kappa+1)\mu^{-3/2}\hat L\,\|x^{(t)}-x^*\|_M\Big) = 0.$$

Appendix E. Proofs of Theorems of Section 5

Proof of Lemma 9 Let us denote

$$X_i = [\nabla^2 F(x^{(t)})]^{-1/2}\,\nabla^2 f_i(x)\,[\nabla^2 F(x^{(t)})]^{-1/2}, \quad\text{and}\quad Y = \sum_{i\in S} X_i.$$

Because the $\nabla^2 f_i(x)$ are chosen uniformly, we have $\mathbb{E}[Y] = \sum_{i\in S}\mathbb{E}[X_i] = |S|\,I$.
Furthermore, by the Conditions (12) and (13), we can obtain that $\|X_i\| \le K/\mu$ and

$$\lambda_{\max}(\mathbb{E}[Y]) = \lambda_{\min}(\mathbb{E}[Y]) = |S|. \tag{39}$$

By Lemma 17, we have

$$P\big(\lambda_{\min}(Y) \le (1-\epsilon_0)|S|\big) \le d\,\exp\Big(-\frac{\epsilon_0^2\,|S|}{2K/\mu}\Big).$$

Letting us choose $|S| = \frac{(2K/\mu)\log(d/\delta)}{\epsilon_0^2}$, it holds with probability at least $1-\delta$ that $\lambda_{\min}(Y) \ge (1-\epsilon_0)|S|$, which implies

$$\min_{x\in\mathbb{R}^d} \frac{x^T\,[\nabla^2 F(x^{(t)})]^{-1/2}\big(\sum_{i\in S}\nabla^2 f_i(x)\big)[\nabla^2 F(x^{(t)})]^{-1/2}\,x}{\|x\|^2} \ge (1-\epsilon_0)|S|,$$

that is,

$$\frac{1}{|S|}\sum_{i\in S}\nabla^2 f_i(x) \succeq (1-\epsilon_0)\,\nabla^2 F(x^{(t)}).$$

By a simple transformation and omitting $\epsilon_0^2$ terms, the above equation can be represented as

$$\nabla^2 F(x^{(t)}) \preceq (1+\epsilon_0)\,H^{(t)}. \tag{40}$$

Also by Lemma 17, we have

$$P\big(\lambda_{\max}(Y) \ge (1+\epsilon_0)|S|\big) \le d\,\exp\Big(-\frac{\epsilon_0^2\,|S|}{3K/\mu}\Big).$$

By a proof similar to the above, we can obtain that if we choose $|S| = \frac{(3K/\mu)\log(d/\delta)}{\epsilon_0^2}$, it holds with probability at least $1-\delta$ that

$$(1-\epsilon_0)\,H^{(t)} \preceq \nabla^2 F(x^{(t)}).$$

Combining with Eqn. (40) and by the union bound of probability, we can obtain that if we choose $|S| = \frac{(3K/\mu)\log(2d/\delta)}{\epsilon_0^2}$, it holds that

$$(1-\epsilon_0)\,H^{(t)} \preceq \nabla^2 F(x^{(t)}) \preceq (1+\epsilon_0)\,H^{(t)},$$

with probability at least $1-\delta$.

Proof of Theorem 10 Once the sample size $|S|$ is properly chosen, the approximate Hessian in Eqn. (14) satisfies the condition (2). Then the local and global convergence properties of Algorithm 4 can be obtained by Theorem 3 and Theorem 5, respectively.

Proof of Lemma 12 Let us denote

$$X_i = [\nabla^2 F(x^{(t)}) + \xi I]^{-1/2}\,\big(\nabla^2 f_i(x) + \xi I\big)\,[\nabla^2 F(x^{(t)}) + \xi I]^{-1/2}, \quad\text{and}\quad Y = \sum_{i\in S} X_i.$$

Then we can obtain that $\|X_i\| \le \frac{K+\xi}{\mu+\xi}$. Because the $\nabla^2 f_i(x)$ are chosen uniformly, we have $\mathbb{E}[Y] = \sum_{i\in S}\mathbb{E}[X_i] = |S|\,I$. Hence, we can obtain that

$$\lambda_{\max}(\mathbb{E}[Y]) = \lambda_{\min}(\mathbb{E}[Y]) = |S|. \tag{41}$$
By Lemma 17, we have

$$P\Big(\lambda_{\min}(Y) \le \frac{2}{3}|S|\Big) \le d\,\exp\Big(-\frac{|S|}{18(K+\xi)/(\mu+\xi)}\Big).$$

Letting us choose $|S| = \frac{18(K+\xi)\log(d/\delta)}{\mu+\xi}$, it holds with probability at least $1-\delta$ that

$$\frac{1}{|S|}\sum_{i\in S}\big(\nabla^2 f_i(x) + \xi I\big) \succeq \frac{2}{3}\big(\nabla^2 F(x^{(t)}) + \xi I\big) \succeq \frac{2}{3}\,\nabla^2 F(x^{(t)}), \tag{42}$$

which implies that

$$\nabla^2 F(x^{(t)}) \preceq \Big(1 + \frac{L-2\xi}{2(L+\xi)}\Big)H^{(t)}.$$

Also by Lemma 17, we have

$$P\Big(\lambda_{\max}(Y) \ge \frac{3}{2}|S|\Big) \le d\,\exp\Big(-\frac{|S|}{12(K+\xi)/(\mu+\xi)}\Big).$$

By a proof similar to the above, we can obtain that if we choose $|S| = \frac{12(K+\xi)\log(d/\delta)}{\mu+\xi}$, it holds with probability at least $1-\delta$ that

$$\frac{1}{|S|}\sum_{i\in S}\big(\nabla^2 f_i(x) + \xi I\big) \preceq \frac{3}{2}\big(\nabla^2 F(x^{(t)}) + \xi I\big), \tag{43}$$

which implies that

$$\frac{2\mu}{3\xi+3\mu}\,H^{(t)} \preceq \nabla^2 F(x^{(t)}).$$

Therefore, by choosing $|S| = \frac{18(K+\xi)\log(2d/\delta)}{\mu+\xi}$, it holds with probability at least $1-\delta$ that

$$\frac{2\mu}{3\xi+3\mu}\,H^{(t)} \preceq \nabla^2 F(x^{(t)}) \preceq \Big(1 + \frac{L-2\xi}{2(L+\xi)}\Big)H^{(t)}.$$

Proof of Theorem 13 Once the sample size $|S|$ is properly chosen, the approximate Hessian in Eqn. (15) satisfies the condition (2) with $\epsilon_0 = \max\big\{\frac{3\xi+\mu}{3\xi+3\mu},\, \frac{L-2\xi}{2(L+\xi)}\big\}$. Then the local and global convergence properties of Algorithm 5 can be obtained by Theorem 3 and Theorem 5, respectively.

Proof of Lemma 14 Let us denote

$$H_S = \frac{1}{|S|}\sum_{i\in S}\nabla^2 f_i(x),$$

and let $\lambda_{r+1}$ be the $(r+1)$-th largest eigenvalue of $\nabla^2 F(x^{(t)})$. By the proof of Theorem 13 and Eqn. (42), if we choose $|S| = \frac{18K\log(d/\delta)}{\lambda_{r+1}}$, then we have

$$H_S \succeq \frac{2}{3}\,\nabla^2 F(x^{(t)}) - \frac{\lambda_{r+1}}{3}\,I. \tag{44}$$

Moreover, by Eqn. (43) and choosing $|S| = \frac{12K\log(d/\delta)}{\lambda_{r+1}}$, we can obtain that

$$H_S \preceq \frac{3}{2}\,\nabla^2 F(x^{(t)}) + \frac{\lambda_{r+1}}{2}\,I. \tag{45}$$

By Corollary 7.7.4 (c) of Horn and Johnson (2012), Eqn. (44) and (45) imply that

$$\frac{1}{3}\lambda_{r+1} \le \lambda_{r+1}(H_S) \le 2\lambda_{r+1}. \tag{46}$$

Let us express the SVD of $H_S^{(t)}$ as follows:

$$H_S^{(t)} = U\hat\Lambda U^T = U_r\hat\Lambda_r U_r^T + U_{\backslash r}\hat\Lambda_{\backslash r} U_{\backslash r}^T.$$

Then $H^{(t)}$ can be represented as

$$H^{(t)} = H_S + U\begin{pmatrix} 0 & 0 \\ 0 & \lambda_{r+1}(H_S)\,I - \hat\Lambda_{\backslash r}\end{pmatrix}U^T.$$

By Eqn. (44) and $\frac{1}{3}\lambda_{r+1} \le \lambda_{r+1}(H_S)$ (Eqn. (46)), we have

$$H^{(t)} \succeq \frac{2}{3}\,\nabla^2 F(x^{(t)}) - \frac{\lambda_{r+1}}{3}\,I + U\begin{pmatrix} 0 & 0 \\ 0 & \lambda_{r+1}(H_S)\,I - \hat\Lambda_{\backslash r}\end{pmatrix}U^T,$$

which implies that

$$\nabla^2 F(x^{(t)}) \preceq \Big(1+\frac{1}{2}\Big)H^{(t)}.$$

By Eqn. (45) and (46), we have

$$H^{(t)} \preceq \frac{3}{2}\,\nabla^2 F(x^{(t)}) + \frac{\lambda_{r+1}}{2}\,I + U\begin{pmatrix} 0 & 0 \\ 0 & \lambda_{r+1}(H_S)\,I - \hat\Lambda_{\backslash r}\end{pmatrix}U^T \preceq \frac{3}{2}\,\nabla^2 F(x^{(t)}) + \frac{5}{2}\lambda_{r+1}\,I,$$

which implies that

$$\frac{2\mu}{5\lambda_{r+1}+3\mu}\,H^{(t)} \preceq \nabla^2 F(x^{(t)}).$$
Therefore, if choosing $|S| = \frac{18K\log(2d/\delta)}{\lambda_{r+1}}$, we can obtain that

$$\frac{2\mu}{5\lambda_{r+1}+3\mu}\,H^{(t)} \preceq \nabla^2 F(x^{(t)}) \preceq \Big(1+\frac{1}{2}\Big)H^{(t)},$$

which concludes the proof.

Proof of Theorem 15 Once the sample size $|S|$ is properly chosen, the approximate Hessian in Eqn. (17) satisfies the condition (2) with $\epsilon_0 = \max\big\{\frac{5\lambda_{r+1}+\mu}{5\lambda_{r+1}+3\mu},\, \frac{1}{2}\big\}$. Then the local and global convergence properties of Algorithm 6 can be obtained by Theorem 3 and Theorem 5, respectively.

References

Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Richard H Byrd, Gillian M Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995, 2011.

Kenneth L Clarkson and David P Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 81–90. ACM, 2013.

Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems, pages 1647–1655, 2011.

Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1127–1136. Society for Industrial and Applied Mathematics, 2006.

Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506, 2012.

Murat A Erdogdu and Andrea Montanari. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems, pages 3034–3042, 2015.
N Halko, P G Martinsson, and J A Tropp. Finding structure with randomness: Probabilistic algorithms for matrix decompositions. SIAM Review, 53(2):217–288, 2011.

Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 2012.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.

Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 661–670. ACM, 2014.

Xiangrui Meng and Michael W Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 91–100. ACM, 2013.

Jelani Nelson and Huy L Nguyên. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 117–126. IEEE, 2013.

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

James M Ortega and Werner C Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables, volume 30. SIAM, 1970.

Mert Pilanci and Martin J Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.

Herbert Robbins and Sutton Monro. A stochastic approximation method.
The Annals of Mathematical Statistics, pages 400–407, 1951.

Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016.

Farbod Roosta-Khorasani and Michael W Mahoney. Sub-sampled Newton methods. Mathematical Programming, 174(1-2):293–326, 2019.

Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Shusen Wang, Fred Roosta, Peng Xu, and Michael W Mahoney. GIANT: Globally improved approximate Newton method for distributed optimization. In Advances in Neural Information Processing Systems, pages 2332–2342, 2018.

David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, and Michael W Mahoney. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, pages 3000–3008, 2016.

Haishan Ye, Luo Luo, and Zhihua Zhang. Approximate Newton methods and their local convergence. In Proceedings of the 34th International Conference on Machine Learning, pages 3931–3939. JMLR.org, 2017.

Lijun Zhang, Mehrdad Mahdavi, and Rong Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems 26 (NIPS), pages 980–988, 2013.