# Sub-sampled Newton Methods with Non-uniform Sampling

Peng Xu (Stanford University, pengxu@stanford.edu), Jiyan Yang (Stanford University, jiyan@stanford.edu), Farbod Roosta-Khorasani (University of California at Berkeley, farbod@icsi.berkeley.edu), Christopher Ré (Stanford University, chrismre@cs.stanford.edu), Michael W. Mahoney (University of California at Berkeley, mmahoney@stat.berkeley.edu)

Abstract

We consider the problem of finding the minimizer of a convex function $F: \mathbb{R}^d \to \mathbb{R}$ of the form $F(w) := \sum_{i=1}^n f_i(w) + R(w)$, where a low-rank factorization of $\nabla^2 f_i(w)$ is readily available. We consider the regime where $n \gg d$. We propose randomized Newton-type algorithms that exploit non-uniform sub-sampling of $\{\nabla^2 f_i(w)\}_{i=1}^n$, as well as inexact updates, as means to reduce the computational complexity, and that are applicable to a wide range of problems in machine learning. Two non-uniform sampling distributions based on block norm squares and block partial leverage scores are considered. Under certain assumptions, we show that our algorithms inherit a linear-quadratic convergence rate in $w$ and achieve a lower computational complexity compared to similar existing methods. In addition, we show that our algorithms exhibit more robustness and better dependence on problem-specific quantities, such as the condition number. We empirically demonstrate that our methods are at least twice as fast as Newton's method on several real datasets.

1 Introduction

Many machine learning applications involve finding the minimizer of optimization problems of the form

$$\min_{w \in \mathcal{C}}\ F(w) := \sum_{i=1}^n f_i(w) + R(w), \tag{1}$$

where $f_i(w)$ is a smooth convex function, $R(w)$ is a regularizer, and $\mathcal{C} \subseteq \mathbb{R}^d$ is a convex constraint set (e.g., the $\ell_1$ ball). Examples include sparse least squares [21], generalized linear models (GLMs) [8], and metric learning problems [12].

First-order optimization algorithms have been the workhorse of machine learning applications, and there is a plethora of such methods [3, 17] for solving (1). However, for ill-conditioned problems, it is often the case that first-order methods return a solution far from $w^*$, albeit with a low objective value. On the other hand, most second-order algorithms prove to be more robust to such adversarial effects. This is because, by using curvature information, second-order methods properly rescale the gradient so that it is a more appropriate direction to follow. For example, take the canonical second-order method, i.e., Newton's method, which, in the unconstrained case, has updates of the form $w_{t+1} = w_t - [H(w_t)]^{-1} g(w_t)$ (here, $g(w_t)$ and $H(w_t)$ denote the gradient and the Hessian of $F$ at $w_t$, respectively). Classical results indicate that, under certain assumptions, Newton's method can achieve a locally super-linear convergence rate, which can be shown to be problem independent. Nevertheless, the cost of forming and inverting the Hessian is a major drawback of Newton's method in practice. In this regard, there has been a long line of work aiming at providing sufficient second-order information more efficiently, e.g., the classical BFGS algorithm and its limited-memory version [14, 17]. As the mere evaluation of $H(w)$ grows linearly in $n$, a natural idea is to use uniform sub-sampling of $\{\nabla^2 f_i(w)\}_{i=1}^n$ as a way to reduce the cost of such evaluation [7, 19, 20]. However, in the presence of high non-uniformity among $\{\nabla^2 f_i(w)\}_{i=1}^n$, the sampling size required to sufficiently capture the curvature information of the Hessian can be very large.
In such situations, non-uniform sampling can indeed be a much better alternative and is addressed in this work in detail.

In this work, we propose novel, robust, and highly efficient non-uniformly sub-sampled Newton methods (SSN) for a large sub-class of problem (1), in which the Hessian of $F(w)$ in (1) can be written as $H(w) = \sum_{i=1}^n A_i^T(w) A_i(w) + Q(w)$, where $A_i(w) \in \mathbb{R}^{k_i \times d}$, $i = 1, 2, \ldots, n$, are readily available and $Q(w)$ is some positive semi-definite matrix. This situation arises very frequently in machine learning problems. For example, take any problem where $f_i(w) = \ell(x_i^T w)$, $\ell(\cdot)$ is any convex loss function, and the $x_i$'s are data points. In such situations, $A_i(w)$ is simply $\sqrt{\ell''(x_i^T w)}\, x_i^T$. Under this setting, non-uniformly sub-sampling the Hessians boils down to building an appropriate non-uniform distribution with which to sub-sample the most relevant terms among $\{A_i(w)\}_{i=1}^n$. The approximate Hessian, denoted by $\widetilde{H}(w_t)$, is then used to update the current iterate as $w_{t+1} = w_t - [\widetilde{H}(w_t)]^{-1} g(w_t)$. Furthermore, in order to improve the overall efficiency of our SSN algorithms, we allow the linear system in the sub-problem to be solved inexactly, i.e., using only a few iterations of an iterative solver such as Conjugate Gradient (CG). Such inexact updates, used in many second-order optimization algorithms, have been well studied in [4, 5].
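As a concrete illustration of the block structure $H(w) = \sum_i A_i^T(w) A_i(w) + Q(w)$ described above, the following minimal NumPy sketch (our own illustration, not code from the paper) forms the blocks $A_i(w) = \sqrt{\ell''(x_i^T w)}\, x_i^T$ for a logistic-type loss and checks that stacking them recovers the Hessian; the function names and the choice $Q = \lambda I$ are illustrative assumptions.

```python
import numpy as np

def hessian_blocks(X, w, ell_second):
    """Rows of the returned matrix are the blocks A_i(w) = sqrt(ell''(x_i^T w)) x_i^T."""
    z = X @ w                                # x_i^T w for all i
    return np.sqrt(ell_second(z))[:, None] * X

def logistic_ell_second(u):
    """ell(u) = log(1 + exp(-u))  =>  ell''(u) = sigma(u) * (1 - sigma(u))."""
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)

rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 0.1
X, w = rng.standard_normal((n, d)), rng.standard_normal(d)
Q = lam * np.eye(d)                          # e.g., the Hessian of a ridge regularizer (illustrative)

A = hessian_blocks(X, w, logistic_ell_second)   # n x d; row i is A_i(w)
H = A.T @ A + Q                                 # H(w) = sum_i A_i^T A_i + Q(w)
H_ref = sum(logistic_ell_second(X[i] @ w) * np.outer(X[i], X[i]) for i in range(n)) + Q
assert np.allclose(H, H_ref)
```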
As we shall see (in Section 4), our algorithms converge much faster than other competing methods for a variety of problems. In particular, on several machine learning datasets, our methods are at least twice as fast as Newton's method in finding a high-precision solution while other methods converge slowly. Indeed, this phenomenon is well supported by our theoretical findings: the complexity of our algorithms has a lower dependence on the problem condition number and is immune to any non-uniformity among $\{A_i(w)\}_{i=1}^n$, which may otherwise cause a factor of $n$ in the complexity (Table 1). In the following we present details of our main contributions and connections to prior work. Readers interested in more details should see the technical report version of this conference paper [23] for proofs of our main results, additional theoretical results, and a more detailed empirical evaluation.

1.1 Contributions and related work

Recently, within the context of randomized second-order methods, many algorithms have been proposed that aim at reducing the computational costs of the pure Newton's method. Among them, algorithms that employ uniform sub-sampling constitute a popular line of work [4, 7, 16, 22]. In particular, [19, 20] consider a more general class of problems and, under a variety of conditions, thoroughly study the local and global convergence properties of sub-sampled Newton methods where the gradient and/or the Hessian are uniformly sub-sampled. Our work here, however, is more closely related to a recent work [18] (Newton Sketch), which considers a similar class of problems and proposes sketching the Hessian using random sub-Gaussian matrices or randomized orthonormal systems. Furthermore, [1] proposes a stochastic algorithm (LiSSA) that, for solving the sub-problems, employs unbiased estimators of the inverse of the Hessian. In light of these prior works, our contributions can be summarized as follows.

- For the class of problems considered here, unlike the uniform sampling used in [4, 7, 19, 20], we employ two non-uniform sampling schemes based on block norm squares and on a new, and more general, notion of leverage scores named block partial leverage scores (Definition 1). It can be shown that in the case of extreme non-uniformity among $\{A_i(w)\}_{i=1}^n$, uniform sampling might require $\Omega(n)$ samples to capture the Hessian information appropriately. However, we show that our non-uniform sampling schemes result in sample sizes completely independent of $n$ and immune to such non-uniformity.
- Within the context of globally convergent randomized second-order algorithms, [4, 20] incorporate inexact updates where the sub-problems are solved only approximately. We extend the study of inexactness to our local convergence analysis.
- We provide a general structural result (Lemma 2) showing that, as in [7, 18, 19], our main algorithm exhibits a linear-quadratic solution error recursion. However, we show that by using our non-uniform sampling strategies, the factors appearing in this error recursion enjoy a much better dependence on problem-specific quantities such as the condition number (Table 2). For example, using block partial leverage score sampling, the factor for the linear term of the error recursion (5) is of order $O(\sqrt{\kappa})$, as opposed to $O(\kappa)$ for uniform sampling.
- We demonstrate that to achieve a locally problem-independent linear convergence rate, i.e., $\|w_{t+1} - w^*\| \le \rho \|w_t - w^*\|$ for some fixed $\rho < 1$, our algorithms achieve a lower per-iteration complexity compared to [1, 18, 20] (Table 1). In particular, unlike Newton Sketch [18], which employs random projections and fails to preserve the sparsity of $\{A_i(w)\}_{i=1}^n$, our methods indeed take advantage of such sparsity. Also, in the presence of high non-uniformity among $\{A_i(w)\}_{i=1}^n$, the factors $\bar{\kappa}$ and $\hat{\kappa}$ (defined in (6)) that appear in SSN (uniform) [19] and LiSSA [1] can potentially be as large as $\Omega(n\kappa)$; see Section 3.5 for a detailed discussion.
- We numerically demonstrate the effectiveness and robustness of our algorithms in recovering the minimizer of ridge logistic regression on several real datasets (Figures 1 and 2). In particular, our algorithms are at least twice as fast as Newton's method in finding a high-precision solution while other methods converge slowly.

Table 1: Complexity per iteration of different methods to obtain a problem-independent local linear convergence rate. The quantities $\kappa$, $\hat{\kappa}$, and $\bar{\kappa}$ are the local condition numbers, defined in (6), satisfying $\kappa \le \hat{\kappa} \le \bar{\kappa}$, at the optimum $w^*$. $A$ is defined in Assumption A.3 and $\mathrm{sr}(A)$ is the stable rank of $A$, satisfying $\mathrm{sr}(A) \le d$. Here we assume $k_i = 1$, $\mathcal{C} = \mathbb{R}^d$, $R(w) = 0$, and that CG is used for solving the sub-problems in our algorithms.

| Name | Complexity per iteration | Reference |
| --- | --- | --- |
| Newton-CG method | $O(\mathrm{nnz}(A)\sqrt{\kappa})$ | [17] |
| SSN (leverage scores) | $O(\mathrm{nnz}(A)\log n + d^2\kappa^{3/2})$ | This paper |
| SSN (row norm squares) | $O(\mathrm{nnz}(A) + \mathrm{sr}(A)\, d\, \kappa^{5/2})$ | This paper |
| Newton Sketch (SRHT) | $O(nd(\log n)^4 + d^2(\log n)^4\kappa^{3/2})$ | [18] |
| SSN (uniform) | $O(\mathrm{nnz}(A) + d\,\hat{\kappa}\,\kappa^{3/2})$ | [20] |
| LiSSA | $O(\mathrm{nnz}(A) + d\,\hat{\kappa}\,\bar{\kappa}^2)$ | [1] |

1.2 Notation and assumptions

Given a function $F$, the gradient, the exact Hessian, and the approximate Hessian are denoted by $g$, $H$, and $\widetilde{H}$, respectively. The iteration counter is denoted by a subscript, e.g., $w_t$. Unless stated otherwise, $\|\cdot\|$ denotes the Euclidean norm for vectors and the spectral norm for matrices. The Frobenius norm of a matrix is written as $\|\cdot\|_F$.
By a matrix $A$ having $n$ blocks, we mean that $A$ has a block structure and can be viewed as $A = \begin{bmatrix} A_1^T & \cdots & A_n^T \end{bmatrix}^T$ for blocks $A_i$ of appropriate sizes. The tangent cone of the constraint set $\mathcal{C}$ at the optimum $w^*$ is denoted by $\mathcal{K}$ and defined as $\mathcal{K} = \{\Delta \mid w^* + t\Delta \in \mathcal{C} \text{ for some } t > 0\}$. Given a symmetric matrix $A$, the $\mathcal{K}$-restricted minimum and maximum eigenvalues of $A$ are defined, respectively, as $\lambda_{\min}^{\mathcal{K}}(A) = \min_{x \in \mathcal{K}\setminus\{0\}} x^T A x / x^T x$ and $\lambda_{\max}^{\mathcal{K}}(A) = \max_{x \in \mathcal{K}\setminus\{0\}} x^T A x / x^T x$. The stable rank of a matrix $A$ is defined as $\mathrm{sr}(A) = \|A\|_F^2 / \|A\|_2^2$. We use $\mathrm{nnz}(A)$ to denote the number of non-zero elements in $A$. Throughout the paper, we make use of the following assumptions:

A.1 (Lipschitz Continuity) $F(w)$ is convex and twice differentiable with an $L$-Lipschitz Hessian, i.e., $\|H(u) - H(v)\| \le L\|u - v\|$ for all $u, v \in \mathcal{C}$.

A.2 (Local Regularity) $F(w)$ is locally strongly convex and smooth, i.e., $\mu = \lambda_{\min}^{\mathcal{K}}(H(w^*)) > 0$ and $\nu = \lambda_{\max}^{\mathcal{K}}(H(w^*)) < \infty$. Here we define the local condition number of the problem as $\kappa := \nu/\mu$.

A.3 (Hessian Decomposition) For each $f_i(w)$ in (1), define $\nabla^2 f_i(w) := H_i(w) := A_i^T(w) A_i(w)$. For simplicity, we assume $k_1 = \cdots = k_n = k$ and that $k$ is independent of $d$. Furthermore, we assume that, given $w$, computing $A_i(w)$, $H_i(w)$, and $g(w)$ takes $O(d)$, $O(d^2)$, and $O(\mathrm{nnz}(A))$ time, respectively. We call the matrix $A(w) = [A_1^T, \ldots, A_n^T]^T \in \mathbb{R}^{nk \times d}$ the augmented matrix of $\{A_i(w)\}$. Note that $H(w) = A(w)^T A(w) + Q(w)$.

2 Main Algorithm: SSN with Non-uniform Sampling

Our proposed SSN method with non-uniform sampling is given in Algorithm 1. The core of our algorithm is a sampling scheme $\mathcal{S}$ that, at every iteration, constructs a non-uniform sampling distribution $\{p_i\}_{i=1}^n$ over $\{A_i(w_t)\}_{i=1}^n$ and then samples from $\{A_i(w_t)\}_{i=1}^n$ to form the approximate Hessian $\widetilde{H}(w_t)$. The sampling sizes $s$ needed for the different sampling distributions are discussed in Section 3.2. Since $H(w) = \sum_{i=1}^n A_i^T(w) A_i(w) + Q(w)$, the Hessian approximation essentially boils down to a matrix approximation problem. Here, we generalize two popular non-uniform sampling strategies, i.e., leverage score sampling and row norm squares sampling, which are commonly used in the field of randomized linear algebra, particularly for matrix approximation problems [10, 15]. With an approximate Hessian constructed via non-uniform sampling, we may choose an appropriate solver $\mathcal{A}$ to solve the sub-problem in Step 11 of Algorithm 1. Below we elaborate on the construction of the two non-uniform sampling schemes.

Block Norm Squares Sampling. This is done by constructing a sampling distribution based on the Frobenius norms of the blocks $A_i$, i.e., $p_i = \|A_i\|_F^2 / \|A\|_F^2$, $i = 1, \ldots, n$. This is an extension of row norm squares sampling, in which the intuition is to capture the importance of the blocks based on the magnitudes of the sub-Hessians [10].
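As an illustration (ours, not code from the paper), the block norm squares distribution requires only the Frobenius norms of the blocks and can be formed in one pass; for $k = 1$ it reduces to row norm squares sampling:

```python
import numpy as np

def block_norm_squares_probs(A_blocks):
    """p_i = ||A_i||_F^2 / ||A||_F^2 for a list of blocks A_i(w)."""
    norms_sq = np.array([np.linalg.norm(Ai, 'fro') ** 2 for Ai in A_blocks])
    return norms_sq / norms_sq.sum()

# For k = 1 every block is a single row, so this is just row norm squares sampling:
A = np.random.default_rng(1).standard_normal((1000, 20))
p = block_norm_squares_probs([A[i:i + 1] for i in range(A.shape[0])])
assert np.isclose(p.sum(), 1.0)
```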
Block Partial Leverage Scores Sampling. Recall that the standard leverage scores of a matrix $A$ are defined as the diagonal elements of the hat matrix $A(A^TA)^{-1}A^T$ [15], and they prove to be very useful in matrix approximation algorithms. However, in contrast to the standard case, there are two major differences in our task. First, blocks, not rows, are being sampled. Second, an additional matrix $Q$ is involved in the target matrix, i.e., $H$. In light of this, we introduce a new and more general notion of leverage scores, called block partial leverage scores.

Definition 1 (Block Partial Leverage Scores). Given a matrix $A \in \mathbb{R}^{kn \times d}$, viewed as having $n$ blocks of size $k \times d$, and an SPSD matrix $Q \in \mathbb{R}^{d \times d}$, let $\{\tau_i\}_{i=1}^{kn+d}$ be the (standard) leverage scores of the augmented matrix $\begin{bmatrix} A \\ Q^{1/2} \end{bmatrix}$. The block partial leverage score for the $i$-th block is defined as $\tau_i^Q(A) = \sum_{j=k(i-1)+1}^{ki} \tau_j$.

Note that for $k = 1$ and $Q = 0$, the block partial leverage score is simply the standard leverage score. The sampling distribution is defined as $p_i = \tau_i^Q(A) / \sum_{j=1}^n \tau_j^Q(A)$, $i = 1, \ldots, n$.

Algorithm 1: Sub-sampled Newton method with non-uniform sampling
1: Input: initialization point $w_0$, number of iterations $T$, sampling scheme $\mathcal{S}$, and solver $\mathcal{A}$.
2: Output: $w_T$
3: for $t = 0, \ldots, T-1$ do
4:   Construct the non-uniform sampling distribution $\{p_i\}_{i=1}^n$ as described in Section 2.
5:   for $i = 1, \ldots, n$ do
6:     $q_i = \min\{s \cdot p_i, 1\}$, where $s$ is the sampling size.
7:     $\widetilde{A}_i(w_t) = A_i(w_t)/\sqrt{q_i}$ with probability $q_i$, and $\widetilde{A}_i(w_t) = 0$ with probability $1 - q_i$.
8:   end for
9:   $\widetilde{H}(w_t) = \sum_{i=1}^n \widetilde{A}_i^T(w_t)\, \widetilde{A}_i(w_t) + Q(w_t)$.
10:  Compute $g(w_t)$.
11:  Use solver $\mathcal{A}$ to solve the sub-problem inexactly:
$$w_{t+1} \approx \arg\min_{w \in \mathcal{C}} \left\{ \tfrac{1}{2}\langle w - w_t,\ \widetilde{H}(w_t)(w - w_t)\rangle + \langle g(w_t),\ w - w_t\rangle \right\}. \tag{2}$$
12: end for
13: return $w_T$
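The following is a minimal NumPy sketch of Algorithm 1 for the unconstrained, $k = 1$ case, assuming $Q(w) = \lambda I$ and an exact dense solve in place of the inexact solver $\mathcal{A}$ (Section 3.3 discusses the CG alternative); the function names and interfaces are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def ssn_nonuniform(w0, grad, blocks, probs, s, lam, T, seed=0):
    """Sketch of Algorithm 1 with C = R^d, k = 1, and Q(w) = lam * I.

    grad(w)   -> gradient g(w), shape (d,)
    blocks(w) -> A(w), shape (n, d); row i is the block A_i(w)
    probs(A)  -> sampling distribution {p_i} over the rows (the scheme S)
    s         -> sampling size
    """
    rng = np.random.default_rng(seed)
    w, d = w0.copy(), w0.shape[0]
    for _ in range(T):
        A = blocks(w)
        q = np.minimum(s * probs(A), 1.0)                # q_i = min(s p_i, 1)
        keep = rng.random(A.shape[0]) < q                # keep block i with probability q_i
        A_tilde = A[keep] / np.sqrt(q[keep])[:, None]    # kept blocks rescaled by 1/sqrt(q_i)
        H_tilde = A_tilde.T @ A_tilde + lam * np.eye(d)  # approximate Hessian
        w = w - np.linalg.solve(H_tilde, grad(w))        # exact solve of sub-problem (2)
    return w
```

Passing the block norm squares routine sketched above as `probs` gives the row norm squares variant; plugging in a block partial leverage score routine instead gives the leverage score variant.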
3 Theoretical Results

In this section we provide a detailed complexity analysis of our algorithm.¹ Different choices of the sampling scheme $\mathcal{S}$ and the sub-problem solver $\mathcal{A}$ lead to different complexities in SSN. More precisely, the total complexity is characterized by the following four factors: (i) the total number of iterations $T$, determined by the convergence rate, which is affected by the choice of $\mathcal{S}$ and $\mathcal{A}$ (see Lemma 2 in Section 3.1); (ii) the time $t_{\mathrm{grad}}$ it takes to compute the full gradient $g(w_t)$ (Step 10 in Algorithm 1); (iii) the time $t_{\mathrm{const}}$ to construct the sampling distribution $\{p_i\}_{i=1}^n$ and sample $s$ terms at each iteration (Steps 4-8 in Algorithm 1), which is determined by $\mathcal{S}$ (see Section 3.2 for details); and (iv) the time $t_{\mathrm{solve}}$ needed to (implicitly) form $\widetilde{H}$ and (inexactly) solve the sub-problem at each iteration (Steps 9 and 11 in Algorithm 1), which is affected by the choices of both $\mathcal{S}$ (manifested in the sampling size $s$) and $\mathcal{A}$ (see Sections 3.2 and 3.3 for details). With these, the total complexity can be expressed as

$$T \cdot (t_{\mathrm{grad}} + t_{\mathrm{const}} + t_{\mathrm{solve}}). \tag{3}$$

¹ In this work, we only focus on local convergence guarantees for Algorithm 1. To ensure global convergence, one can incorporate an existing globally convergent method, e.g. [20], as an initial phase and switch to Algorithm 1 once the iterate is close enough to the optimum; see Lemma 2.

Below we study these contributing factors. Moreover, the per-iteration complexity of our algorithm for achieving a problem-independent linear convergence rate is presented in Section 3.4, and a comparison to other related work is discussed in Section 3.5.

3.1 Local linear-quadratic error recursion

Before diving into the details of the complexity analysis, we state a structural lemma that characterizes the local convergence rate of our main algorithm, i.e., Algorithm 1. As discussed earlier, there are two layers of approximation in Algorithm 1, i.e., approximation of the Hessian by sub-sampling and inexactness in solving (2). For the first layer, we require the approximate Hessian to satisfy one of the following two conditions (in Section 3.2 we shall see that our construction of the approximate Hessian via non-uniform sampling can achieve these conditions with a sampling size independent of $n$):

$$\|\widetilde{H}(w_t) - H(w_t)\| \le \epsilon\, \|H(w_t)\|, \tag{C1}$$

or

$$|x^T(\widetilde{H}(w_t) - H(w_t))y| \le \epsilon\, \sqrt{x^T H(w_t) x}\, \sqrt{y^T H(w_t) y}, \quad \forall x, y \in \mathcal{K}. \tag{C2}$$

Note that (C1) and (C2) are two commonly seen guarantees for matrix approximation problems. In particular, (C2) is stronger in the sense that the spectrum of $H(w_t)$ is well preserved by the approximation. Below, in Lemma 2, we shall see that this stronger condition ensures a better dependence on the condition number in terms of the convergence rate. For the second layer of approximation, we require the solver to produce an $\epsilon_0$-approximate solution $w_{t+1}$ satisfying

$$\|w_{t+1} - w_{t+1}^*\| \le \epsilon_0\, \|w_t - w_{t+1}^*\|, \tag{4}$$

where $w_{t+1}^*$ is the exact optimal solution to (2). Note that (4) implies an $\epsilon_0$-relative error approximation to the exact update direction, i.e., $\|v - v^*\| \le \epsilon_0 \|v^*\|$, where $v = w_{t+1} - w_t$ and $v^* = w_{t+1}^* - w_t$.

Lemma 2 (Structural Result). Let $\epsilon \in (0, 1/2)$ and $\epsilon_0$ be given, and let $\{w_t\}_{t=1}^T$ be a sequence generated by (2) which satisfies (4). Also assume that the initial point $w_0$ satisfies $\|w_0 - w^*\| \le \frac{\mu}{4L}$. Under Assumptions A.1 and A.2, the solution error satisfies the recursion

$$\|w_{t+1} - w^*\| \le (1 + \epsilon_0)\, C_q \|w_t - w^*\|^2 + \big(\epsilon_0 + (1 + \epsilon_0) C_l\big) \|w_t - w^*\|, \tag{5}$$

where $C_q$ and $C_l$ are specified as follows:

$$C_q = \frac{2L}{(1 - 2\epsilon\kappa)\mu} \quad \text{and} \quad C_l = \frac{4\epsilon\kappa}{1 - 2\epsilon\kappa}, \quad \text{if condition (C1) is met;}$$

$$C_q = \frac{2L}{(1 - \epsilon)\mu} \quad \text{and} \quad C_l = \frac{3\epsilon\sqrt{\kappa}}{1 - \epsilon}, \quad \text{if condition (C2) is met.}$$

3.2 Complexities related to the choice of sampling scheme $\mathcal{S}$

The following lemma gives the complexity of constructing the sampling distributions used in this paper. Here, we adopt the fast approximation algorithm for standard leverage scores [6] to obtain an efficient approximation to our block partial leverage scores.

Lemma 3 (Construction Complexity). Under Assumption A.3, it takes $t_{\mathrm{const}} = O(\mathrm{nnz}(A))$ time to construct the block norm squares sampling distribution, and it takes $t_{\mathrm{const}} = O(\mathrm{nnz}(A)\log n)$ time to construct, with high probability, a distribution that is a constant-factor approximation to the block partial leverage scores.

The following theorem indicates that if the blocks of the augmented matrix of $\{A_i(w)\}$ (see Assumption A.3) are sampled based on block norm squares or block partial leverage scores with a large enough sampling size, then (C1) or (C2) holds, respectively, with high probability.

Theorem 4 (Sufficient Sample Size). Given any $\epsilon \in (0, 1)$, the following statements hold:

(i) Let $r_i = \|A_i\|_F^2$, $i = 1, \ldots, n$, set $p_i = r_i / \sum_{j=1}^n r_j$, and construct $\widetilde{H}$ as in Steps 5-9 of Algorithm 1. Then if $s \ge 4\,\mathrm{sr}(A)\, \log\big(\min\{4\,\mathrm{sr}(A), d\}/\delta\big)/\epsilon^2$, with probability at least $1 - \delta$, (C1) holds.

(ii) Let $\{\hat{\tau}_i^Q(A)\}_{i=1}^n$ be overestimates of the block partial leverage scores, i.e., $\hat{\tau}_i^Q(A) \ge \tau_i^Q(A)$, $i = 1, \ldots, n$, and set $p_i = \hat{\tau}_i^Q(A) / \sum_{j=1}^n \hat{\tau}_j^Q(A)$, $i = 1, \ldots, n$. Construct $\widetilde{H}$ as in Steps 5-9 of Algorithm 1. Then if $s \ge 4 \big(\sum_{i=1}^n \hat{\tau}_i^Q(A)\big) \log(4d/\delta)/\epsilon^2$, with probability at least $1 - \delta$, (C2) holds.

Remarks. Part (i) of Theorem 4 is an extension of [10] to our particular augmented-matrix setting. Also, since for the exact block partial leverage scores we have $\sum_{i=1}^n \tau_i^Q(A) \le d$, part (ii) of Theorem 4 implies that, using exact scores, fewer than $O(d \log d / \epsilon^2)$ blocks are needed for (C2) to hold.

3.3 Complexities related to the choice of solver $\mathcal{A}$

We now discuss how $t_{\mathrm{solve}}$ in (3) is affected by the choice of the solver $\mathcal{A}$ in Algorithm 1. The approximate Hessian $\widetilde{H}(w_t)$ is of the form $A^T A + Q$, where $A \in \mathbb{R}^{sk \times d}$. As a result, the complexity of solving the sub-problem (2) essentially depends on the choice of $\mathcal{A}$, the constraint set $\mathcal{C}$, $s$, and $d$, i.e., $t_{\mathrm{solve}} = T(\mathcal{A}, \mathcal{C}, s, d)$. For example, when the problem is unconstrained ($\mathcal{C} = \mathbb{R}^d$), CG takes $t_{\mathrm{solve}} = O(sd\sqrt{\tilde{\kappa}_t}\log(1/\epsilon))$ to return a solution with approximation quality $\epsilon_0 = \sqrt{\tilde{\kappa}_t}\,\epsilon$ in (4), where $\tilde{\kappa}_t = \lambda_{\max}(\widetilde{H}(w_t))/\lambda_{\min}(\widetilde{H}(w_t))$.
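As a sketch of the unconstrained inexact solve just described (our own illustration, assuming $Q = \lambda I$), CG only needs matrix-vector products with $\widetilde{H}(w_t) = \widetilde{A}^T \widetilde{A} + \lambda I$, so the approximate Hessian never has to be formed explicitly; the cap on CG iterations is the knob that trades accuracy $\epsilon_0$ for time.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def inexact_newton_direction(A_tilde, lam, g, max_cg_iters=20):
    """Approximately solve (A_tilde^T A_tilde + lam I) v = -g with a few CG iterations."""
    d = A_tilde.shape[1]
    H_tilde = LinearOperator(
        (d, d),
        matvec=lambda x: A_tilde.T @ (A_tilde @ x) + lam * x,  # O(s d) per product
        dtype=A_tilde.dtype,
    )
    v, info = cg(H_tilde, -g, maxiter=max_cg_iters)            # info > 0: stopped at maxiter
    return v

rng = np.random.default_rng(0)
s, d, lam = 200, 50, 0.1
A_tilde, g = rng.standard_normal((s, d)), rng.standard_normal(d)
v = inexact_newton_direction(A_tilde, lam, g)
```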
3.4 Total complexity per iteration

Lemma 2 implies that, by choosing appropriate values for $\epsilon$ and $\epsilon_0$, SSN inherits a local constant linear convergence rate, i.e., $\|w_{t+1} - w^*\| \le \rho \|w_t - w^*\|$ with $\rho < 1$. The following corollary gives the total complexity per iteration of Algorithm 1 to obtain such a locally linear rate.

Corollary 5. Suppose $\mathcal{C} = \mathbb{R}^d$ and CG is used to solve the sub-problem (2). Then, under Assumption A.3, to obtain a constant local linear convergence rate with constant probability, the complexity per iteration of Algorithm 1 using block partial leverage score sampling and block norm squares sampling is $O(\mathrm{nnz}(A)\log n + d^2\kappa^{3/2})$ and $O(\mathrm{nnz}(A) + \mathrm{sr}(A)\, d\, \kappa^{5/2})$, respectively.²

² In this paper, $O(\cdot)$ hides logarithmic factors of $d$, $\kappa$, and $1/\delta$.

3.5 Comparison with existing similar methods

As discussed above, the sampling scheme $\mathcal{S}$ plays a crucial role in the overall complexity of SSN. We first compare our proposed non-uniform sampling schemes with the uniform alternative [20], in terms of the complexities $t_{\mathrm{const}}$ and $t_{\mathrm{solve}}$ as well as the quality of the local linear-quadratic error recursion (5), measured by $C_q$ and $C_l$. Table 2 summarizes this comparison where, for simplicity, we assume that $k = 1$, $\mathcal{C} = \mathbb{R}^d$, and a direct solver is used for the linear-system sub-problem (2). Also, throughout this subsection, for randomized algorithms we choose parameters such that the failure probability is a constant. One advantage of uniform sampling is its simplicity of construction. However, as shown in Section 3.2, it takes only nearly input-sparsity time to construct the proposed non-uniform sampling distributions. In addition, when the rows of $A$ are very non-uniform, i.e., $\max_i \|A_i\| \approx \|A\|$, the uniform scheme requires $\Omega(n)$ samples to achieve (C1). It can also be seen that for a given $\epsilon$, row norm squares sampling requires the smallest sampling size, yielding the smallest $t_{\mathrm{solve}}$ in Table 2. More importantly, although either (C1) or (C2) is sufficient to give (5), having (C2), as in SSN with leverage score sampling, yields constants $C_q$ and $C_l$ with a much better dependence on the local condition number $\kappa$ than the other methods. This fact can drastically improve the performance of SSN for ill-conditioned problems; see Figure 1 in Section 4.

Table 2: Comparison between the standard Newton's method and sub-sampled Newton methods (SSN) with different sampling schemes. $C_q$ and $C_l$ are the constants appearing in (5), $A$ is the augmented matrix of $\{A_i(w)\}$ with stable rank $\mathrm{sr}(A)$, $\kappa = \nu/\mu$ is the local condition number, and $\kappa_L = L/\mu$. Here, we assume that $k = 1$, $\mathcal{C} = \mathbb{R}^d$, and a direct solver is used in Algorithm 1.

| Name | $t_{\mathrm{const}}$ | $t_{\mathrm{solve}} = s \cdot d^2$ | $C_q$ | $C_l$ |
| --- | --- | --- | --- | --- |
| Newton's method | 0 | $O(nd^2)$ | $\kappa_L$ | 0 |
| SSN (leverage scores) | $O(\mathrm{nnz}(A)\log n)$ | $O\big((\sum_i \tau_i^Q(A))\, d^2/\epsilon^2\big)$ | $\frac{\kappa_L}{1-\epsilon}$ | $\frac{\epsilon\sqrt{\kappa}}{1-\epsilon}$ |
| SSN (row norm squares) | $O(\mathrm{nnz}(A))$ | $O(\mathrm{sr}(A)\, d^2/\epsilon^2)$ | $\frac{\kappa_L}{1-\epsilon\kappa}$ | $\frac{\epsilon\kappa}{1-\epsilon\kappa}$ |
| SSN (uniform) [20] | $O(1)$ | $O\big(n \max_i\|A_i\|^2\, d^2/(\epsilon^2\|A\|^2)\big)$ | $\frac{\kappa_L}{1-\epsilon\kappa}$ | $\frac{\epsilon\kappa}{1-\epsilon\kappa}$ |

Next, recall that in Table 1 we summarized the per-iteration complexity needed by our algorithm and other similar methods [20, 1, 18] to achieve a given local linear convergence rate. Here we provide more details. First, the definitions of the various notions of condition number used in Table 1 are given below. For any given $w \in \mathbb{R}^d$, define

$$\kappa(w) = \frac{\lambda_{\max}\big(\sum_{i=1}^n H_i(w)\big)}{\lambda_{\min}\big(\sum_{i=1}^n H_i(w)\big)}, \qquad \hat{\kappa}(w) = \frac{n \max_i \lambda_{\max}(H_i(w))}{\lambda_{\min}\big(\sum_{i=1}^n H_i(w)\big)}, \qquad \bar{\kappa}(w) = \frac{\max_i \lambda_{\max}(H_i(w))}{\min_i \lambda_{\min}(H_i(w))}, \tag{6}$$

assuming that the denominators are non-zero. It is easy to see that $\kappa(w) \le \hat{\kappa}(w) \le \bar{\kappa}(w)$. However, the degree of discrepancy among these inequalities depends on the properties of the $H_i(w)$. Roughly speaking, when all $H_i(w)$'s are "similar", one has $\lambda_{\max}^{\mathcal{K}}\big(\sum_{i=1}^n H_i(w)\big) \approx \sum_{i=1}^n \lambda_{\max}^{\mathcal{K}}(H_i(w)) \approx n \max_i \lambda_{\max}^{\mathcal{K}}(H_i(w))$, and thus $\kappa(w) \approx \hat{\kappa}(w) \approx \bar{\kappa}(w)$. However, in many real applications such uniformity simply does not exist. For example, it is not hard to design a matrix $A$ with non-uniform rows such that, for $H = A^T A$, $\hat{\kappa}$ and $\bar{\kappa}$ are larger than $\kappa$ by a factor of $n$. This implies that although SSN with leverage score sampling has a quadratic dependence on $d$, its dependence on the condition number is significantly better than that of all the other methods, such as SSN (uniform) and LiSSA. Moreover, compared to Newton's method, all these stochastic variants replace the coefficient of the leading $O(nd)$ term with lower-order terms that depend only on $d$ and the condition numbers (assuming $\mathrm{nnz}(A) \approx nd$). Therefore, one should expect these algorithms to perform well when $n \gg d$ and the problem is moderately conditioned.
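The following toy computation (ours, not from the paper) illustrates this point for $k = 1$ and $Q = 0$: a single heavy row makes $\hat{\kappa}$ larger than $\kappa$ by a factor of roughly $n$, while $\kappa$ itself stays moderate. (For rank-one blocks $H_i = a_i a_i^T$ with $d > 1$, $\bar{\kappa}$ is not finite, so only $\kappa$ and $\hat{\kappa}$ are computed here.)

```python
import numpy as np

def local_condition_numbers(A):
    """kappa and kappa_hat of (6) for H_i = a_i a_i^T (k = 1, Q = 0)."""
    eigs = np.linalg.eigvalsh(A.T @ A)       # ascending eigenvalues of H = sum_i H_i
    row_sq = np.sum(A ** 2, axis=1)          # lambda_max(H_i) = ||a_i||^2
    kappa = eigs[-1] / eigs[0]
    kappa_hat = A.shape[0] * row_sq.max() / eigs[0]
    return kappa, kappa_hat

rng = np.random.default_rng(0)
n, d = 10000, 20
A = rng.standard_normal((n, d)) / np.sqrt(n)
A[0] *= np.sqrt(n)                           # one heavy row -> highly non-uniform blocks
kappa, kappa_hat = local_condition_numbers(A)
print(f"kappa={kappa:.1f}, kappa_hat={kappa_hat:.0f}, ratio={kappa_hat / kappa:.0f}, n={n}")
```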
4 Numerical Experiments

We consider an estimation problem in GLMs with a Gaussian prior. Assume $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathcal{Y}^n$ are the data matrix and response vector. The problem of minimizing the negative log-likelihood with a ridge penalty can be written as

$$\min_{w \in \mathbb{R}^d}\ \sum_{i=1}^n \psi(x_i^T w, y_i) + \lambda \|w\|_2^2,$$

where $\psi: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ is a convex cumulant generating function and $\lambda \ge 0$ is the ridge penalty parameter. In this case, the Hessian is $H(w) = \sum_{i=1}^n \psi''(x_i^T w, y_i)\, x_i x_i^T + \lambda I := X^T D^2(w) X + \lambda I$, where $x_i$ is the $i$-th column of $X^T$ and $D(w)$ is a diagonal matrix with diagonal entries $[D(w)]_{ii} = \sqrt{\psi''(x_i^T w, y_i)}$. The augmented matrix of $\{A_i(w)\}$ can be written as $A(w) = D(w) X \in \mathbb{R}^{n \times d}$, where $A_i(w) = [D(w)]_{ii}\, x_i^T$. For our numerical simulations, we consider a very popular instance of GLMs, namely logistic regression, where $\psi(u, y) = \log(1 + \exp(-uy))$ and $\mathcal{Y} = \{\pm 1\}$. Table 3 summarizes the datasets used in our experiments.

Table 3: Datasets used in ridge logistic regression. Here, $\kappa$ and $\hat{\kappa}$ are the local condition numbers of the ridge logistic regression problem with $\lambda = 0.01$, as defined in (6).

| Dataset | CT slices [9] | Forest [2] | Adult [13] | Buzz [11] |
| --- | --- | --- | --- | --- |
| $n$ | 53,500 | 581,012 | 32,561 | 59,535 |
| $d$ | 385 | 55 | 123 | 78 |
| $\kappa$ | 368 | 221 | 182 | 37 |
| $\hat{\kappa}$ | 47,078 | 322,370 | 69,359 | 384,580 |

We compare the performance of the following five algorithms: (i) Newton: the standard Newton's method; (ii) Uniform: SSN with uniform sampling; (iii) PLevSS: SSN with partial leverage score sampling; (iv) RNormSS: SSN with block (row) norm squares sampling; and (v) LBFGS-k: the standard L-BFGS method [14] with history size k. All algorithms are initialized with the zero vector.³ We also use CG to solve the sub-problem approximately, to within $10^{-6}$ relative residual error. In order to compute the relative error $\|w_t - w^*\|/\|w^*\|$, an estimate of $w^*$ is obtained by running the standard Newton's method for a sufficiently long time. Note that in SSN with partial leverage score sampling, we recompute the leverage scores every 10 iterations. Roughly speaking, these stale leverage scores can be viewed as approximate leverage scores for the current iteration, with an approximation quality that can be upper bounded by the change in the Hessian, and this quantity is often small in practice. Reusing the leverage scores therefore allows us to further drive down the running time.
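To connect this experimental setup to Algorithm 1, the sketch below (our own; the experiments instead use the fast approximation algorithm of [6] for the scores) builds the ridge logistic regression augmented matrix $A(w) = D(w)X$ and the exact block partial leverage scores for $k = 1$, $Q = \lambda I$, i.e., the leverage scores of the first $n$ rows of $[A(w);\ \sqrt{\lambda} I]$, via a thin QR factorization.

```python
import numpy as np

def logistic_augmented_matrix(X, y, w):
    """A(w) = D(w) X with [D(w)]_ii = sqrt(psi''(x_i^T w, y_i)), psi(u, y) = log(1 + exp(-u y))."""
    s = 1.0 / (1.0 + np.exp(-(X @ w) * y))           # sigma(y_i x_i^T w)
    return np.sqrt(s * (1.0 - s))[:, None] * X

def block_partial_leverage_scores(A, lam):
    """Exact scores for k = 1, Q = lam * I: leverage scores of the first n rows of [A; sqrt(lam) I]."""
    n, d = A.shape
    aug = np.vstack([A, np.sqrt(lam) * np.eye(d)])
    Q_factor, _ = np.linalg.qr(aug, mode='reduced')  # leverage score i = ||i-th row of Q_factor||^2
    return np.sum(Q_factor[:n] ** 2, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
y = rng.choice([-1.0, 1.0], size=500)
tau = block_partial_leverage_scores(logistic_augmented_matrix(X, y, np.zeros(20)), lam=0.01)
p = tau / tau.sum()                                  # partial leverage score sampling distribution
```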
We first investigate the effect of the condition number, controlled by varying $\lambda$, on the performance of the different methods; the results are depicted in Figure 1. It can be seen that in well-conditioned cases all sampling schemes work equally well. However, as the condition number worsens, the performance of uniform sampling deteriorates, while non-uniform sampling, in particular leverage score sampling, shows a great degree of robustness to such ill-conditioning. The experiments shown in Figure 1 are consistent with the theoretical results of Table 2, showing that the theory presented here can indeed be a reliable guide to practice.

³ Theoretically, the suitable initial point for all the algorithms is one from which the standard Newton's method converges with a unit step size. Here, $w_0 = 0$ happens to be one such good starting point.

[Figure 1: Ridge logistic regression on Adult with different $\lambda$'s: (a) local condition number $\kappa$, (b) sample size for each SSN method giving the best overall running time, (c) running time for the different methods (Newton, Uniform, PLevSS, RNormSS, LBFGS-50) to achieve $10^{-8}$ relative error. All panels have $\log(\lambda)$ on the horizontal axis.]

Next, we compare the performance of the various methods as measured by the relative error of the solution versus running time; the results are shown in Figure 2.⁴ It can be seen that, in most cases, SSN with non-uniform sampling schemes outperforms the other algorithms, especially Newton's method. In particular, the uniform sampling scheme performs poorly, e.g., in Figure 2(b), when the problem exhibits high non-uniformity among data points, which is reflected in the difference between $\kappa$ and $\hat{\kappa}$ shown in Table 3.

⁴ For each sub-sampled Newton method, the sampling size is chosen as the best value from $\{10d, 20d, 30d, \ldots, 100d, 200d, 300d, \ldots, 1000d\}$, in the sense of being the value for which the objective first drops to $1/3$ of the initial function value.

[Figure 2: Relative solution error $\|w - w^*\|_2/\|w^*\|_2$ versus time (s) for the various methods (Newton, Uniform, PLevSS, RNormSS, LBFGS-50, LBFGS-100) on the four datasets of Table 3 with ridge penalty parameter $\lambda = 0.01$; panel (a) is CT Slice. The values in brackets denote the sample size used for each method.]

We would like to remind the reader that, for the locally strongly convex problems considered here, one can provably show that the behavior of the error in the loss function, i.e., $(F(w_k) - F(w^*))/|F(w^*)|$, follows the same pattern as that of the solution error, i.e., $\|w_k - w^*\|/\|w^*\|$; see [23] for details. As a result, our algorithms remain effective for cases where the primary goal is to reduce the loss (as opposed to the solution error).

5 Conclusions

In this paper, we propose non-uniformly sub-sampled Newton methods with inexact updates for a class of constrained problems. We show that our algorithms have a better dependence on the condition number and enjoy a lower per-iteration complexity compared to other similar existing methods. These theoretical advantages are also demonstrated numerically.

Acknowledgments. We would like to thank the Army Research Office and the Defense Advanced Research Projects Agency, as well as Intel, Toshiba, and the Moore Foundation, for support, along with DARPA through the MEMEX (FA8750-14-2-0240), SIMPLEX (N66001-15-C-4043), and XDATA (FA8750-12-2-0335) programs, and the Office of Naval Research (N000141410102, N000141210041, and N000141310129). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, ONR, or the U.S. government.
References

[1] Naman Agarwal, Brian Bullins, and Elad Hazan. Second order stochastic optimization in linear time. arXiv preprint arXiv:1602.03943, 2016.

[2] Jock A. Blackard and Denis J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131-151, 1999.

[3] Sébastien Bubeck. Theory of convex optimization for machine learning. arXiv preprint arXiv:1405.4980, 2014.

[4] Richard H. Byrd, Gillian M. Chin, Will Neveitt, and Jorge Nocedal. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977-995, 2011.

[5] Ron S. Dembo, Stanley C. Eisenstat, and Trond Steihaug. Inexact Newton methods. SIAM Journal on Numerical Analysis, 19(2):400-408, 1982.

[6] Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, and David P. Woodruff. Fast approximation of matrix coherence and statistical leverage. The Journal of Machine Learning Research, 13(1):3475-3506, 2012.

[7] Murat A. Erdogdu and Andrea Montanari. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems, pages 3034-3042, 2015.

[8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.

[9] Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Sebastian Pölsterl, and Alexander Cavallaro. 2D image registration in CT images using radial image descriptors. In Medical Image Computing and Computer-Assisted Intervention (MICCAI 2011), pages 607-614. Springer, 2011.

[10] John T. Holodnak and Ilse C. F. Ipsen. Randomized approximation of the Gram matrix: Exact computation and probabilistic bounds. SIAM Journal on Matrix Analysis and Applications, 36(1):110-137, 2015.

[11] François Kawala, Ahlame Douzal-Chouakria, Eric Gaussier, and Eustache Dimert. Prédictions d'activité dans les réseaux sociaux en ligne [Activity predictions in online social networks]. In 4ième conférence sur les modèles et l'analyse des réseaux: Approches mathématiques et informatiques, page 16, 2013.

[12] Brian Kulis. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287-364, 2012.

[13] M. Lichman. UCI machine learning repository, 2013.

[14] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503-528, 1989.

[15] Michael W. Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning. NOW Publishers, Boston, 2011. Also available at arXiv:1104.5557v2.

[16] James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 735-742, 2010.

[17] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

[18] Mert Pilanci and Martin J. Wainwright. Newton sketch: A linear-time optimization algorithm with linear-quadratic convergence. arXiv preprint arXiv:1505.02250, 2015.

[19] Farbod Roosta-Khorasani and Michael W. Mahoney. Sub-sampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737, 2016.

[20] Farbod Roosta-Khorasani and Michael W. Mahoney. Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738, 2016.
[21] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.

[22] Oriol Vinyals and Daniel Povey. Krylov subspace descent for deep learning. arXiv preprint arXiv:1111.4259, 2011.

[23] Peng Xu, Jiyan Yang, Farbod Roosta-Khorasani, Christopher Ré, and Michael W. Mahoney. Sub-sampled Newton methods with non-uniform sampling. arXiv preprint arXiv:1607.00559, 2016.