Quasi-Newton Methods for Saddle Point Problems

Chengchang Liu, Department of Computer Science & Engineering, The Chinese University of Hong Kong, 7liuchengchang@gmail.com
Luo Luo, School of Data Science, Fudan University, luoluo@fudan.edu.cn

Abstract: This paper studies quasi-Newton methods for strongly-convex-strongly-concave saddle point problems. We propose random Broyden family updates, which have an explicit local superlinear convergence rate of $\mathcal{O}\big((1 - 1/(d\kappa^2))^{k(k-1)/2}\big)$, where $d$ is the dimension of the problem, $\kappa$ is the condition number and $k$ is the number of iterations. The design and analysis of the proposed algorithms are based on estimating the square of the indefinite Hessian matrix, which differs from classical quasi-Newton methods in convex optimization. We also present two specific Broyden family algorithms with BFGS-type and SR1-type updates, which enjoy the faster local convergence rate of $\mathcal{O}\big((1 - 1/d)^{k(k-1)/2}\big)$. Our numerical experiments show the proposed algorithms outperform classical first-order methods.

1 Introduction

In this paper, we focus on the following smooth saddle point problem

$$\min_{x \in \mathbb{R}^{d_x}} \max_{y \in \mathbb{R}^{d_y}} f(x, y), \qquad (1)$$

where $f$ is strongly-convex in $x$ and strongly-concave in $y$. We aim to find the saddle point $(x^*, y^*)$, which satisfies $f(x^*, y) \le f(x^*, y^*) \le f(x, y^*)$ for all $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$. This formulation is widely used in game theory [1, 39], AUC maximization [15, 43], fairness-aware machine learning [44], robust optimization [2, 11, 13, 36], empirical risk minimization [46] and reinforcement learning [10]. There are a great number of first-order optimization algorithms for solving problem (1), including the extragradient method [18, 38], optimistic gradient descent ascent [7, 28], the proximal point method [31] and dual extrapolation [24]. These algorithms iterate with a first-order oracle and achieve linear convergence. Lin et al.
[21], Wang and Li [40] used Catalyst acceleration to reduce the complexity for unbalanced saddle point problems, nearly matching the lower bound of first-order algorithms [26, 45] under some specific assumptions.

Compared with first-order methods, second-order methods usually enjoy superior convergence in numerical optimization. Huang et al. [16] extended the cubic regularized Newton (CRN) method [23, 24] to solve saddle point problem (1), which has quadratic local convergence. However, each iteration of CRN requires accessing the exact Hessian matrix and solving the corresponding linear systems. These steps incur $\mathcal{O}(d_x^3 + d_y^3)$ time complexity, which is too expensive for high-dimensional problems.

Quasi-Newton methods [3-5, 8, 37] are popular ways to avoid accessing the exact second-order information used in standard Newton methods. They approximate the Hessian matrix based on the Broyden family updating formulas [3], which significantly reduces the computational cost. These algorithms

*The corresponding author.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Table 1: We summarize the convergence behaviors of the proposed algorithms for solving the saddle point problem in terms of the gradient norm $\lambda_{k+k_0} \stackrel{\text{def}}{=} \|\nabla f(z_{k+k_0})\|$ after $(k + k_0)$ iterations, where $d \stackrel{\text{def}}{=} d_x + d_y$ is the dimension of the problem and $\kappa$ is the condition number. The results come from Corollary 3.18, and each upper bound holds with probability at least $1 - \delta$.

| Algorithms | Upper bound of $\lambda_{k+k_0}$ | $k_0$ |
| --- | --- | --- |
| Broyden (Alg. 5) | $\big(1 - \frac{1}{d\kappa^2 + 1}\big)^{k(k-1)/2} \big(\frac{1}{2}\big)^k \big(1 - \frac{1}{4\kappa^2}\big)^{k_0}$ | $\mathcal{O}\big(d\kappa^2 \ln(d\kappa)\big)$ |
| BFGS/SR1 (Alg. 6 and 7) | $\big(1 - \frac{1}{d + 1}\big)^{k(k-1)/2} \big(\frac{1}{2}\big)^k \big(1 - \frac{1}{4\kappa^2}\big)^{k_0}$ | $\mathcal{O}\big((d + \kappa^2) \ln(d\kappa)\big)$ |

are well studied for convex optimization. The famous quasi-Newton methods, including the Davidon-Fletcher-Powell (DFP) method [8, 12], the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [4, 5, 37] and the symmetric rank-1 (SR1) method [3, 8], enjoy local superlinear convergence [6, 9, 29] when the objective function is strongly-convex.
Recently, Rodomanov and Nesterov [32-34] proposed greedy and random variants of quasi-Newton methods, which are the first to achieve non-asymptotic superlinear convergence. Later, Lin et al. [20] established a better convergence rate which is condition-number-free. Jin and Mokhtari [17] and Ye et al. [42] showed that non-asymptotic superlinear convergence rates also hold for the classical DFP, BFGS and SR1 methods.

In this paper, we study quasi-Newton methods for saddle point problem (1). Note that when the Hessian matrix of the objective function is indefinite, the existing Broyden family update formulas and their convergence analysis cannot be applied directly. To overcome this issue, we propose a variant framework of random quasi-Newton methods for saddle point problems, which approximates the square of the Hessian matrix during the iteration. Our theoretical analysis characterizes the convergence rate by the gradient norm, rather than the weighted norm of the gradient used in convex optimization [17, 20, 32-34, 42]. We summarize the theoretical results for the proposed algorithms in Table 1. The local convergence behaviors of all the algorithms have two periods. The first period has $k_0$ iterations with a linear convergence rate $\mathcal{O}\big((1 - 1/\kappa^2)^{k_0}\big)$. The second one enjoys superlinear convergence, that is:

- For general Broyden family methods, we have $\mathcal{O}\big((1 - 1/(d\kappa^2 + 1))^{k(k-1)/2}\big)$.
- For the BFGS method and the SR1 method, we have $\mathcal{O}\big((1 - 1/(d + 1))^{k(k-1)/2}\big)$, which is condition-number-free.

Paper Organization. In Section 2, we state the notation and preliminaries of this paper. In Section 3, we first propose random quasi-Newton methods for the quadratic saddle point problem, which enjoy local superlinear convergence; then we extend them to solve general strongly-convex-strongly-concave saddle point problems. In Section 5, we provide numerical experiments to validate our algorithms on popular machine learning models. All proofs, experiment details and extensions are deferred to the appendix.
2 Notation and Preliminaries

We use $\|\cdot\|$ to denote the spectral norm of a matrix and the Euclidean norm of a vector, respectively. We denote the standard basis of $\mathbb{R}^d$ by $\{e_1, \dots, e_d\}$ and let $I$ be the identity matrix. The trace of a square matrix is denoted by $\mathrm{tr}(\cdot)$. Given two positive definite matrices $G$ and $H$, we define their inner product as $\langle G, H \rangle \stackrel{\text{def}}{=} \mathrm{tr}(GH)$. We introduce the quantity

$$\sigma_H(G) \stackrel{\text{def}}{=} \langle H^{-1}, G - H \rangle = \langle H^{-1}, G \rangle - d,$$

which is used to measure how well the matrix $G$ approximates the matrix $H$. If we further suppose $G \succeq H$, then according to Rodomanov and Nesterov [32] it holds that

$$G - H \preceq \langle H^{-1}, G - H \rangle H = \sigma_H(G)\, H.$$

Using the notation of problem (1), we let $z = [x; y] \in \mathbb{R}^d$ where $d \stackrel{\text{def}}{=} d_x + d_y$, and denote the gradient and Hessian matrix of $f$ at $(x, y)$ by $g(z) \in \mathbb{R}^d$ and $\hat{H}(z) \in \mathbb{R}^{d \times d}$. We suppose the saddle point problem (1) satisfies the following assumptions.

Assumption 2.1. The objective function $f(x, y)$ is twice differentiable and has $L$-Lipschitz continuous gradient and $L_2$-Lipschitz continuous Hessian, i.e., there exist constants $L > 0$ and $L_2 > 0$ such that for any $z = [x; y]$, $z' = [x'; y'] \in \mathbb{R}^d$, we have $\|g(z) - g(z')\| \le L \|z - z'\|$ and $\|\hat{H}(z) - \hat{H}(z')\| \le L_2 \|z - z'\|$.

Assumption 2.2. The objective function $f(x, y)$ is twice differentiable, $\mu$-strongly-convex in $x$ and $\mu$-strongly-concave in $y$, i.e., there exists a constant $\mu > 0$ such that $\nabla^2_{xx} f(x, y) \succeq \mu I$ and $\nabla^2_{yy} f(x, y) \preceq -\mu I$ for any $(x, y) \in \mathbb{R}^{d_x} \times \mathbb{R}^{d_y}$.

The $L$-Lipschitz continuity of the gradient means the spectral norm of the Hessian matrix $\hat{H}(z)$ can be upper bounded, that is, $\|\hat{H}(z)\| \le L$. Additionally, the condition number of the objective function is defined as $\kappa \stackrel{\text{def}}{=} L/\mu$.

3 Quasi-Newton Methods for Saddle Point Problems

In this section, we focus on designing quasi-Newton methods for the saddle point problem and showing their superlinear local convergence rates. The update rule of the standard Newton's method for solving problem (1) can be written as $z_+ = z - \hat{H}(z)^{-1} g(z)$. It has quadratic local convergence, but takes $\mathcal{O}(d^3)$ time complexity per iteration.
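As a quick illustration of this update (a minimal sketch of ours, not from the paper; the matrix `A` and all names below are invented for the example), Newton's method applied to a quadratic saddle function $f(x, y) = \frac{1}{2} z^\top A z - b^\top z$, whose Hessian $\hat{H}(z) = A$ is indefinite, reaches the saddle point in a single step:

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy = 3, 2

# Symmetric indefinite "Hessian" with A_xx positive definite and A_yy negative
# definite, i.e. a strongly-convex-strongly-concave quadratic saddle problem.
C = rng.standard_normal((dx, dy))
A = np.block([[2.0 * np.eye(dx), C],
              [C.T, -2.0 * np.eye(dy)]])
b = rng.standard_normal(dx + dy)

grad = lambda z: A @ z - b                 # g(z) = Az - b

z0 = rng.standard_normal(dx + dy)
z1 = z0 - np.linalg.solve(A, grad(z0))     # z_+ = z - H_hat(z)^{-1} g(z)

print(np.linalg.norm(grad(z1)))            # essentially zero: one-step convergence
```

For non-quadratic objectives the same step only converges quadratically in a neighborhood of the saddle point, and forming and inverting the exact Hessian is what the quasi-Newton methods below avoid.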
For convex minimization, quasi-Newton methods including BFGS/SR1 [3-5, 8, 37] and their variants [20, 32, 33, 42] approximate the Hessian, which reduces the computational cost to $\mathcal{O}(d^2)$ per round. However, all these algorithms and the related convergence analysis are based on the assumption that the Hessian matrix is positive definite, which is not suitable for our saddle point problems since $\hat{H}(z)$ is indefinite.

We introduce the auxiliary matrix $H(z)$ as the square of the Hessian: $H(z) \stackrel{\text{def}}{=} \hat{H}(z)^2$. The following lemma shows that $H(z)$ is positive definite.

Lemma 3.1. Under Assumptions 2.1 and 2.2, we have $\mu^2 I \preceq H(z) \preceq L^2 I$ for all $z \in \mathbb{R}^d$.

Hence, we can reformulate the update of Newton's method as

$$z_+ = z - \big[(\hat{H}(z))^2\big]^{-1} \hat{H}(z)\, g(z) = z - H(z)^{-1} \hat{H}(z)\, g(z). \qquad (2)$$

Then it is natural to characterize the second-order information by estimating the auxiliary matrix $H(z)$, rather than the indefinite Hessian $\hat{H}(z)$. If one can obtain a symmetric positive definite matrix $G \in \mathbb{R}^{d \times d}$ as an estimator of $H(z)$, the update rule (2) can be approximated by

$$z_+ = z - G^{-1} \hat{H}(z)\, g(z). \qquad (3)$$

The remainder of this section introduces strategies to construct $G$, resulting in quasi-Newton methods for the saddle point problem with local superlinear convergence. We point out that the implementation of iteration (3) does not require constructing the Hessian matrix $\hat{H}(z)$ explicitly, since we are only interested in the Hessian-vector product $\hat{H}(z) g(z)$, which can be computed efficiently [27, 35].

3.1 The Broyden Family Updates

We first review some basic results for quasi-Newton methods in convex optimization. We introduce the Broyden family [25, Section 6.3] of quasi-Newton updates for approximating a positive definite matrix $H \in \mathbb{R}^{d \times d}$ by using the information of the current estimator $G \in \mathbb{R}^{d \times d}$.

Definition 3.2. Suppose two positive definite matrices $H \in \mathbb{R}^{d \times d}$ and $G \in \mathbb{R}^{d \times d}$ satisfy $H \preceq G$. For any $u \in \mathbb{R}^d$, if $Gu = Hu$, we define $\mathrm{Broyd}_\tau(G, H, u) \stackrel{\text{def}}{=} G$.
Otherwise, we define

$$\mathrm{Broyd}_\tau(G, H, u) \stackrel{\text{def}}{=} (1 - \tau) \left[ G - \frac{(G - H) u u^\top (G - H)}{u^\top (G - H) u} \right] + \tau \left[ G - \frac{H u u^\top G + G u u^\top H}{u^\top H u} + \left( \frac{u^\top G u}{u^\top H u} + 1 \right) \frac{H u u^\top H}{u^\top H u} \right]. \qquad (4)$$

Different choices of the parameter $\tau$ in the above formula recover several popular quasi-Newton updates:

- For $\tau = u^\top H u / u^\top G u \in [0, 1]$, it corresponds to the BFGS update

$$\mathrm{BFGS}(G, H, u) \stackrel{\text{def}}{=} G - \frac{G u u^\top G}{u^\top G u} + \frac{H u u^\top H}{u^\top H u}. \qquad (5)$$

- For $\tau = 0$, it corresponds to the SR1 update

$$\mathrm{SR1}(G, H, u) \stackrel{\text{def}}{=} G - \frac{(G - H) u u^\top (G - H)}{u^\top (G - H) u}. \qquad (6)$$

Now we introduce the update rule [20, 32] that chooses $u$ as

$$u \sim \mathcal{N}(0, I) \quad \text{or} \quad u \sim \mathrm{Unif}\big(\mathbb{S}^{d-1}\big). \qquad (7)$$

The following lemma shows that applying the update rule (4) with (7) leads to a new estimator with a tighter error bound in the measure $\sigma_H(\cdot)$.

Lemma 3.3 (Modified from Lin et al. [20, Theorem 3.1]). Suppose two positive definite matrices $\mu^2 I \preceq H \preceq L^2 I$ and $G \in \mathbb{R}^{d \times d}$ satisfy $H \preceq G$. Let $G_+ = \mathrm{Broyd}_\tau(G, H, u)$, where $u$ is chosen by the random method (7). Then for any $\tau \in [0, 1]$, we have

$$\mathbb{E}\left[\sigma_H(G_+)\right] \le \big(1 - 1/(d\kappa^2)\big)\, \sigma_H(G).$$

For the specific Broyden family updates BFGS and SR1 shown in (5) and (6), the update can achieve a better convergence result. Concretely, for the BFGS method, we first find $L$ such that $G^{-1} = L^\top L$, where $L$ is an upper triangular matrix. This step can be implemented with $\mathcal{O}(d^2)$ complexity [20, Proposition 1]. We present the subroutine for factorizing $G^{-1}$ in Algorithm 1 and give its detailed implementation in the appendix. We then use the direction $L^\top u$ instead of $u$ for the BFGS update. Applying the BFGS update rule (5) with formula (7), we obtain a condition-number-free result as follows.

Lemma 3.4 (Modified from Lin et al. [20, Theorem 4.2]). Suppose two positive definite matrices $H \in \mathbb{R}^{d \times d}$ and $G \in \mathbb{R}^{d \times d}$ satisfy $H \preceq G$. Let $G_+ = \mathrm{BFGS}(G, H, L^\top u)$, where $u$ is chosen by the random method in (7) and $L$ is an upper triangular matrix such that $G^{-1} = L^\top L$. Then we have

$$\mathbb{E}\left[\sigma_H(G_+)\right] \le (1 - 1/d)\, \sigma_H(G).$$

Remark 3.5. Note that the step of conducting $G^{-1} = L^\top L$ requires a QR decomposition after a rank-one change, which takes about $26 d^2$ flops [14, Section 12.5.1].
We do not recommend using this BFGS update strategy in practice when $d$ is large.

The convergence of the SR1 update can be characterized by the measure $\tau_H(G) \stackrel{\text{def}}{=} \mathrm{tr}(G - H)$. Applying the SR1 update rule (6) with formula (7), the convergence also enjoys a condition-number-free result.

Lemma 3.6 (Modified from Lin et al. [20, Theorem 4.1]). Suppose two positive definite matrices $H \in \mathbb{R}^{d \times d}$ and $G \in \mathbb{R}^{d \times d}$ satisfy $H \preceq G$. Let $G_+ = \mathrm{SR1}(G, H, u)$, where $u$ is chosen by the random method in (7). Then we have

$$\mathbb{E}\left[\tau_H(G_+)\right] \le (1 - 1/d)\, \tau_H(G).$$

3.2 Algorithms for Quadratic Saddle Point Problems

We now solve the simple quadratic saddle point problem, where $f(x, y) = \frac{1}{2} z^\top A z - b^\top z$ in (1) with $z = [x; y]$, $b \in \mathbb{R}^d$, $A \in \mathbb{R}^{d \times d}$ symmetric and $d = d_x + d_y$. We suppose $A$ can be partitioned as

$$A = \begin{bmatrix} A_{xx} & A_{xy} \\ A_{yx} & A_{yy} \end{bmatrix},$$

where the sub-matrices $A_{xx} \in \mathbb{R}^{d_x \times d_x}$, $A_{xy} \in \mathbb{R}^{d_x \times d_y}$, $A_{yx} \in \mathbb{R}^{d_y \times d_x}$ and $A_{yy} \in \mathbb{R}^{d_y \times d_y}$ satisfy $A_{xx} \succeq \mu I$, $A_{yy} \preceq -\mu I$ and $\|A\| \le L$. Using the notation introduced in Section 2, we have $z = [x; y]$, $g(z) = Az - b$, $\hat{H} \stackrel{\text{def}}{=} \hat{H}(z) = A$ and $H \stackrel{\text{def}}{=} H(z) = A^2$.

Algorithm 1 Fast-Chol$(H, L, u)$
1: Input: positive definite matrix $H \in \mathbb{R}^{d \times d}$, upper triangular matrix $L \in \mathbb{R}^{d \times d}$, direction $u \in \mathbb{R}^d$
2: Use a QR decomposition to obtain $[Q, R] = \mathrm{QR}\big(L (I - H u u^\top / (u^\top H u))\big)$
3: Calculate $v = u / \sqrt{u^\top H u}$ and $[Q', R'] = \mathrm{QR}([v\; R])$
4: Output: $\hat{L} = R'$

Algorithm 2 Random-Broyden-Quadratic
1: Input: $z_0 \in \mathbb{R}^d$, $G_0 = L^2 I$ and $\tau_k \in [0, 1]$
2: for $k = 0, 1, \dots$
3: $z_{k+1} = z_k - G_k^{-1} \hat{H} g(z_k)$
4: Randomly choose $u_k$ from (7) and update $G_{k+1} = \mathrm{Broyd}_{\tau_k}(G_k, H, u_k)$

We present the detailed procedures of the random quasi-Newton methods for the quadratic saddle point problem using the Broyden family update, the BFGS update and the SR1 update in Algorithms 2, 3 and 4, respectively. We define $\lambda_k$ as the gradient norm at $z_k$ for our convergence analysis, that is, $\lambda_k \stackrel{\text{def}}{=} \|g(z_k)\|$. The definition of $\lambda_k$ in this paper is different from the measure used in convex optimization [20, 32],² but it satisfies a similar property as follows.

Lemma 3.7.
Assume we have $\eta_k \ge 1$ and $G_k \in \mathbb{R}^{d \times d}$ such that $H \preceq G_k \preceq \eta_k H$ in Algorithms 2-4; then we have $\lambda_{k+1} \le (1 - 1/\eta_k)\, \lambda_k$.

The next theorem states that the assumption of Lemma 3.7 always holds with $\eta_k = \kappa^2$, which means $\lambda_k$ converges to $0$ linearly.

Theorem 3.8. For all $k \ge 0$, Algorithms 2, 3 and 4 satisfy $\lambda_k \le \big(1 - \frac{1}{\kappa^2}\big)^k \lambda_0$ and $H \preceq G_k \preceq \kappa^2 H$.

Lemma 3.7 also implies that superlinear convergence can be obtained if there exists $\eta_k$ converging to $1$. Applying Lemmas 3.3, 3.4 and 3.6, we can show this holds for the proposed algorithms.

Theorem 3.9. Solving the quadratic saddle point problem by the proposed quasi-Newton algorithms, for all $k \ge 0$, we have:
1. For the Broyden family method (Algorithm 2): $\mathbb{E}\left[\lambda_{k+1}/\lambda_k\right] \le \big(1 - 1/(d\kappa^2)\big)^k d\kappa^2$.
2. For the BFGS method (Algorithm 3): $\mathbb{E}\left[\lambda_{k+1}/\lambda_k\right] \le (1 - 1/d)^k d\kappa^2$.
3. For the SR1 method (Algorithm 4): $\mathbb{E}\left[\lambda_{k+1}/\lambda_k\right] \le (1 - k/d)\, d\kappa^4$.

Combining the results of Theorems 3.8 and 3.9, we obtain a two-stage convergence behavior: the algorithms have global linear convergence and local superlinear convergence. We leave the formal description to the appendix.

3.3 Algorithms for General Saddle Point Problems

In this section, we consider the general saddle point problem where $f(x, y)$ in (1) satisfies Assumptions 2.1 and 2.2. We propose quasi-Newton methods that solve the problem with local superlinear convergence and $\mathcal{O}(d^2)$ time complexity per iteration.

²In later sections, we will see the measure $\lambda_k \stackrel{\text{def}}{=} \|g_k\|$ is suitable for the convergence analysis of quasi-Newton methods for saddle point problems.

Algorithm 3 Random-BFGS-Quadratic
1: Input: $z_0 \in \mathbb{R}^d$, $G_0 = L^2 I$, $L_0 = L^{-1} I$
2: for $k = 0, 1, \dots$
3: $z_{k+1} = z_k - G_k^{-1} \hat{H} g(z_k)$
4: Randomly choose $u_k$ from (7)
5: $\tilde{u}_k = L_k^\top u_k$
6: $G_{k+1} = \mathrm{BFGS}(G_k, H, \tilde{u}_k)$
7: $L_{k+1} = \text{Fast-Chol}(H, L_k, \tilde{u}_k)$

Algorithm 4 Random-SR1-Quadratic
1: Input: $z_0 \in \mathbb{R}^d$, $G_0 = L^2 I$, $K \le d + 1$
2: for $k = 0, 1, \dots, K$
3: $z_{k+1} = z_k - G_k^{-1} \hat{H} g(z_k)$
4: Randomly choose $u_k$ from (7)
5: $G_{k+1} = \mathrm{SR1}(G_k, H, u_k)$

3.3.1 Algorithms

The key idea in designing quasi-Newton methods for saddle point problems is to approximate the auxiliary matrix $H(z) \stackrel{\text{def}}{=} \hat{H}(z)^2$ to characterize the second-order information. Since the Hessian of $f$ is Lipschitz continuous and bounded by Assumptions 2.1 and 2.2, the auxiliary matrix operator $H(z)$ is also Lipschitz continuous.

Lemma 3.10. Under Assumptions 2.1 and 2.2, $H(z)$ is $2LL_2$-Lipschitz continuous.

Combining Lemmas 3.1 and 3.10, we obtain the following properties of $H(z)$, which are analogous to strong self-concordance in convex optimization [32].

Lemma 3.11. Under Assumptions 2.1 and 2.2, for all $z, z', w \in \mathbb{R}^d$, the auxiliary matrix operator $H(\cdot)$ satisfies

$$H(z) - H(z') \preceq M \|z - z'\|\, H(w), \quad \text{where } M = 2\kappa^2 L_2 / L.$$

Corollary 3.12. Let $z, z' \in \mathbb{R}^d$ and $r = \|z - z'\|$. Suppose the objective function $f$ satisfies Assumptions 2.1 and 2.2; then for all $z, z' \in \mathbb{R}^d$ and $M = 2\kappa^2 L_2 / L$, the auxiliary matrix operator $H(\cdot)$ satisfies

$$\frac{H(z)}{1 + Mr} \preceq H(z') \preceq (1 + Mr)\, H(z).$$

Different from the quadratic case, the auxiliary matrix $H(z)$ is not fixed for the general saddle point problem. Based on the smoothness of $H(z)$, we apply Corollary 3.12 to obtain Lemma 3.13 as follows.

Lemma 3.13. Let $z \in \mathbb{R}^d$ and let $G \in \mathbb{R}^{d \times d}$ be a positive definite matrix such that $H(z) \preceq G \preceq \eta H(z)$ for some $\eta \ge 1$. In addition, take $z_+ \in \mathbb{R}^d$ and let $r = \|z_+ - z\|$; then for all $u \in \mathbb{R}^d$, $\tau \in [0, 1]$ and $M = 2\kappa^2 L_2 / L$, we have

$$H(z_+) \preceq \mathrm{Broyd}_\tau(\tilde{G}, H(z_+), u) \preceq (1 + Mr)^2 \eta\, H(z_+), \quad \text{where } \tilde{G} \stackrel{\text{def}}{=} (1 + Mr)\, G \succeq H(z_+).$$

Lemma 3.13 implies it is reasonable to design algorithms using $G_{k+1} = \mathrm{Broyd}_{\tau_k}(\tilde{G}_k, H_{k+1}, u_k)$ with $\tilde{G}_k = (1 + M r_k) G_k$ and $r_k = \|z_{k+1} - z_k\|$. Similarly, we can also obtain $G_{k+1}$ from such $\tilde{G}_k$ for the specific BFGS and SR1 updates. Combining this with iteration (3), we propose several quasi-Newton methods for general strongly-convex-strongly-concave saddle point problems. The details are shown in Algorithms 5, 6 and 7 for the Broyden family, BFGS and SR1 updates, respectively.
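Before the convergence analysis, the quadratic case can be sketched end-to-end in a few lines (our own illustrative code, not the authors' implementation; all names are invented): the function `broyd` below instantiates the Broyden family formula (4), and the loop runs Algorithm 2 with the SR1 choice $\tau_k = 0$, approximating $H = A^2$ starting from $G_0 = L^2 I$.

```python
import numpy as np

def broyd(G, H, u, tau):
    """Broyden family update Broyd_tau(G, H, u) of formula (4):
    tau = 0 gives SR1, tau = u'Hu/u'Gu gives BFGS, tau = 1 gives DFP."""
    Gu, Hu = G @ u, H @ u
    v = Gu - Hu
    uv = u @ v                                   # u'(G - H)u, >= 0 when G >= H
    if uv <= 1e-12 * max(1.0, u @ Gu):           # Gu ~ Hu: nothing to update
        return G.copy()
    sr1 = G - np.outer(v, v) / uv                # SR1 term of (4)
    uGu, uHu = u @ Gu, u @ Hu
    dfp = (G - (np.outer(Hu, Gu) + np.outer(Gu, Hu)) / uHu
             + (uGu / uHu + 1.0) * np.outer(Hu, Hu) / uHu)   # DFP term of (4)
    return (1.0 - tau) * sr1 + tau * dfp

rng = np.random.default_rng(1)
dx = dy = 2
C = rng.standard_normal((dx, dy))
A = np.block([[3.0 * np.eye(dx), C],             # quadratic saddle: f = z'Az/2 - b'z
              [C.T, -3.0 * np.eye(dy)]])
b = rng.standard_normal(dx + dy)
L = np.linalg.norm(A, 2)                         # spectral norm, so ||A|| <= L
H = A @ A                                        # auxiliary matrix H = A^2 (PD)

z = rng.standard_normal(dx + dy)
G = L**2 * np.eye(dx + dy)                       # G_0 = L^2 I, hence H <= G_0
for _ in range(300):
    z = z - np.linalg.solve(G, A @ (A @ z - b))  # z_{k+1} = z_k - G_k^{-1} H_hat g(z_k)
    u = rng.standard_normal(dx + dy)             # u ~ N(0, I), rule (7)
    G = broyd(G, H, u, tau=0.0)                  # SR1-type update of G

print(np.linalg.norm(A @ z - b))                 # gradient norm after 300 iterations
```

A quick consistency check on the reconstruction of (4): every member of the family satisfies the secant-type condition $G_+ u = Hu$, and $\tau = u^\top H u / u^\top G u$ reproduces the BFGS formula (5) exactly.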
3.3.2 Convergence Analysis

Let us consider the convergence guarantees for the algorithms proposed in Section 3.3.1. We introduce the following notation to simplify the presentation. We let $\{z_k\}$ be the sequence generated by Algorithm 5, 6 or 7 and denote $g_k \stackrel{\text{def}}{=} g(z_k)$, $r_k \stackrel{\text{def}}{=} \|z_{k+1} - z_k\|$, $\hat{H}_k \stackrel{\text{def}}{=} \hat{H}(z_k)$ and $H_k \stackrel{\text{def}}{=} \hat{H}(z_k)^2$. We still use the gradient norm $\lambda_k \stackrel{\text{def}}{=} \|\nabla f(z_k)\|$ for the analysis and establish the relationship between $\lambda_k$ and $\lambda_{k+1}$, which is shown in Lemma 3.14.

Algorithm 5 Random-Broyden-General
1: Input: $z_0 \in \mathbb{R}^d$, $G_0 \succeq H_0$, $\tau_k \in [0, 1]$ and $M \ge 0$
2: for $k = 0, 1, \dots$
3: $z_{k+1} = z_k - G_k^{-1} \hat{H}_k g_k$
4: Compute $r_k = \|z_{k+1} - z_k\|$ and $\tilde{G}_k = (1 + M r_k) G_k$
5: Randomly choose $u_k$ from (7) and update $G_{k+1} = \mathrm{Broyd}_{\tau_k}(\tilde{G}_k, H_{k+1}, u_k)$

Algorithm 6 Random-BFGS-General
1: Input: $z_0 \in \mathbb{R}^d$, $G_0 \succeq H_0$, $M \ge 0$
2: $L_0 = G_0^{-1/2}$
3: for $k = 0, 1, \dots$
4: $z_{k+1} = z_k - G_k^{-1} \hat{H}_k g_k$
5: $r_k = \|z_{k+1} - z_k\|$
6: $\tilde{G}_k = (1 + M r_k) G_k$
7: $\tilde{L}_k = L_k / \sqrt{1 + M r_k}$
8: Randomly choose $u_k$ from (7)
9: $G_{k+1} = \mathrm{BFGS}(\tilde{G}_k, H_{k+1}, \tilde{L}_k^\top u_k)$
10: $L_{k+1} = \text{Fast-Chol}(H_{k+1}, \tilde{L}_k, \tilde{L}_k^\top u_k)$
11: end for

Algorithm 7 Random-SR1-General
1: Input: $z_0 \in \mathbb{R}^d$, $G_0 \succeq H_0$ and $M \ge 0$
2: for $k = 0, 1, \dots$
3: $z_{k+1} = z_k - G_k^{-1} \hat{H}_k g_k$
4: $r_k = \|z_{k+1} - z_k\|$
5: $\tilde{G}_k = (1 + M r_k) G_k$
6: Randomly choose $u_k$ from (7)
7: $G_{k+1} = \mathrm{SR1}(\tilde{G}_k, H_{k+1}, u_k)$

Lemma 3.14. Using Algorithm 5, 6 or 7, suppose we have $H_k \preceq G_k \preceq \eta_k H_k$ for some $\eta_k \ge 1$, and let $\beta = L_2 / (2\mu^2)$; then we have

$$\lambda_{k+1} \le \left(1 - \frac{1}{\eta_k}\right) \lambda_k + \beta \lambda_k^2 \quad \text{and} \quad r_k \le \frac{\lambda_k}{\mu}.$$

Rodomanov and Nesterov [32, Lemma 4.3] derive a result similar to Lemma 3.14 for minimizing a strongly-convex function $\hat{f}(\cdot)$ with the different measure $\lambda_{\hat{f}}(\cdot) \stackrel{\text{def}}{=} \big\langle \nabla \hat{f}(\cdot), \nabla^2 \hat{f}(\cdot)^{-1} \nabla \hat{f}(\cdot) \big\rangle^{1/2}$.³ Note that our algorithms are based on the iteration rule $z_{k+1} = z_k - G_k^{-1} \hat{H}_k g_k$. Compared with quasi-Newton methods for convex optimization, there exists an additional term $\hat{H}_k$ between $G_k^{-1}$ and $g_k$, which makes a convergence analysis based on $\lambda_{\hat{f}}(z_k)$ difficult. Fortunately, we find that using the gradient norm $\lambda_k$ directly makes the analysis achievable. For further analysis, we also denote $\sigma_k \stackrel{\text{def}}{=} \sigma_{H_k}(G_k) = \langle H_k^{-1}, G_k \rangle - d$.
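To make the general scheme concrete, here is a minimal sketch of Algorithm 7 (Random-SR1-General) on a toy non-quadratic saddle function (the function, the constants `Lc` and `M`, and all names below are our own example choices, not from the paper; in particular `M` is set heuristically rather than to $2\kappa^2 L_2/L$):

```python
import numpy as np

# Toy problem: f(x, y) = x^2/2 + 0.01*log(cosh x) + 2xy - y^2/2,
# strongly convex in x, strongly concave in y, with the saddle point at the origin.
def grad(z):
    x, y = z
    return np.array([x + 0.01 * np.tanh(x) + 2.0 * y, 2.0 * x - y])

def hess(z):                                     # indefinite Hessian H_hat(z)
    x, _ = z
    return np.array([[1.0 + 0.01 / np.cosh(x) ** 2, 2.0],
                     [2.0, -1.0]])

rng = np.random.default_rng(2)
z = np.array([0.05, -0.05])                      # start close to the saddle point
Lc = 4.0                                         # crude upper bound on ||H_hat(z)||
M = 1.0                                          # heuristic correction constant
G = Lc**2 * np.eye(2)                            # G_0 = L^2 I >= H_0

for _ in range(400):
    z_new = z - np.linalg.solve(G, hess(z) @ grad(z))   # step (3)
    r = np.linalg.norm(z_new - z)
    G = (1.0 + M * r) * G                        # correction step: (1 + M r_k) G_k
    Hk1 = hess(z_new)
    Hk1 = Hk1 @ Hk1                              # H_{k+1} = H_hat(z_{k+1})^2
    u = rng.standard_normal(2)                   # u ~ N(0, I), rule (7)
    v = G @ u - Hk1 @ u
    uv = u @ v
    if uv > 1e-12 * max(1.0, u @ (G @ u)):       # SR1 update of formula (6)
        G = G - np.outer(v, v) / uv
    z = z_new

print(np.linalg.norm(grad(z)))                   # gradient norm after 400 iterations
```

The $(1 + M r_k)$ inflation is what keeps $\tilde{G}_k \succeq H_{k+1}$ when the Hessian moves between iterates; without it, the SR1 denominator could become negative and the update would no longer preserve the upper-estimate property.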
Then we establish the linear convergence for the first period of iterations, which can be viewed as an extension of Theorem 3.8. Note that the following result holds for Algorithms 5, 6 and 7, and it does not depend on the choice of $u_k$.

Theorem 3.15. Using Algorithm 5, 6 or 7 with $G_0 = L^2 I$ and $M = 2\kappa^2 L_2 / L$, suppose the initial point $z_0$ is sufficiently close to $z^*$ such that $M\lambda_0/\mu \le \ln b / (4 b \kappa^2)$ with $1 < b < 5$; then for all $k \ge 0$ we have

$$\lambda_k \le \left(1 - \frac{1}{2 b \kappa^2}\right)^k \lambda_0 \quad \text{and} \quad H_k \preceq G_k \preceq \exp\Big(2 \sum_{i=0}^{k-1} \rho_i\Big) \kappa^2 H_k \preceq b \kappa^2 H_k,$$

where $\rho_k \stackrel{\text{def}}{=} M \lambda_k / \mu$.

We now analyze how $\sigma_k$ changes after one iteration to show the local superlinear convergence of the Broyden family method (Algorithm 5) and the BFGS method (Algorithm 6). Recall that $\sigma_k$ is defined to measure how well the matrix $G_k$ approximates $H_k$.

Lemma 3.16. Solving the general strongly-convex-strongly-concave saddle point problem (1) under Assumptions 2.1 and 2.2 by our quasi-Newton algorithms, and conditioning on the sequence $G_0, \dots, G_k$ generated by Algorithm 5 or 6, we have the following results for all $k$:

1. The random Broyden family method (Algorithm 5) satisfies
$$\mathbb{E}\left[\sigma_{k+1}\right] \le \big(1 - 1/(d\kappa^2)\big) (1 + M r_k)^2 \big(\sigma_k + 2d\, M r_k / (1 + M r_k)\big).$$
2. The random BFGS method (Algorithm 6) satisfies
$$\mathbb{E}\left[\sigma_{k+1}\right] \le (1 - 1/d) (1 + M r_k)^2 \big(\sigma_k + 2d\, M r_k / (1 + M r_k)\big).$$

The analysis for the SR1 method is based on constructing $\eta_k$ such that $\eta_k = \mathrm{tr}(G_k - H_k)/\mathrm{tr}(H_k)$, and the technical details are shown in the appendix.

³The original notation of Rodomanov and Nesterov [32] is for minimizing a strongly-convex function $f(\cdot)$; to avoid ambiguity, we use the notations $\hat{f}(\cdot)$ and $\lambda_{\hat{f}}$ to describe their work in this paper.

Based on Lemma 3.16, one can show that our algorithms enjoy local superlinear convergence for general saddle point problems.

Theorem 3.17. Solving the general saddle point problem (1) under Assumptions 2.1 and 2.2 by the proposed random quasi-Newton methods (Algorithms 5, 6 and 7) with $G_0 = L^2 I$ and $M = 2\kappa^2 L_2 / L$, we have the following results for all $k > 0$:

1.
For the random Broyden family method (Algorithm 5), if $\lambda_0$ satisfies $\frac{M \lambda_0}{\mu} \le \frac{\ln 2}{8(1 + 2d)\kappa^2}$, we have $\mathbb{E}\left[\lambda_{k+1}/\lambda_k\right] \le \big(1 - 1/(d\kappa^2)\big)^k\, 2 d \kappa^2$.

2. For the random BFGS method (Algorithm 6), if $\lambda_0$ satisfies $\frac{M \lambda_0}{\mu} \le \frac{\ln 2}{8(1 + d)\kappa^2}$, we have $\mathbb{E}\left[\lambda_{k+1}/\lambda_k\right] \le (1 - 1/d)^k\, 2 d \kappa^2$.

3. For the random SR1 method (Algorithm 7), if $\lambda_0$ satisfies $\frac{M \lambda_0}{\mu} \le \frac{\ln 2}{8(1 + 2 d \kappa^2)\kappa^2}$, we have $\mathbb{E}\left[\lambda_{k+1}/\lambda_k\right] \le (1 - 1/d)^k\, 2 d \kappa^4$.

Finally, combining the results of Theorems 3.15 and 3.17, we can prove that the algorithms achieve the two-stage convergence behavior as follows.

Corollary 3.18. Solving the general saddle point problem (1) under Assumptions 2.1 and 2.2 by our random quasi-Newton methods (Algorithms 5, 6 and 7) with $G_0 = L^2 I$ and $M = 2\kappa^2 L_2 / L$, if the initial point is sufficiently close to the saddle point such that $M \lambda_0 / \mu \le \ln 2 / (8\kappa^2)$, then with probability $1 - \delta$ for any $\delta \in (0, 1)$, we have the following results:

1. The random Broyden family method (Algorithm 5) satisfies
$$\lambda_{k_0 + k} \le \left(1 - \frac{1}{d\kappa^2 + 1}\right)^{k(k-1)/2} \left(\frac{1}{2}\right)^k \left(1 - \frac{1}{4\kappa^2}\right)^{k_0} \lambda_0$$
for all $k_0 = \mathcal{O}\big(d\kappa^2 \ln(d\kappa/\delta)\big)$ and $k \ge 0$.

2. The random BFGS/SR1 methods (Algorithms 6 and 7) satisfy
$$\lambda_{k_0 + k} \le \left(1 - \frac{1}{d + 1}\right)^{k(k-1)/2} \left(\frac{1}{2}\right)^k \left(1 - \frac{1}{4\kappa^2}\right)^{k_0} \lambda_0$$
for all $k_0 = \mathcal{O}\big(\max\{d, \kappa^2\} \ln(d\kappa/\delta)\big)$ and $k \ge 0$.

4 Discussion

The relationship between our quasi-Newton methods (Algorithms 5, 6 and 7) and existing first-order methods for minimax problems is similar to the one for minimization problems. For our BFGS and SR1 methods, the dependency on $\kappa^2$ only appears in the first period of linear convergence, which matches the convergence rate of gradient descent ascent. As an analogy, minimizing a strongly-convex function by quasi-Newton methods [20, 32] has a dependency on $\kappa$ in the first period of linear convergence, which matches the convergence rate of gradient descent. Our quasi-Newton methods can also be extended to solving non-linear equations, and the two-stage convergence behavior still holds. We provide a detailed discussion and a comparison with related work [19, 41] in Appendix F.
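To get a feel for the bound in Corollary 3.18, the following sketch (ours; `d`, `kappa` and `k0` are arbitrary example values) evaluates the BFGS/SR1 bound and its per-iteration contraction ratio, which keeps shrinking with $k$: the signature of superlinear convergence.

```python
# Evaluate the BFGS/SR1 two-stage bound of Corollary 3.18 on lambda_{k0+k}/lambda_0
# for example values of d, kappa and k0 (chosen by us purely for illustration).
d, kappa, k0 = 50, 10.0, 200

def bound(k):
    linear = (1.0 - 1.0 / (4.0 * kappa**2)) ** k0           # first, linear phase
    superlinear = (1.0 - 1.0 / (d + 1)) ** (k * (k - 1) / 2) * 0.5 ** k
    return linear * superlinear

# Per-iteration contraction factor (1 - 1/(d+1))^k / 2 is strictly decreasing in k.
ratios = [bound(k + 1) / bound(k) for k in range(1, 30)]
print(ratios[0], ratios[-1])
```

The $k(k-1)/2$ exponent is what separates this from a linear rate: each additional iteration multiplies the bound by an ever-smaller factor instead of a fixed one.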
[Figure 1 here: six panels, (a) a9a (iteration), (b) w8a (iteration), (c) sido0 (iteration), (d) a9a (time), (e) w8a (time), (f) sido0 (time), comparing RaSR1-Q, RaBFGSv1-Q, RaBFGSv2-Q and EG.]

Figure 1: We demonstrate iteration numbers vs. $\|g(z)\|^2$ and CPU time (seconds) vs. $\|g(z)\|^2$ for the AUC model on the datasets a9a ($d = 126$, $n = 32561$), w8a ($d = 303$, $n = 45546$) and sido0 ($d = 4935$, $n = 12678$).

5 Numerical Experiments

In this section, we conduct experiments on the machine learning applications of AUC maximization and adversarial debiasing to verify our theory. We refer to Algorithms 2 and 5 with parameter $\tau_k = u_k^\top H_{k+1} u_k / (u_k^\top G_k u_k)$ as RaBFGSv1-Q and RaBFGSv1-G; we refer to Algorithms 3, 4, 6 and 7 as RaBFGSv2-Q, RaSR1-Q, RaBFGSv2-G and RaSR1-G, respectively. We compare these proposed algorithms with the classical first-order method extragradient (EG) [18, 38].

5.1 AUC Maximization

AUC maximization [15, 43] aims to find a classifier $w \in \mathbb{R}^m$ on the training set $\{a_i, b_i\}_{i=1}^n$, where $a_i \in \mathbb{R}^m$ and $b_i \in \{+1, -1\}$. We denote by $n_+$ the number of positive instances and let $p = n_+ / n$. The minimax formulation of AUC maximization can be written as

$$\min_{x \in \mathbb{R}^{m+2}} \max_{y \in \mathbb{R}} f(x, y) \stackrel{\text{def}}{=} \frac{1}{n} \sum_{i=1}^n f_i(x, y; a_i, b_i, \lambda),$$

where $x = [w; u; v] \in \mathbb{R}^{m+2}$, $\lambda > 0$ is the regularization parameter and $f_i$ is defined as

$$f_i(x, y; a_i, b_i, \lambda) = \frac{\lambda}{2} \|x\|_2^2 - p(1-p) y^2 + p\big((w^\top a_i - v)^2 + 2(1+y) w^\top a_i\big) \mathbb{I}_{b_i = -1} + (1-p)\big((w^\top a_i - u)^2 - 2(1+y) w^\top a_i\big) \mathbb{I}_{b_i = 1}.$$

The objective function of AUC maximization is quadratic, hence we run the algorithms of Section 3.2 (Algorithms 2, 3 and 4) on this model. We set $\lambda = 100/n$ and evaluate all algorithms on three real-world datasets: a9a, w8a and sido0. The dimension of the problem is $d = m + 3$. The results of iteration numbers against $\|g(z)\|^2$ and CPU time against $\|g(z)\|^2$ are presented in Figure 1.
These results show that our algorithms perform better than the EG method.

5.2 Adversarial Debiasing

Adversarial learning [22, 44] can be used for fairness-aware machine learning. We are given the training set $\{a_i, b_i, c_i\}_{i=1}^n$, where $a_i \in \mathbb{R}^d$ contains the input variables, $b_i \in \mathbb{R}$ is the output and $c_i \in \mathbb{R}$ is the input variable which we want to protect and keep unbiased. Our experiments are based on the fairness-aware binary classification datasets adult, bank market and law school [30], for which $b_i, c_i \in \{+1, -1\}$. The model is formulated by the minimax problem

$$\min_{x \in \mathbb{R}^m} \max_{y \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^n \big(l_1(a_i, b_i, x) - \beta\, l_2(a_i^\top x, c_i, y)\big) + \lambda \|x\|^2 - \gamma y^2,$$

where $l_1$, $l_2$ are logit functions of the form $\mathrm{logit}(a, b, c) = \log\big(1 + \exp(-b\, a^\top c)\big)$. We set the parameters $\beta$, $\lambda$ and $\gamma$ to $0.5$, $10^{-4}$ and $10^{-4}$, respectively. The dimension of the problem is $d = m + 1$. Since the objective function is non-quadratic, we run the proposed algorithms of Section 3.3 (Algorithms 5, 6 and 7) here. We use extragradient as a warm-up to reach the local condition required by the proposed algorithms. The results of iteration numbers against $\|g(z)\|^2$ and CPU time against $\|g(z)\|^2$ are presented in Figure 2, which indicate that our algorithms significantly outperform the baseline algorithm.

[Figure 2 here: six panels, (a) adults (iteration), (b) law school (iteration), (c) bank market (iteration), (d) adults (time), (e) law school (time), (f) bank market (time), comparing RaSR1-G, RaBFGSv1-G, RaBFGSv2-G and EG.]

Figure 2: We demonstrate iteration numbers vs. $\|g(z)\|^2$ and CPU time (seconds) vs. $\|g(z)\|^2$ for the adversarial debiasing model on the datasets adults ($d = 123$, $n = 32561$), law school ($d = 380$, $n = 20427$) and bank market ($d = 3880$, $n = 45211$).

6 Conclusion

In this work, we propose quasi-Newton methods for solving strongly-convex-strongly-concave saddle point problems.
We characterize the second-order information by approximating the square of the Hessian matrix, which avoids dealing with the indefinite Hessian directly. We present explicit local superlinear convergence rates for the Broyden family update and faster convergence rates for two specific methods: the SR1 and BFGS updates. However, our algorithms still require computing the exact gradient and storing the approximate Hessian with $\mathcal{O}(d^2)$ space complexity. In future work, it would be interesting to design stochastic algorithms to further reduce the computational cost per iteration. It is also possible to design limited-memory quasi-Newton methods for more scalable saddle point problems.

Acknowledgements

We would like to thank the anonymous reviewer for pointing out the mistakes in our previous proof. We would also like to thank Tong Zhang and John C.S. Lui for giving useful suggestions. This work is supported by the National Natural Science Foundation of China (No. 62206058) and the Shanghai Sailing Program (22YF1402900).

References

[1] Tamer Başar and Geert Jan Olsder. Dynamic noncooperative game theory. SIAM, 1998.
[2] Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimization. Princeton University Press, 2009.
[3] Charles G. Broyden. Quasi-Newton methods and their application to function minimisation. Mathematics of Computation, 21(99):368-381, 1967.
[4] Charles G. Broyden. The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA Journal of Applied Mathematics, 6(1):76-90, 1970.
[5] Charles G. Broyden. The convergence of a class of double-rank minimization algorithms 2. The new algorithm. IMA Journal of Applied Mathematics, 6(3):222-231, 1970.
[6] Charles G. Broyden, J. E. Dennis, and Jorge J. Moré. On the local and superlinear convergence of quasi-Newton methods. IMA Journal of Applied Mathematics, 12(3):223-245, 1973.
[7] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng.
Training GANs with optimism. In ICLR, 2018.
[8] William C. Davidon. Variable metric method for minimization. SIAM Journal on Optimization, 1(1):1-17, 1991.
[9] J. E. Dennis, Jr. and Jorge J. Moré. A characterization of superlinear convergence and its application to quasi-Newton methods. Mathematics of Computation, 28(126):549-560, 1974.
[10] Simon S. Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methods for policy evaluation. In ICML, 2017.
[11] John C. Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. Journal of Machine Learning Research, 20(68):1-55, 2019.
[12] Roger Fletcher and Michael J.D. Powell. A rapidly convergent descent method for minimization. The Computer Journal, 6:163-168, 1963.
[13] Rui Gao and Anton J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
[14] Gene H. Golub and Charles F. Van Loan. Matrix Computations, 1996.
[15] James A. Hanley and Barbara J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29-36, 1982.
[16] Kevin Huang, Junyu Zhang, and Shuzhong Zhang. Cubic regularized Newton method for saddle point models: a global and local convergence analysis. arXiv preprint arXiv:2008.09919, 2020.
[17] Qiujiang Jin and Aryan Mokhtari. Non-asymptotic superlinear convergence of standard quasi-Newton methods. arXiv preprint arXiv:2003.13607, 2020.
[18] G. M. Korpelevich. An extragradient method for finding saddle points and for other problems. Matecon, 12:747-756, 1976.
[19] Dachao Lin, Haishan Ye, and Zhihua Zhang. Explicit superlinear convergence rates of Broyden's methods in nonlinear equations. arXiv preprint arXiv:2109.01974, 2021.
[20] Dachao Lin, Haishan Ye, and Zhihua Zhang. Explicit convergence rates of greedy and random quasi-Newton methods. arXiv preprint arXiv:2104.08764, 2021.
[21] Tianyi Lin, Chi Jin, and Michael I. Jordan. Near-optimal algorithms for minimax optimization. In COLT, 2020.
[22] Daniel Lowd and Christopher Meek. Adversarial learning. In KDD, 2005.
[23] Yurii Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159-181, 2008.
[24] Yurii Nesterov and Laura Scrimali. Solving strongly monotone variational and quasi-variational inequalities. Discrete and Continuous Dynamical Systems, 31(4):1383-1396, 2007.
[25] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, NY, USA, second edition, 2006.
[26] Yuyuan Ouyang and Yangyang Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. arXiv preprint arXiv:1808.02901, 2018.
[27] Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.
[28] L. D. Popov. A modification of the Arrow-Hurwicz method for search of saddle points. Mathematical Notes of the Academy of Sciences of the USSR, 28(5):845-848, 1980.
[29] Michael J.D. Powell. On the convergence of the variable metric algorithm. IMA Journal of Applied Mathematics, 7(1):21-36, 1971.
[30] Tai Le Quy, Arjun Roy, Vasileios Iosifidis, and Eirini Ntoutsi. A survey on datasets for fairness-aware machine learning, 2021.
[31] R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877-898, 1976.
[32] Anton Rodomanov and Yurii Nesterov. Greedy quasi-Newton methods with explicit superlinear convergence. SIAM Journal on Optimization, 31(1):785-811, 2021.
[33] Anton Rodomanov and Yurii Nesterov. New results on superlinear convergence of classical quasi-Newton methods. Journal of Optimization Theory and Applications, 188(3):744-769, 2021.
[34] Anton Rodomanov and Yurii Nesterov. Rates of superlinear convergence for classical quasi-Newton methods. Mathematical Programming, pages 1-32, 2021.
[35] Nicol N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723–1738, 2002.
[36] Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. arXiv preprint arXiv:1509.09259, 2015.
[37] David F. Shanno. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111):647–656, 1970.
[38] Paul Tseng. On linear convergence of iterative methods for the variational inequality problem. Journal of Computational and Applied Mathematics, 60(1–2):237–252, 1995.
[39] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 2007.
[40] Yuanhao Wang and Jian Li. Improved algorithms for convex-concave minimax optimization. In NeurIPS, 2020.
[41] Haishan Ye, Dachao Lin, and Zhihua Zhang. Greedy and random Broyden's methods with explicit superlinear convergence rates in nonlinear equations. arXiv preprint arXiv:2110.08572, 2021.
[42] Haishan Ye, Dachao Lin, Zhihua Zhang, and Xiangyu Chang. Explicit superlinear convergence rates of the SR1 algorithm. arXiv preprint arXiv:2105.07162, 2021.
[43] Yiming Ying, Longyin Wen, and Siwei Lyu. Stochastic online AUC maximization. In NIPS, 2016.
[44] Brian Hu Zhang, Blake Lemoine, and Margaret Mitchell. Mitigating unwanted biases with adversarial learning. In AIES, 2018.
[45] Junyu Zhang, Mingyi Hong, and Shuzhong Zhang. On lower iteration complexity bounds for the saddle point problems. arXiv preprint arXiv:1912.07481, 2019.
[46] Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. Journal of Machine Learning Research, 18(1):2939–2980, 2017.

Checklist

1. For all authors...
    (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
    (b) Did you describe the limitations of your work? [Yes] See Section 6.
    (c) Did you discuss any potential negative societal impacts of your work? [N/A]
    (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
    (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Section 2.
    (b) Did you include complete proofs of all theoretical results? [Yes] See Appendix A–D.
3. If you ran experiments...
    (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We present them in the supplemental material.
    (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] We present them in Section 5 and in Appendix E.
    (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No]
    (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix E.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
    (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 5 and Appendix E.
    (b) Did you mention the license of the assets? [Yes] See Section 5 and Appendix E; the data is publicly available.
    (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] See Appendix E.
    (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
    (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
    (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
    (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
    (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]