# Stochastic Optimization for Non-convex Inf-Projection Problems

Yan Yan 1, Yi Xu 2, Lijun Zhang 3, Xiaoyu Wang 4, Tianbao Yang 1

1 University of Iowa; 2 DAMO Academy, Alibaba Group; 3 Nanjing University; 4 The Chinese University of Hong Kong (Shenzhen). Correspondence to: Yan Yan, Tianbao Yang.

*Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).*

**Abstract.** In this paper, we study a family of non-convex and possibly non-smooth inf-projection minimization problems, where the target objective function equals the minimization of a joint function over another variable. This problem class includes difference-of-convex (DC) functions and a family of bi-convex functions as special cases. We develop stochastic algorithms and establish their first-order convergence for finding a (nearly) stationary solution of the target non-convex function under different conditions on the component functions. To the best of our knowledge, this is the first work that comprehensively studies stochastic optimization of non-convex inf-projection minimization problems with a provable convergence guarantee. Our algorithms enable efficient stochastic optimization of a family of non-decomposable DC functions and a family of bi-convex functions. To demonstrate the power of the proposed algorithms, we consider an important application: variance-based regularization. Experiments verify the effectiveness of our inf-projection based formulation and the proposed stochastic algorithms in comparison with previous stochastic algorithms based on the min-max formulation for achieving the same effect.

## 1. Introduction

In this paper, we consider a family of non-convex and possibly non-smooth problems with the following structure:
$$\min_{x\in X}\ F(x) := g(x) + \min_{y\in\mathrm{dom}(h)} \big\{ h(y) - \langle y, \ell(x)\rangle \big\}, \tag{1}$$
where $X\subseteq\mathbb{R}^d$ is a closed convex set, $g: X\to\mathbb{R}$ is lower-semicontinuous, $h: \mathrm{dom}(h)\to\mathbb{R}$ is uniformly convex, $\ell: X\to\mathbb{R}^m$ is a lower-semicontinuous differentiable mapping, and $\langle\cdot,\cdot\rangle$ is the inner product. The requirement of uniform convexity on $h$ ensures that the inner minimization problem is well defined and that its solution is unique (cf. Section 2). Defining $f(x, y) = g(x) + h(y) - \langle y, \ell(x)\rangle$, the objective function $F(x)$ is called the inf-projection of $f(x, y)$ in the literature. When $g$ is convex, depending on $\mathrm{dom}(h)$, two subfamilies of problem (1) deserve more discussion: difference-of-convex (DC) functions and bi-convex functions.

**DC functions.** When $g$ is convex, $\mathrm{dom}(h)\subseteq\mathbb{R}^m_+$, and $\ell$ is convex (the case $\mathrm{dom}(h)\subseteq\mathbb{R}^m_-$ with $\ell$ concave can be transferred to this one by a change of variable), the inf-projection minimization problem (1) is equivalent to the DC problem
$$\min_{x\in X}\ F(x) = \big\{ g(x) - h^*(\ell(x)) \big\}, \tag{2}$$
where $h^*$ denotes the convex conjugate of $h$; the convexity of the second component $h^*(\ell(x))$ follows from the composition rule of convexity (Boyd & Vandenberghe, 2004), noting that $h^*$ is monotonically increasing iff $\mathrm{dom}(h)\subseteq\mathbb{R}^m_+$. Minimizing DC functions has wide applications in machine learning and statistics (Nitanda & Suzuki, 2017; Kiryo et al., 2017). Although stochastic algorithms for DC problems have been considered recently (Nitanda & Suzuki, 2017; Xu et al., 2018a; Thi et al., 2017), working with the inf-projection minimization (1) is preferred when $h^*(\ell(x))$ is non-decomposable, so that an unbiased stochastic gradient of $h^*(\ell(x))$ is not as easily accessible as that of $\langle y, \ell(x)\rangle$ in (1). Inspired by this scenario, let us consider an important instance: variance-based regularization.
Variance-based regularization refers to a learning paradigm that minimizes the empirical loss and its variance simultaneously, by which a better bias-variance trade-off may be achieved (Maurer & Pontil, 2009). To give a condensed understanding of its connection to the inf-projection formulation, we can re-formulate the problem (cf. the details and a comparison with a related convex objective of (Namkoong & Duchi, 2017) in Section 5):
$$\min_{x\in X}\ \frac{1}{n}\sum_{i=1}^n l_i(x) + \frac{\lambda}{2}\,\frac{1}{n}\sum_{i=1}^n (l_i(x))^2 - \frac{\lambda}{2}\Big(\frac{1}{n}\sum_{i=1}^n l_i(x)\Big)^2, \tag{3}$$
where $l_i(x): X\to\mathbb{R}_+$ is the loss function of a model $x$ on the $i$-th example and $\lambda > 0$ is a regularization parameter. Problem (3) is a special case of (2) by regarding $g(x)$ as the sum of the first two terms, $\ell(x) = \frac{1}{n}\sum_{i=1}^n l_i(x)$, and $h^*(s) = \frac{\lambda}{2}s^2$. By noting the conjugate identity $-\frac12\big(\frac{1}{n}\sum_{i=1}^n l_i(x)\big)^2 = \min_{y\ge 0}\ \frac12 y^2 - \big\langle y,\ \frac{1}{n}\sum_{i=1}^n l_i(x)\big\rangle$, the above problem can be viewed as a special case of (1). In this way, a stochastic gradient of $f(x, y)$ in terms of $x$ can be computed from one sampled loss function $l_i(x)$. This is easier than computing an unbiased stochastic gradient of $h^*(\ell(x))$ in (3), which requires at least two sampled loss functions.

Table 1. Summary of results for finding a (nearly) $\epsilon$-stationary solution in this work under different conditions on $g$, $h$, and $\ell$. SM means smooth, Lip. means Lipschitz continuous, Diff means differentiable, MO means monotonically increasing or decreasing (for $h^*$), CVX means convex, and UC means $p$-uniformly convex ($p\ge 2$), with $v = 1/(p-1)$.

| $g$ | $h$ ($h^*$) | $\ell$ | Alg. | Mini-Batch | Compl. |
|---|---|---|---|---|---|
| SM | UC & simple | SM & Lip. | MSPG (Section 3) | Yes | $O(1/\epsilon^{4/v})$ |
| SM & CVX | UC (MO) | Diff & Lip. & CVX | St-SPG (Section 4) | No | $O(1/\epsilon^{4/v})$ |
| Lip. & CVX | UC (MO) | SM & Lip. & CVX | St-SPG (Section 4) | No | $O(1/\epsilon^{4/v})$ |

**Bi-convex functions.** When $g$ is convex, $\mathrm{dom}(h)\subseteq\mathbb{R}^m_-$, and $\ell$ is convex, the inf-projection minimization problem (1) reduces to minimization of a bi-convex function. In particular, $f(x, y)$ is convex in $x$ for every fixed $y\in\mathrm{dom}(h)$ and convex in $y$ for every fixed $x\in X$. The concerned family of bi-convex functions also finds applications in machine learning and computer vision (Kumar et al., 2010; Shah et al., 2016). For example, the self-paced learning method of (Kumar et al., 2010) solves the bi-convex problem $\min_{w\in\mathbb{R}^d,\, v\in[0,1]^n}\ r(w) + \sum_{i=1}^n v_i f_i(w) - \frac{1}{K}\sum_{i=1}^n v_i$, where $r$ and $f_i$ are convex in $w$, and at the optimum $v_i = 0$ if $f_i(w) > \frac{1}{K}$ and $v_i = 1$ if $f_i(w) < \frac{1}{K}$; this problem is covered by (1). Although deterministic optimization methods (e.g., proximal alternating linearized minimization and its variants (Bolte et al., 2014; Davis et al., 2016)) and their convergence theory have been studied for minimizing a bi-convex function (Gorski et al., 2007), algorithms and convergence theory for stochastic optimization of a bi-convex function remain under-explored, especially when one is interested in convergence with respect to the target function $F(x)$. A special case that belongs to both the DC and bi-convex families is $\ell(x) = Ax$, where $\mathrm{dom}(h)$ can be any convex set.

A naive idea to tackle (1) is alternating minimization or block coordinate descent, i.e., alternately solving the inner minimization problem over $y$ given $x$ and then updating $x$ by certain approaches (e.g., stochastic gradient descent) (Bolte et al., 2014; Davis et al., 2016; Hong et al., 2015; Xu & Yin, 2013; Driggs et al., 2020); a minimal sketch is given below.
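For intuition, here is a minimal sketch of this naive alternating scheme on the variance-regularization instance (3), using the scaled auxiliary problem $\min_{y\ge 0}\ \frac{\lambda}{2}y^2 - \lambda y\cdot\frac{1}{n}\sum_i l_i(x)$ (so the inner minimizer is $y^*(x) = \max(0, \frac{1}{n}\sum_i l_i(x))$) and assuming logistic losses; all function and variable names are our own illustration, not from the paper.

```python
import numpy as np

def logistic_losses(A, b, x):
    """Per-example logistic losses l_i(x); note this is a FULL pass over n examples."""
    return np.log1p(np.exp(-b * (A @ x)))

def naive_alternating(A, b, lam=0.1, eta=0.01, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x, y = np.zeros(d), 0.0
    for _ in range(iters):
        # Inner minimization over y has a closed form here, but evaluating it
        # exactly costs O(n) loss evaluations per outer update.
        y = max(0.0, logistic_losses(A, b, x).mean())
        # One-sample stochastic gradient step on x for
        # f(x, y) = mean(l) + (lam/2)*mean(l^2) + lam*(y^2/2 - y*mean(l)).
        i = rng.integers(n)
        margin = b[i] * (A[i] @ x)
        l_i = np.log1p(np.exp(-margin))
        grad_l_i = -b[i] * A[i] / (1.0 + np.exp(margin))  # gradient of l_i
        x -= eta * grad_l_i * (1.0 + lam * l_i - lam * y)
    return x, y
```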
However, this alternating approach suffers from two issues: (i) solving the inner minimization might not be a trivial task (e.g., solving the inner minimization problem related to (3) requires passing over all $n$ examples once); (ii) the target objective function $F(x)$ is not necessarily a smooth or convex function, which makes the convergence analysis challenging. Additionally, existing convergence analyses focus on $f(x, y)$ instead of $F(x)$. In this paper, the main question we tackle is: how can we design efficient stochastic algorithms using simple updates for both $x$ and $y$ that enjoy a provable convergence guarantee for finding a stationary point of $F(x)$? Our contributions are summarized below.

First, we consider the case when $g$ and $\ell$ are smooth but not necessarily convex and $h$ is a simple function whose proximal mapping is easy to compute. Under the condition that $\ell$ is Lipschitz continuous, we prove the convergence of a mini-batch stochastic proximal gradient method (MSPG) with increasing mini-batch size that employs parallel stochastic gradient updates for $x$ and $y$, and we establish its convergence rate.

Second, we consider the cases when $g$ and $\ell$ are not necessarily smooth but convex, and $h$ is not necessarily a simple function (corresponding to DC and bi-convex functions). We develop an algorithmic framework that employs a suitable stochastic algorithm for solving strongly convex functions in a stagewise manner. We analyze the convergence rates for finding a (nearly) stationary point when employing the stochastic proximal gradient (SPG) method at each stage, resulting in St-SPG.

The complexity results of our algorithms under different conditions on $g$, $h$, and $\ell$ are shown in Table 1. The novelty and significance of our results are that (i) this is the first work that comprehensively studies the stochastic optimization of a non-smooth non-convex inf-projection problem; (ii) the application of the inf-projection formulation to variance-based regularization demonstrates much faster convergence of our algorithms compared with existing algorithms based on a min-max formulation.

## 2. Preliminaries

Let us first present some notation. We let $\|\cdot\|$ denote the Euclidean norm of a vector and the spectral norm of a matrix. We use $\xi$ to denote a random variable. Given a function $g:\mathbb{R}^d\to\mathbb{R}$, we denote the Fréchet subdifferential and the limiting Fréchet subdifferential by $\hat\partial g$ and $\partial g$ respectively, i.e., at $x$,
$$\hat\partial g(x) = \Big\{ v\in\mathbb{R}^d : \liminf_{x'\to x}\ \frac{g(x') - g(x) - \langle v, x' - x\rangle}{\|x' - x\|} \ge 0 \Big\}$$
and
$$\partial g(x) = \big\{ v\in\mathbb{R}^d : \exists\, x_k \xrightarrow{g} x,\ v_k\in\hat\partial g(x_k),\ v_k\to v \big\},$$
where $x_k \xrightarrow{g} x$ means $x_k\to x$ and $g(x_k)\to g(x)$. When $g$ is differentiable, both subdifferentials reduce to the standard gradient $\nabla g$. It is known that $\hat\partial g(x)\subseteq\partial g(x)$, that $\hat\partial g(x) = \{\nabla g(x)\}$ if $g$ is differentiable, and that $\partial g(x) = \{\nabla g(x)\}$ if $g$ is continuously differentiable. We denote by $\nabla_x g(x, y)$ the partial gradient in the direction of $x$, and $\nabla g(x, y) = (\nabla_x g(x, y)^\top, \nabla_y g(x, y)^\top)^\top$. In this paper, we prove convergence in terms of the limiting subdifferential, but all results can be extended to the Fréchet subdifferential. Let $\nabla\ell(x)\in\mathbb{R}^{m\times d}$ denote the Jacobian matrix of the differentiable mapping $\ell(x)$; $\ell$ is said to be $G_\ell$-Lipschitz continuous if $\|\nabla\ell(x)\|\le G_\ell$. A differentiable function $f(\cdot)$ has an $(L, v)$-Hölder continuous gradient if $\|\nabla f(x_1) - \nabla f(x_2)\| \le L\|x_1 - x_2\|^v$ holds for some $v\in(0,1]$ and $L > 0$. When $v = 1$, this is the standard $L$-smoothness. If $\nabla f$ is Hölder continuous, then it holds that $f(x_1) - f(x_2) \le \langle\nabla f(x_2), x_1 - x_2\rangle + \frac{L}{1+v}\|x_1 - x_2\|^{1+v}$. A related condition is uniform convexity.
A function $f(\cdot)$ is $(\varrho, p)$-uniformly convex with $p\ge 2$ if
$$f(x_1) \ge f(x_2) + \langle\nabla f(x_2), x_1 - x_2\rangle + \frac{\varrho}{2}\|x_1 - x_2\|_2^p.$$
When $p = 2$, this is known as strong convexity. If $f$ is $(\varrho, p)$-uniformly convex, then the following inequality holds:
$$\varrho\|x_1 - x_2\|^p \le \langle\nabla f(x_1) - \nabla f(x_2),\ x_1 - x_2\rangle \le \|\nabla f(x_1) - \nabla f(x_2)\|\,\|x_1 - x_2\|. \tag{4}$$
It is obvious that a uniformly convex function has a unique minimizer. If $f$ is uniformly convex, then its convex conjugate $f^*$ has a Hölder continuous gradient, and vice versa, as summarized in the following lemma.

**Lemma 1.** Let $f$ be differentiable with an $(L, v)$-Hölder continuous gradient, $v\in(0,1]$. Then $f^*$ is $(\varrho, p)$-uniformly convex with $p = 1 + \frac{1}{v}$ and $\varrho = \frac{2v}{1+v}(1/L)^{1/v}$. If $f$ is $(\varrho, p)$-uniformly convex, then $f^*$ has an $(L, v)$-Hölder continuous gradient with $L = (1/\varrho)^{1/(p-1)}$ and $v = 1/(p-1)$.

Next we discuss the convergence measure for the considered inf-projection problem. Let $f_h(x) = \min_y h(y) - \langle y, \ell(x)\rangle$. If $h$ is uniformly convex, let $y^*(x) = \arg\min_y h(y) - \langle y, \ell(x)\rangle$ denote the unique minimizer. Under a regularity condition that $h(y) - \langle y, \ell(x)\rangle$ is level-bounded in $y$ uniformly in $x$, Theorem 10.58 of (Rockafellar & Wets, 2009) implies $\nabla f_h(x) = -\nabla\ell(x)^\top y^*(x)$. Then $\partial F(x) = \partial g(x) - \nabla\ell(x)^\top y^*(x)$ under a smoothness or convexity condition on $g$, which allows us to access $\partial F(x)$ through $\partial g(x) - \nabla\ell(x)^\top y^*(x)$. In this paper, we aim to find a solution that is $\epsilon$-stationary or nearly $\epsilon$-stationary for $F$, defined as follows.

**Definition 1.** A solution $x$ satisfying $\mathrm{dist}(0, \partial F(x)) \le \epsilon$ is called an $\epsilon$-stationary point of $F$. A solution $x$ is called nearly $\epsilon$-stationary if there exist $z$ and a constant $c > 0$ such that $\|z - x\| \le c\epsilon$ and $\mathrm{dist}(0, \partial F(z)) \le \epsilon$.

Near stationarity has been used to measure convergence for non-smooth non-convex optimization in the literature (Davis & Grimmer, 2017; Davis & Drusvyatskiy, 2018b;a; Chen et al., 2018; Xu et al., 2018a). Before ending this section, we state basic assumptions below. For simplicity, all variance bounds are denoted by $\sigma^2$. Additional conditions regarding $g$, $h$, and $\ell$ are presented in individual theorems.

**Assumption 1.** For problem (1) we assume: (i) $h^*$ has an $(L_{h^*}, v)$-Hölder continuous gradient, and $\ell$ is continuously differentiable; (ii) let $\partial g(x;\xi_g)$ denote a stochastic gradient of $g(x)$; if $g$ is smooth, assume $\mathbb{E}[\|\nabla g(x;\xi_g) - \nabla g(x)\|^2]\le\sigma^2$, otherwise assume $\mathbb{E}[\|\partial g(x;\xi_g)\|^2]\le\sigma^2$ for $x\in X$; (iii) let $\ell(x;\xi_\ell)$ denote a stochastic version of $\ell(x)$ and assume $\mathbb{E}[\|\ell(x;\xi_\ell) - \ell(x)\|^2]\le\sigma^2$; if $\ell$ is smooth, assume $\mathbb{E}[\|\nabla\ell(x;\xi_\ell) - \nabla\ell(x)\|^2]\le\sigma^2$, otherwise assume $\mathbb{E}[\|\nabla\ell(x;\xi_\ell)\|^2]\le\sigma^2$ for $x\in X$; (iv) let $\nabla h(y;\xi_h)$ denote a stochastic gradient of $h(y)$ and assume $\mathbb{E}[\|\nabla h(y;\xi_h)\|^2]\le\sigma^2$ for $y\in\mathrm{dom}(h)$; (v) $\max_{x\in X,\, y\in\mathrm{dom}(h)} f(x, y) - \min_{x\in X,\, y\in\mathrm{dom}(h)} f(x, y) \le M$.

## 3. Mini-batch Stochastic Gradient Methods for Smooth Functions

In this section, we consider the case when $g$ and $\ell$ are smooth functions but not necessarily convex. Please note that the target function $F$ is still non-convex and not necessarily smooth. We assume $h$ is simple in the sense that its proximal mapping, defined by $P_{\eta h}[\hat y] = \arg\min_y h(y) + \frac{1}{2\eta}\|y - \hat y\|^2$, is easy to compute. Let $f_0(x, y) = g(x) - \langle y, \ell(x)\rangle$.

**Algorithm 1** MSPG
1: Input: initial $x_1, y_1$.
2: **for** $t = 1, \ldots, T$ **do**
3: Compute mini-batch stochastic partial gradients $\tilde\nabla_x f_0^{(t)} = \frac{1}{m_t}\sum_{i=1}^{m_t}\nabla_x f_0(x_t, y_t;\xi_i)$ and $\tilde\nabla_y f_0^{(t)} = \frac{1}{m_t}\sum_{i=1}^{m_t}\nabla_y f_0(x_t, y_t;\xi_i)$
4: $x_{t+1} = \Pi_X[x_t - \eta\,\tilde\nabla_x f_0^{(t)}]$
5: $y_{t+1} = P_{\eta h}[y_t - \eta\,\tilde\nabla_y f_0^{(t)}]$
6: **end for**
7: Output: $w_\tau = (x_\tau, y_\tau)$, where $\tau\in\{1,\ldots,T\}$ is randomly sampled
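To make Algorithm 1 concrete, below is a minimal Python sketch; the callable interface (`grad_x_f0`, `grad_y_f0`, `proj_X`, `prox_h`) and the uniform sampling of the output index $\tau$ are our own illustrative choices, not specified by the paper.

```python
import numpy as np

def mspg(grad_x_f0, grad_y_f0, proj_X, prox_h, x, y, eta, T, b=1, seed=0):
    """Sketch of Algorithm 1 (MSPG). grad_{x,y}_f0(x, y, m) should return
    mini-batch stochastic partial gradients of f0 averaged over m samples;
    proj_X is the Euclidean projection onto X and prox_h(v, eta) computes
    P_{eta h}[v]. These interfaces are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    iterates = []
    for t in range(1, T + 1):
        m_t = b * (t + 1)                     # increasing mini-batch size m_t
        gx = grad_x_f0(x, y, m_t)             # ~ (1/m_t) sum_i grad_x f0(x, y; xi_i)
        gy = grad_y_f0(x, y, m_t)
        x = proj_X(x - eta * gx)              # line 4 of Algorithm 1
        y = prox_h(y - eta * gy, eta)         # line 5 of Algorithm 1
        iterates.append((x, y))
    return iterates[rng.integers(T)]          # w_tau for a randomly sampled tau
```

For instance, for $h(y) = \frac{c}{2}y^2 + I_{y\ge 0}(y)$ the proximal mapping has the closed form $P_{\eta h}[\hat y] = \max(0, \hat y/(1 + \eta c))$.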
The key idea of our first algorithm is to treat $f(x, y) = f_0(x, y) + h(y) + I_X(x)$ as a function of the joint variable $w = (x, y)$, which consists of a smooth component $f_0$ and a non-smooth component $h(y) + I_X(x)$. Hence, we can employ the mini-batch stochastic proximal gradient (MSPG) method to minimize $f(x, y)$ based on stochastic gradients of $f_0(w)$, denoted by $\nabla f_0(w;\xi)$ for a random variable $\xi$. The detailed steps of MSPG are shown in Algorithm 1: at each iteration, stochastic partial gradients w.r.t. $x$ and $y$ are computed and used for updating. Although the convergence of MSPG for $f(w)$ has been considered in the literature on composite optimization (Ghadimi et al., 2016) and alternating minimization (Hong et al., 2015; Xu & Yin, 2013; Driggs et al., 2020), a gap remains when applying the existing convergence results, since we are interested in the convergence of $\mathrm{dist}(0, \partial F(x))$ rather than of $f_0(w)$. In the following, we fill this gap in four main steps. In brief, we first establish the joint smoothness of $f_0$ in $(x, y)$ (Lemma 2). Based on Lemma 2, we derive the convergence of $\mathrm{dist}(0, \partial f(w))$ (Proposition 1). Next, Lemma 3 connects $\mathrm{dist}(0, \partial f(w))$ to $\mathrm{dist}(0, \partial F(x))$. Finally, the convergence of $\mathrm{dist}(0, \partial F(x))$ is obtained (Theorem 2).

**Lemma 2.** Suppose $g$ is $L_g$-smooth, $\ell$ is $G_\ell$-Lipschitz continuous and $L_\ell$-smooth, and $\max_{y\in\mathrm{dom}(h)}\|y\|\le D_y$. Then $f_0(x, y)$ is smooth over $(x, y)\in X\times\mathrm{dom}(h)$, i.e.,
$$\|\nabla_x f_0(x,y) - \nabla_x f_0(x',y')\|_2^2 + \|\nabla_y f_0(x,y) - \nabla_y f_0(x',y')\|_2^2 \le L^2\big(\|x - x'\|_2^2 + \|y - y'\|_2^2\big),$$
where $L = \sqrt{\max(2L_g^2 + 4L_\ell^2 D_y^2 + G_\ell^2,\ 4G_\ell^2)}$.

Based on the joint smoothness of $f_0$ in $(x, y)$, we can establish the convergence of MSPG in terms of $\mathrm{dist}(0, \partial f(x_\tau, y_\tau))$ in the following proposition. Note that this convergence result in terms of $\mathrm{dist}(0, \partial f(x_\tau, y_\tau))$ is stronger than that of (Ghadimi et al., 2016) stated in terms of the proximal gradient; it follows the analysis in (Xu et al., 2019).

**Proposition 1.** Under the same conditions as in Lemma 2, suppose the stochastic gradient has bounded variance $\mathbb{E}[\|\nabla f_0(w;\xi) - \nabla f_0(w)\|^2]\le\sigma_0^2$. Run MSPG with $\eta = \frac{c}{L}$ ($0 < c < \frac12$) and a sequence of mini-batch sizes $m_t = b(t+1)$ for $t = 0,\ldots,T-1$, where $b > 0$ is a constant. Then the output $w_\tau$ of Algorithm 1 satisfies
$$\mathbb{E}[\mathrm{dist}(0, \partial f(w_\tau))^2] \le \frac{c_1\,\sigma_0^2(\log(T)+1)}{bT} + \frac{c_2\, L\, M}{T},$$
where $c_1 = \frac{2c(1-2c)+2}{c(1-2c)}$ and $c_2 = \frac{6}{4c}$.

The next lemma establishes the relation between $\mathrm{dist}(0, \partial f(w_\tau))$ and $\mathrm{dist}(0, \partial F(x_\tau))$, allowing us to derive the convergence of $\mathrm{dist}(0, \partial F(x_\tau))$ from that of $\mathrm{dist}(0, \partial f(w_\tau))$.

**Lemma 3.** Suppose the same conditions as in Lemma 2 hold and $h^*$ has an $(L_{h^*}, v)$-Hölder continuous gradient. Then for any $(\bar x, \bar y)\in X\times\mathrm{dom}(h)$, we have
$$\mathrm{dist}(0, \partial F(\bar x)) \le \|\nabla_x f(\bar x, \bar y)\|_2 + G_\ell\,\frac{1+v}{v}\,L_{h^*}\,\mathrm{dist}(0, \partial_y f(\bar x, \bar y))^v.$$

Finally, combining the above results, we can state the main result of this section regarding the convergence of MSPG in terms of the concerned quantity $\mathrm{dist}(0, \partial F(x_\tau))$.

**Theorem 2.** Suppose the same conditions as in Lemma 2 and Assumption 1 hold. Algorithm 1 guarantees that $\mathbb{E}[\mathrm{dist}(0, \partial F(x_\tau))^2]\le O(1/T^v)$. To ensure $\mathbb{E}[\mathrm{dist}(0, \partial F(x_\tau))]\le\epsilon$, we can set $T = O(1/\epsilon^{2/v})$. The total complexity is $O(1/\epsilon^{4/v})$.

## 4. Stochastic Algorithms for Non-Smooth Functions

In this section, we consider the case when $g$ or $\ell$ is not necessarily smooth but convex. We also assume $h^*$ is monotone, i.e., $\mathrm{dom}(h)\subseteq\mathbb{R}^m_+$ or $\mathrm{dom}(h)\subseteq\mathbb{R}^m_-$. In the former case, the objective function belongs to the family of DC functions; in the latter case, it belongs to the family of bi-convex functions. Please note that the target function $F$ is still non-convex and non-smooth.
The proposed algorithm is inspired by the stagewise stochastic DC algorithm of (Xu et al., 2018a), but with some major changes. Let us first briefly discuss the main idea and logic behind the proposed algorithm. There are two difficulties to tackle: (i) non-smoothness and non-convexity in terms of $x$; (ii) the minimization over $y$. To address the first issue, suppose for the moment that the optimal solution $y^*(x) = \arg\min_y h(y) - \langle y, \ell(x)\rangle$ given $x$ is available. Then the problem over $x$ becomes
$$\min_{x\in X}\ g(x) - \langle y^*(x), \ell(x)\rangle. \tag{5}$$
When $\mathrm{dom}(h)\subseteq\mathbb{R}^m_+$ (corresponding to a DC function), the above problem is still non-convex. In order to obtain a provable convergence guarantee, we consider the following strongly convex problem for some $\gamma > 0$ and $x_0\in X$, whose objective function is an upper bound of the function in (5) at $x_0$:
$$P(x_0) = \arg\min_{x\in X}\Big\{ g(x) - \big\langle y^*(x_0),\ \ell(x_0) + \nabla\ell(x_0)(x - x_0)\big\rangle + \frac{\gamma}{2}\|x - x_0\|^2 \Big\}. \tag{6}$$
Note that $P(x_0)$ is uniquely defined due to strong convexity. If $x_0 = P(x_0)$, it can be shown that $x_0$ is a critical point of $F(x)$, i.e., $0\in\partial F(x_0) = \partial g(x_0) - \nabla\ell(x_0)^\top y^*(x_0)$. We can therefore iteratively solve the fixed-point problem $x = P(x)$ until it converges. When $\mathrm{dom}(h)\subseteq\mathbb{R}^m_-$ (corresponding to a bi-convex function), we can simply consider the following strongly convex problem:
$$P(x_0) = \arg\min_{x\in X}\ g(x) - \langle y^*(x_0), \ell(x)\rangle + \frac{\gamma}{2}\|x - x_0\|^2.$$
A remaining issue in the above approach is that $y^*(x_0)$ is assumed to be available, which relates to the second difficulty mentioned above: it may not be easy to obtain an exact minimizer $y^*(x_0)$ for a given $x_0$. To this end, we can employ an iterative stochastic algorithm to optimize $\min_y h(y) - \langle y, \ell(x_0)\rangle$ approximately given $x_0$, obtaining an inexact solution $\hat y(x_0)$ such that
$$h(\hat y(x_0)) - \langle\hat y(x_0), \ell(x_0)\rangle \le h(y^*(x_0)) - \langle y^*(x_0), \ell(x_0)\rangle + \varepsilon$$
for some approximation error $\varepsilon$. We then combine the two pieces, i.e., replace $y^*(x_0)$ in the definition of $P(x_0)$ with $\hat y(x_0)$, and employ a stochastic algorithm to solve the fixed-point equation $x\approx\hat P(x)$, where $\hat P(x)$ is an approximation of $P(x)$. Therefore, we have two sources of approximation error: one from using $\hat y(x)$ instead of $y^*(x)$, and another from solving the minimization problem over $x$ inexactly. Our analysis shows that with well-controlled approximation errors, we can still achieve a provable convergence guarantee.

For the sake of presentation, let us first introduce some important notation covering the DC and bi-convex conditions. For the $k$-th stage of St-SPG, define
$$f^k_x(x) = g(x) - \big\langle y_k,\ \ell(x_k) + \nabla\ell(x_k)(x - x_k)\big\rangle \quad\text{for } \mathrm{dom}(h)\subseteq\mathbb{R}^m_+,$$
$$f^k_x(x) = g(x) - \langle y_k, \ell(x)\rangle \quad\text{for } \mathrm{dom}(h)\subseteq\mathbb{R}^m_-.$$
A stochastic gradient of $f^k_x(x)$ can be computed by $\partial g(x;\xi_g) - \nabla\ell(x_k;\xi_\ell)^\top y_k$ for $\mathrm{dom}(h)\subseteq\mathbb{R}^m_+$, or by $\partial g(x;\xi_g) - \nabla\ell(x;\xi_\ell)^\top y_k$ for $\mathrm{dom}(h)\subseteq\mathbb{R}^m_-$. For both conditions, let
$$f^k_y(y) = h(y) - \langle y, \ell(x_{k+1})\rangle, \qquad R^k_x(x) = \frac{\gamma}{2}\|x - x_k\|^2, \qquad R^k_y(y) = \frac{\mu}{2}\|y - y_k\|^2.$$
A stochastic gradient of $f^k_y(y)$ can be computed by $\nabla h(y;\xi_h) - \ell(x_{k+1};\xi'_\ell)$, where $\xi_g, \xi_\ell, \xi_h, \xi'_\ell$ denote independent random variables.

The proposed algorithm, named St-SPG, is shown in Algorithm 2; it employs SPG (Algorithm 3) to solve the subproblems of $x$ and $y$ in a stagewise manner.

**Algorithm 2** St-SPG
1: Initialize $x_1\in X$, $y_1\in\mathrm{dom}(h)$
2: Set a sequence of integers $T^x_k, T^y_k$ and numbers $\gamma, \mu$
3: **for** $k = 1,\ldots,K$ **do**
4: $x_{k+1} = \mathrm{SPG}(f^k_x, R^k_x, x_k, X, T^x_k, \gamma)$
5: $y_{k+1} = \mathrm{SPG}(f^k_y, R^k_y, y_k, \mathrm{dom}(h), T^y_k, \mu)$
6: **end for**

**Algorithm 3** $\mathrm{SPG}(f, R, z_1, \Omega, T, \gamma)$
1: Set $\eta_t$ according to $\gamma$
2: **for** $t = 1,\ldots,T$ **do**
3: $z_{t+1} = \arg\min_{z\in\Omega}\ \langle\partial f(z_t;\xi_t), z\rangle + R(z) + \frac{1}{2\eta_t}\|z - z_t\|^2$
4: **end for**
5: Output $\hat z_T = \frac{\sum_{t=1}^T t\, z_t}{\sum_{t=1}^T t}$
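A minimal Python sketch of Algorithms 2 and 3 is below, assuming Euclidean projections onto $X$ and $\mathrm{dom}(h)$ are available; the callable interfaces and the concrete choice $\eta_t = 4/(\gamma t)$ (the non-smooth case of Proposition 2 below) are our own illustration.

```python
import numpy as np

def spg(stoch_grad_f, center, z1, proj, T, gamma):
    """Sketch of Algorithm 3, SPG(f, R, z1, Omega, T, gamma), with
    R(z) = (gamma/2)||z - center||^2. stoch_grad_f(z) returns a stochastic
    (sub)gradient of f at z; proj is the Euclidean projection onto Omega."""
    z, zsum, wsum = z1, np.zeros_like(z1), 0.0
    for t in range(1, T + 1):
        eta = 4.0 / (gamma * t)               # step size set according to gamma
        g = stoch_grad_f(z)
        # Line 3: closed-form minimizer of <g, z> + R(z) + ||z - z_t||^2/(2*eta),
        # followed by the Euclidean projection onto Omega.
        z = proj((gamma * center + z / eta - g) / (gamma + 1.0 / eta))
        zsum, wsum = zsum + t * z, wsum + t
    return zsum / wsum                        # hat z_T: t-weighted average

def st_spg(grad_fx, grad_fy, proj_X, proj_domh, x1, y1, K, gamma, mu):
    """Sketch of Algorithm 2 (St-SPG). grad_fx(k, x, x_k, y_k) and
    grad_fy(k, y, x_next) build stochastic gradients of f^k_x and f^k_y."""
    x, y = x1, y1
    for k in range(1, K + 1):
        Tx, Ty = int(k / gamma) + 1, int(k / mu) + 1   # stage lengths grow with k
        x_next = spg(lambda z: grad_fx(k, z, x, y), x, x, proj_X, Tx, gamma)
        y = spg(lambda z: grad_fy(k, z, x_next), y, y, proj_domh, Ty, mu)
        x = x_next
    return x, y
```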
Since $x$ and $y$ share the same update procedure, SPG, we can analyze it in general notation. To this end, consider the convergence of SPG for solving $H(z) = f(z) + R(z)$, where $f(z)$ is a convex function and $R(z) = \frac{\gamma}{2}\|z - z_1\|^2$ is a strongly convex function. Its convergence has been considered in many previous works; here, we adopt the results derived in (Xu et al., 2018a) to establish the convergence of St-SPG under different conditions on $g$ and $\ell$.

**Proposition 2.** Let $H(z) = f(z) + R(z)$ where $R(z) = \frac{\gamma}{2}\|z - z_1\|^2$ is $\gamma$-strongly convex. If $f(z)$ is $L$-smooth with $\mathbb{E}[\|\nabla f(z;\xi) - \nabla f(z)\|^2]\le\sigma^2$ and $\gamma\ge 3L$, then by setting $\eta_t = 3/(\gamma(t+1))$ SPG guarantees that
$$\mathbb{E}[H(\hat z_T) - H(z^*)] \le \frac{4\gamma\|z^* - z_1\|^2}{3T(T+3)} + \frac{6\sigma^2}{\gamma(T+3)}.$$
If $f$ is non-smooth with $\mathbb{E}[\|\partial f(z;\xi)\|^2]\le\sigma^2$, then by setting $\eta_t = 4/(\gamma t)$ SPG guarantees that
$$\mathbb{E}[H(\hat z_T) - H(z^*)] \le \frac{\gamma\|z^* - z_1\|^2}{4T(T+1)} + \frac{17\sigma^2}{\gamma(T+1)},$$
where $z^* = \arg\min_{z\in\Omega} H(z)$.

With the above proposition, we can apply this convergence guarantee of SPG to $f^k_x(x) + R^k_x(x)$ and $f^k_y(y) + R^k_y(y)$. Define $v_k$ and $u_k$ as the optimal solutions of the subproblems of $x$ and $y$ at the $k$-th stage, respectively:
$$v_k = \arg\min_{x\in X}\ f^k_x(x) + R^k_x(x), \qquad u_k = \arg\min_{y\in\mathrm{dom}(h)}\ f^k_y(y) + R^k_y(y).$$
We can establish the following result regarding the convergence of St-SPG, covering both the fixed-point convergence ($x_{\tau+1} - x_\tau$) and the minimization error of $P(x)$, i.e., $\|x_{\tau+1} - v_\tau\|$, for a randomly sampled index $\tau\in\{1,\ldots,K\}$. The boundedness assumptions on $y$ and $\nabla\ell$ below guarantee the boundedness of the second moment of the stochastic gradients; they can be implied by assuming the domain $X$ is a compact set and $\mathrm{dom}(h)$ is bounded.

**Theorem 3.** Suppose Assumption 1 holds, and $\max(\|y_k\|^2, \mathbb{E}[\|\nabla\ell(x_{k+1};\xi)\|^2])\le D^2$ for $k\in\{1,\ldots,K\}$. There exists a constant $G^2 = 17\max\{2\sigma^2 + 2D^2\sigma^2,\ 2\sigma^2 + 2D^2\}$ such that, for any constants $\gamma > 0$, $\mu > 0$, $\alpha\ge 1$, Algorithm 2 with $T^x_k = k/\gamma + 1$ and $T^y_k = k/\mu + 1$ guarantees the following inequalities:
$$\frac12\mathbb{E}[\|x_{\tau+1} - v_\tau\|^2] \le \mathbb{E}[\|x_\tau - v_\tau\|^2 + \|x_{\tau+1} - x_\tau\|^2] \le \frac{4(M + 2G^2)(\alpha+1)}{\gamma K},$$
$$\frac12\mathbb{E}[\|y_{\tau+1} - u_\tau\|^2] \le \mathbb{E}[\|y_\tau - u_\tau\|^2 + \|y_{\tau+1} - y_\tau\|^2] \le \frac{4(M + 2G^2)(\alpha+1)}{\mu K},$$
for $\tau$ sampled with $P(\tau = k) = \frac{k^\alpha}{\sum_{s=1}^K s^\alpha}$.

The lemma below connects $\|\nabla F(x_k)\|$ (or $\mathrm{dist}(0, \partial F(x_k))$) to the quantities in Theorem 3, from which we can derive the convergence to a (nearly) stationary point.

**Lemma 4.** Suppose $g$ is $L_g$-smooth and $\ell$ is $G_\ell$-Lipschitz continuous. Then for any $k$ we have
$$\|\nabla F(x_k)\| \le (\gamma + L_g)\|x_k - v_k\| + G_\ell\|y_k - u_k\| + G_\ell\,\mu^v\,\frac{1+v}{v}L_{h^*}\|u_k - y_k\|^v + G_\ell^{v+1}\,\frac{1+v}{v}L_{h^*}\|x_{k+1} - x_k\|^v.$$
Suppose instead that $g$ is non-smooth, $\ell$ is $G_\ell$-Lipschitz continuous and $L_\ell$-smooth, and $\max_{y\in\mathrm{dom}(h)}\|y\|\le D$. Then for any $k$ we have
$$\mathrm{dist}(0, \partial F(v_k)) \le (\gamma + DL_\ell)\|x_k - v_k\| + G_\ell\|y_k - u_k\| + \frac{1+v}{v}L_{h^*}\big(\mu\|y_k - u_k\| + G_\ell\|x_{k+1} - x_k\|\big)^v.$$

Combining Lemma 4 and Theorem 3, we have the following corollaries regarding the convergence of St-SPG under different conditions on $g$ and $\ell$.

**Corollary 4.** Suppose $g$ is $L_g$-smooth, $\ell$ is $G_\ell$-Lipschitz continuous, and both are convex. Under the same conditions as in Theorem 3, we have $\mathbb{E}[\mathrm{dist}(0, \partial F(x_\tau))]\le\epsilon$ after $K = O(\epsilon^{-2/v})$ stages. Therefore, the total iteration complexity is $\sum_{k=1}^K(T^x_k + T^y_k) = O(\epsilon^{-4/v})$.

**Corollary 5.** Suppose $g$ is non-smooth and convex, $\ell$ is $G_\ell$-Lipschitz continuous, $L_\ell$-smooth, and convex, and $\max_{y\in\mathrm{dom}(h)}\|y\|\le D$. Under the same conditions as in Theorem 3, we have $\mathbb{E}[\mathrm{dist}(0, \partial F(v_\tau))]\le\epsilon$ and $\mathbb{E}[\|x_\tau - v_\tau\|]\le O(\epsilon^{1/v})$ after $K = O(\epsilon^{-2/v})$ stages.
Therefore, the total iteration complexity is $\sum_{k=1}^K(T^x_k + T^y_k) = O(\epsilon^{-4/v})$.

**Remark.** Our algorithms enjoy the same iteration complexity as that of (Xu et al., 2018a) for DC functions when $v$ is unknown or $v = 1$, but we do not assume that a stochastic gradient of $h^*(\ell(x))$ is easily computed. It is also notable that St-SPG does not need the knowledge of $v$ to run. Finally, we would like to mention that the SPG algorithm for solving the subproblems in Algorithm 2 can be replaced by other suitable stochastic optimization algorithms for solving a strongly convex problem, similar to the developments in (Xu et al., 2018a) for minimizing DC functions. For example, one can use adaptive stochastic gradient methods in order to enjoy an adaptive convergence, and one can use variance reduction methods if the involved functions are smooth and have a finite-sum structure to achieve an improved convergence.

## 5. Application to Variance-based Regularization

Table 2. Data statistics.

| Datasets | #Examples | #Features | #pos:#neg |
|---|---|---|---|
| a9a | 32,561 | 123 | 0.3172:1 |
| covtype | 581,012 | 54 | 1.0509:1 |
| RCV1 | 697,641 | 47,236 | 1.1033:1 |
| URL | 2,396,130 | 3,231,961 | 0.4939:1 |

In this section, we consider the application of the proposed algorithms to variance-based regularization in machine learning. Let $l(\theta, z)\in\mathbb{R}_+$ denote the loss of a model $\theta\in\Theta$ on a random datum $z$. A fundamental task in machine learning is to minimize the expected risk $R(\theta) = \mathbb{E}_z[l(\theta, z)]$. However, in practice one has to find an approximate model based on sampled data $S_n = \{z_1,\ldots,z_n\}$. An advanced learning theory based on Bennett's inequality bounds the expected risk by (Maurer & Pontil, 2009)
$$R(\theta) \le \frac{1}{n}\sum_{i=1}^n l(\theta, z_i) + c_1\sqrt{\frac{\mathrm{Var}_n(\theta, S_n)}{n}} + \frac{c_2}{n},$$
where $c_1$ and $c_2$ are constants. This motivates the variance-based regularization approach (Maurer & Pontil, 2009):
$$\min_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^n l(\theta, z_i) + \lambda\sqrt{\mathrm{Var}_n(\theta, S_n)},$$
where $\mathrm{Var}_n(\theta, S_n) = \frac{1}{n}\sum_{i=1}^n (l(\theta, z_i) - \bar l_n(\theta))^2$ is the empirical variance of the loss, $\bar l_n(\theta)$ is the average empirical loss, and $\lambda > 0$ is a regularization parameter. However, the above formulation does not favor efficient stochastic algorithms. To tackle the optimization problem of variance-based regularization, (Namkoong & Duchi, 2017) proposed a min-max formulation based on distributionally robust optimization, given below, and stochastic algorithms were proposed for solving the resulting min-max formulation when the loss function is convex (Namkoong & Duchi, 2016):
$$\min_{\theta\in\Theta}\ \max_{P\in\Delta_n}\ \Big\{\sum_{i=1}^n P_i\, l(\theta, z_i)\ :\ D_\phi(P\,\|\,\hat P_n)\le\rho\Big\}, \tag{9}$$
where $\rho > 0$ is a hyper-parameter, $\Delta_n = \{P\in\mathbb{R}^n : P\ge 0,\ \sum_{i=1}^n P_i = 1\}$, $\hat P_n = (1/n,\ldots,1/n)^\top$, and $D_\phi(P\|Q) = \int\phi\big(\frac{dP}{dQ}\big)\,dQ$ is the $\phi$-divergence based on $\phi(t) = \frac12(t-1)^2$. The min-max formulation is convex-concave when the loss function is convex. Nevertheless, the stochastic optimization algorithms proposed for solving the min-max formulation are not scalable, because it introduces an $n$-dimensional dual variable $P$ that is restricted to a probability simplex. As a result, the per-iteration cost can be dominated by updating the dual variable, which scales as $O(n)$ and is prohibitive when the training set is large. Although one can use a special structure and a stochastic coordinate update on $P$ to reduce the per-iteration cost to $O(\log(n))$ (Namkoong & Duchi, 2016), the iteration complexity could still be blown up by a factor of up to $n$ due to the variance of the stochastic gradient on $P$.
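To see this bottleneck concretely, below is a sketch of one (simplified) ascent step on the dual variable $P$, using a Euclidean projection onto the simplex in place of the exact Bregman step and ignoring the $\phi$-divergence-ball constraint of (9); even this simplified step touches all $n$ coordinates.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex; O(n log n)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[idx] / (idx + 1.0)
    return np.maximum(v - theta, 0.0)

def dual_step(P, per_example_losses, eta_P):
    """One (simplified) ascent step on P for max_P <P, losses>: it reads and
    writes all n coordinates, so the per-iteration cost is at least O(n),
    even before enforcing the constraint D_phi(P || P_hat_n) <= rho."""
    return project_simplex(P + eta_P * per_example_losses)
```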
As a potential solution to the scalability issue, we consider the following reformulation. Using the identity $\sqrt{a} = \min_{\alpha>0}\ \frac{a}{2\alpha} + \frac{\alpha}{2}$,
$$\min_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^n l(\theta, z_i) + \lambda\sqrt{\mathrm{Var}_n(\theta, S_n)} = \min_{\theta\in\Theta,\,\alpha>0}\ \frac{1}{n}\sum_{i=1}^n l(\theta, z_i) + \frac{\lambda}{2\alpha}\Big[\frac{1}{n}\sum_{i=1}^n (l(\theta, z_i))^2 - \big(\mathbb{E}_i[l(\theta, z_i)]\big)^2\Big] + \frac{\lambda\alpha}{2},$$
where $\mathbb{E}_i[l(\theta, z_i)] = \frac{1}{n}\sum_{i=1}^n l(\theta, z_i)$. In practice, one usually needs to tune the regularization parameter $\lambda$ in order to achieve the best performance. As a result, we can further simplify the problem by absorbing $\alpha$ into the regularization parameter $\lambda$, ending up with the following formulation by noting that $-\frac12 s^2 = \min_{y\ge 0}\ \frac12 y^2 - ys$ for $s\ge 0$:
$$\min_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^n l(\theta, z_i) + \frac{\lambda}{2}\,\frac{1}{n}\sum_{i=1}^n (l(\theta, z_i))^2 + \lambda\Big\{\min_{y\ge 0}\ \frac12 y^2 - y\,\frac{1}{n}\sum_{i=1}^n l(\theta, z_i)\Big\}. \tag{11}$$
It is notable that the above formulation introduces only one additional scalar variable $y\in\mathbb{R}_+$, though the problem may become non-convex in $\theta$. However, when the loss function $l(\theta, z)$ itself is non-convex, the min-max formulation (9) also loses its convexity, which makes our inf-projection formulation more favorable.

We conduct experiments to verify the efficacy of the inf-projection formulation and the proposed stochastic algorithms in comparison with stochastic algorithms for solving the min-max formulation (9). We perform two experiments on four datasets, a9a, covtype, RCV1, and URL from the LIBSVM website, whose numbers of examples are $n = 32{,}561$, $581{,}012$, $697{,}641$, and $2{,}396{,}130$, respectively (Table 2). For each dataset, we randomly sample 80% as training data and use the rest as testing data. We evaluate the training error and testing error of our algorithms and the baselines versus CPU time.

In the first experiment, we use the (convex) logistic loss for $l(\theta, z_i)$ in our inf-projection formulation (11) and in the min-max formulation (9). We compare our St-SPG with the stochastic algorithm Bandit Mirror Descent (BMD) proposed in (Namkoong & Duchi, 2016). We implement two versions of BMD: one using the standard mirror descent method to update the dual variable $P$, and the other (denoted by BMD-eff) exploiting a binary search tree (BST) to update $P$. To this end, the latter needs a modified constraint on $P$, i.e., $P\in\{p\in\mathbb{R}^n_+ : p_i\ge\delta/n,\ \frac{n^2}{2}\|p - \mathbf{1}/n\|^2\le\rho\}$ (see Sec. 4 of (Namkoong & Duchi, 2016)). We tune hyper-parameters over reasonable ranges: for St-SPG, $\lambda\in\{10^{-5:2}\}$ and $\gamma,\mu\in\{10^{-3:3}\}$; for BMD and BMD-eff, the step size $\eta_P\in\{10^{-8:-15}\}$ for updating $P$, the step size $\eta_\theta\in\{10^{-5:3}\}$ for updating $\theta$, $\rho\in\{n\cdot 10^{-3:3}\}$, with $\delta = 10^{-5}$ fixed. Training and testing errors against CPU time (s) of the three algorithms on the four datasets are reported in Figure 1.

In the second experiment, we use the (non-convex) truncated logistic loss in (11) and (9). In particular, the truncated loss is $\phi(l(\theta, z_i)) = \alpha\log(1 + l(\theta, z_i)/\alpha)$, where $l$ is the logistic loss and we set $\alpha = 10n$ as suggested in (Xu et al., 2018b). Since the loss is non-convex, we compare MSPG with the proximally guided stochastic mirror descent (PGSMD) (Rafique et al., 2018) and its efficient variant (denoted by PGSMD-eff) for solving the min-max formulation, which is non-convex-concave; the efficient variant is implemented with the same modified constraint on $P$ and the same BST as BMD-eff. For MSPG, we tune $\lambda\in\{10^{-5:2}\}$ and the step size parameter $c$ of Proposition 1 from $\{10^{-5:2}\}$. Hyper-parameters of PGSMD and PGSMD-eff, including $\eta_P$, $\eta_\theta$, $\rho$, and $\delta$, are selected over the same ranges as in the first experiment, and the weak convexity parameter $\rho_{wc}$ is chosen from $\{10^{-5:5}\}$. Training and testing errors against CPU time (s) of the three algorithms on the four datasets are reported in Figure 2.
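For reference, a minimal sketch of the per-sample quantities used by MSPG on (11) with the truncated logistic loss is given below; the binary labels $b\in\{-1,+1\}$ and all names are our own illustration.

```python
import numpy as np

def truncated_logistic(theta, a, b, alpha):
    """Truncated logistic loss phi(l) = alpha*log(1 + l/alpha), where
    l = log(1 + exp(-b * <a, theta>)) and b is a label in {-1, +1}."""
    margin = b * (a @ theta)
    l = np.log1p(np.exp(-margin))
    grad_l = -b * a / (1.0 + np.exp(margin))   # gradient of the logistic loss
    phi = alpha * np.log1p(l / alpha)
    grad_phi = grad_l / (1.0 + l / alpha)      # chain rule: phi'(l) = 1/(1 + l/alpha)
    return phi, grad_phi

def one_sample_grads(theta, y, a, b, lam, alpha):
    """One-sample stochastic partial gradients for the inf-projection objective
    (11) with the loss replaced by the truncated loss phi_i:
    f(theta, y) = phi_i + (lam/2)*phi_i^2 + lam*(y^2/2 - y*phi_i)."""
    phi, grad_phi = truncated_logistic(theta, a, b, alpha)
    g_theta = grad_phi * (1.0 + lam * phi - lam * y)
    g_y = lam * (y - phi)                      # keep y >= 0 after each update
    return g_theta, g_y
```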
We can draw two conclusions from the results of both experiments. First, the training and testing errors from solving the inf-projection formulation (11) converge to a level that is close to, or even lower than, those from solving the min-max formulation (9), which verifies the efficacy of the inf-projection formulation. Second, the proposed stochastic algorithms significantly improve the convergence of training/testing errors with respect to CPU time, especially on the large datasets covtype, RCV1, and URL.

Figure 1. Results of variance-based regularization with the (convex) logistic loss: training and testing errors vs. CPU time (s) for BMD, BMD-eff, and St-SPG on a9a, covtype, RCV1, and URL.

Figure 2. Results of variance-based regularization with the (non-convex) truncated logistic loss: training and testing errors vs. CPU time (s) for PGSMD, PGSMD-eff, and MSPG on a9a, covtype, RCV1, and URL.

## 6. Conclusion

In this paper, we designed and analyzed stochastic optimization algorithms for a family of inf-projection minimization problems. We showed that the concerned inf-projection structure covers a variety of special cases, including DC functions and bi-convex functions (non-smooth functions in Section 4) and another family of inf-projection formulations (smooth functions in Section 3). We developed stochastic optimization algorithms for these problems with theoretical guarantees of first-order convergence for finding a (nearly) $\epsilon$-stationary solution with complexity $O(1/\epsilon^{4/v})$. To the best of our knowledge, this is the first work to provide a comprehensive convergence analysis for stochastic optimization of non-convex inf-projection minimization problems. Additionally, to verify the significance of our inf-projection formulation, we investigated an important machine learning problem, variance-based regularization, and compared our algorithms with baselines for the min-max formulation (distributionally robust optimization). Empirical results demonstrate the significance and effectiveness of the proposed algorithms.

## References

Bolte, J., Sabach, S., and Teboulle, M. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program., 146(1-2):459–494, 2014.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Chen, Z., Yang, T., Yi, J., Zhou, B., and Chen, E. Universal stagewise learning for non-convex problems with convergence on averaged solutions. CoRR, abs/1808.06296, 2018.

Davis, D. and Drusvyatskiy, D. Stochastic model-based minimization of weakly convex functions. CoRR, abs/1803.06523, 2018a.

Davis, D. and Drusvyatskiy, D.
Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions. CoRR, abs/1802.02988, 2018b.

Davis, D. and Grimmer, B. Proximally guided stochastic subgradient method for nonsmooth, nonconvex problems. arXiv preprint arXiv:1707.03505, 2017.

Davis, D., Edmunds, B., and Udell, M. The sound of APALM clapping: Faster nonsmooth nonconvex optimization with stochastic asynchronous PALM. In Advances in Neural Information Processing Systems, pp. 226–234, 2016.

Driggs, D., Tang, J., Davies, M., and Schönlieb, C.-B. SPRING: A fast stochastic proximal alternating method for non-smooth non-convex optimization. arXiv preprint arXiv:2002.12266, 2020.

Ghadimi, S., Lan, G., and Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program., 155(1-2):267–305, 2016.

Gorski, J., Pfeuffer, F., and Klamroth, K. Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407, 2007.

Hong, M., Razaviyayn, M., Luo, Z.-Q., and Pang, J.-S. A unified algorithmic framework for block-structured optimization involving big data: With applications in machine learning and signal processing. IEEE Signal Processing Magazine, 33(1):57–77, 2015.

Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems 30, pp. 1675–1685. Curran Associates, Inc., 2017.

Kumar, M. P., Packer, B., and Koller, D. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems 23, pp. 1189–1197, 2010.

Maurer, A. and Pontil, M. Empirical Bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

Namkoong, H. and Duchi, J. C. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems (NIPS), pp. 2208–2216, 2016.

Namkoong, H. and Duchi, J. C. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems (NIPS), pp. 2975–2984, 2017.

Nesterov, Y. Universal gradient methods for convex optimization problems. Mathematical Programming, 152(1):381–404, 2015. doi: 10.1007/s10107-014-0790-0. URL https://doi.org/10.1007/s10107-014-0790-0.

Nitanda, A. and Suzuki, T. Stochastic difference of convex algorithm and its application to training deep Boltzmann machines. In Artificial Intelligence and Statistics, pp. 470–478, 2017.

Rafique, H., Liu, M., Lin, Q., and Yang, T. Non-convex min-max optimization: Provable algorithms and applications in machine learning. CoRR, abs/1810.02060, 2018.

Rockafellar, R. T. and Wets, R. J.-B. Variational Analysis. Springer Verlag, Heidelberg, Berlin, New York, 1998.

Rockafellar, R. T. and Wets, R. J.-B. Variational Analysis, volume 317. Springer Science & Business Media, 2009.

Shah, S., Yadav, A. K., Castillo, C. D., Jacobs, D. W., Studer, C., and Goldstein, T. Biconvex relaxation for semidefinite programming in computer vision. In ECCV (6), volume 9910 of Lecture Notes in Computer Science, pp. 717–735. Springer, 2016.

Shalev-Shwartz, S. and Singer, Y. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. Machine Learning, 80(2-3):141–163, 2010.

Thi, H. A. L., Le, H. M., Phan, D.
N., and Tran, B. Stochastic DCA for the large-sum of non-convex functions problem and its application to group variable selection in classification. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3394–3403, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/thi17a.html.

Xu, Y. and Yin, W. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.

Xu, Y., Qi, Q., Lin, Q., Jin, R., and Yang, T. Stochastic optimization for DC functions and non-smooth non-convex regularizers with non-asymptotic convergence. arXiv preprint arXiv:1811.11829, 2018a.

Xu, Y., Zhu, S., Yang, S., Zhang, C., Jin, R., and Yang, T. Learning with non-convex truncated losses by SGD. CoRR, abs/1805.07880, 2018b. URL http://arxiv.org/abs/1805.07880.

Xu, Y., Jin, R., and Yang, T. Stochastic proximal gradient methods for non-smooth non-convex regularized problems. CoRR, abs/1902.07672, 2019.

Zhao, P. and Zhang, T. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 1–9, 2015.