# Stochastic Three-Composite Convex Minimization

Alp Yurtsever, Bằng Công Vũ, and Volkan Cevher
Laboratory for Information and Inference Systems (LIONS)
École Polytechnique Fédérale de Lausanne, Switzerland
alp.yurtsever@epfl.ch, bang.vu@epfl.ch, volkan.cevher@epfl.ch

We propose a stochastic optimization method for the minimization of the sum of three convex functions, one of which has a Lipschitz continuous gradient as well as restricted strong convexity. Our approach is most suitable in the setting where it is computationally advantageous to process the smooth term in the decomposition with its stochastic gradient estimate and the other two functions separately with their proximal operators, as in doubly regularized empirical risk minimization problems. We prove a convergence characterization of the proposed algorithm in expectation under the standard assumptions on the stochastic gradient estimate of the smooth term. Our method operates in the primal space and can be considered as a stochastic extension of the three-operator splitting method. Numerical evidence supports the effectiveness of our method in real-world problems.

1 Introduction

We propose a stochastic optimization method for the three-composite minimization problem:

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad f(x) + g(x) + h(x), \qquad (1)$$

where $f: \mathbb{R}^d \to \mathbb{R}$ and $g: \mathbb{R}^d \to \mathbb{R}$ are proper, lower semicontinuous convex functions that admit tractable proximal operators, and $h: \mathbb{R}^d \to \mathbb{R}$ is a smooth function with restricted strong convexity. We assume that we have access to unbiased stochastic estimates of the gradient of $h$ in the sequel, which is key to scaling up optimization and to addressing streaming settings where data arrive in time.

Template (1) covers a large number of applications in machine learning, statistics, and signal processing by appropriately choosing the individual terms. Operator splitting methods are powerful in this setting, since they reduce the complex problem (1) to smaller subproblems. These algorithms are easy to implement, and they typically exhibit state-of-the-art performance.

To our knowledge, there is no operator splitting framework that can currently tackle template (1) using the stochastic gradient of $h$ and the proximal operators of $f$ and $g$ separately, which is critical to the scalability of such methods. This paper specifically bridges this gap. Our basic framework is closely related to the deterministic three-operator splitting method proposed in [11], but we avoid the computation of the gradient $\nabla h$ and instead work with its unbiased estimates. We provide rigorous convergence guarantees for our approach and provide guidance in selecting the learning rate under different scenarios.

Road map. Section 2 introduces the basic optimization background. Section 3 then presents the main algorithm and provides its convergence characterization. Section 4 places our contributions in light of the existing work. Numerical evidence that illustrates our theory appears in Section 5. We relegate the technical proofs to the supplementary material.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

2 Notation and background

This section recalls a few basic notions from convex analysis and probability theory, and presents the notation used in the rest of the paper. Throughout, $\Gamma_0(\mathbb{R}^d)$ denotes the set of all proper, lower semicontinuous convex functions from $\mathbb{R}^d$ to $\left]-\infty, +\infty\right]$, and $\langle \cdot \mid \cdot \rangle$ is the standard scalar product on $\mathbb{R}^d$ with its associated norm $\|\cdot\|$.
Subdifferential. The subdifferential of $f \in \Gamma_0(\mathbb{R}^d)$ at a point $x \in \mathbb{R}^d$ is defined as

$$\partial f(x) = \{u \in \mathbb{R}^d \mid f(y) - f(x) \geq \langle y - x \mid u \rangle, \ \forall y \in \mathbb{R}^d\}.$$

We denote the domain of $\partial f$ by $\mathrm{dom}(\partial f) = \{x \in \mathbb{R}^d \mid \partial f(x) \neq \varnothing\}$. If $\partial f(x)$ is a singleton, then $f$ is differentiable at $x$, and $\partial f(x) = \{\nabla f(x)\}$.

Indicator function. Given a nonempty subset $C$ of $\mathbb{R}^d$, the indicator function of $C$ is given by

$$\iota_C(x) = \begin{cases} 0 & \text{if } x \in C, \\ +\infty & \text{if } x \notin C. \end{cases} \qquad (2)$$

Proximal operator. The proximal operator of a function $f \in \Gamma_0(\mathbb{R}^d)$ is defined as

$$\mathrm{prox}_f(x) = \arg\min_{z \in \mathbb{R}^d} \ f(z) + \frac{1}{2}\|z - x\|^2. \qquad (3)$$

Roughly speaking, the proximal operator is tractable when the computation of (3) is cheap. If $f$ is the indicator function of a nonempty, closed convex subset $C$, its proximal operator is the projection operator onto $C$.

Lipschitz continuous gradient. A function $f \in \Gamma_0(\mathbb{R}^d)$ has a Lipschitz continuous gradient with Lipschitz constant $L > 0$ (or simply an $L$-Lipschitz gradient) if

$$\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

Strong convexity. A function $f \in \Gamma_0(\mathbb{R}^d)$ is called strongly convex with parameter $\mu > 0$ (or simply $\mu$-strongly convex) if

$$\langle p - q \mid x - y \rangle \geq \mu\|x - y\|^2, \quad \forall x, y \in \mathrm{dom}(\partial f), \ \forall p \in \partial f(x), \ \forall q \in \partial f(y).$$

Solution set. We denote an optimum point of (1) by $x^\star$, and the solution set by $\mathcal{X}^\star$:

$$\mathcal{X}^\star = \{x^\star \in \mathbb{R}^d \mid 0 \in \nabla h(x^\star) + \partial g(x^\star) + \partial f(x^\star)\}.$$

Throughout this paper, we assume that $\mathcal{X}^\star$ is not empty.

Restricted strong convexity. A function $f \in \Gamma_0(\mathbb{R}^d)$ has restricted strong convexity with respect to a point $x^\star$ on a set $\mathcal{M} \subseteq \mathrm{dom}(\partial f)$, with parameter $\mu > 0$, if

$$\langle p - q \mid x - x^\star \rangle \geq \mu\|x - x^\star\|^2, \quad \forall x \in \mathcal{M}, \ \forall p \in \partial f(x), \ \forall q \in \partial f(x^\star).$$

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. An $\mathbb{R}^d$-valued random variable is a measurable function $x: \Omega \to \mathbb{R}^d$, where $\mathbb{R}^d$ is endowed with the Borel $\sigma$-algebra. We denote by $\sigma(x)$ the $\sigma$-field generated by $x$. The expectation of a random variable $x$ is denoted by $\mathbb{E}[x]$. The conditional expectation of $x$ given a $\sigma$-field $\mathcal{A} \subseteq \mathcal{F}$ is denoted by $\mathbb{E}[x \mid \mathcal{A}]$. Given a random variable $y: \Omega \to \mathbb{R}^d$, the conditional expectation of $x$ given $y$ is denoted by $\mathbb{E}[x \mid y]$. See [17] for more details on probability theory. An $\mathbb{R}^d$-valued random process is a sequence $(x_n)_{n \in \mathbb{N}}$ of $\mathbb{R}^d$-valued random variables.

3 Stochastic three-composite minimization algorithm and its analysis

We present the stochastic three-composite minimization method (S3CM) in Algorithm 1 for solving the three-composite template (1). Our approach combines the stochastic gradient of $h$, denoted by $r$, and the proximal operators of $f$ and $g$ in essentially the same structure as the three-operator splitting method [11, Algorithm 2]. Our technique is a nontrivial combination of the algorithmic framework of [11] with stochastic analysis.

Algorithm 1 Stochastic three-composite minimization algorithm (S3CM)
Input: An initial point $x_{f,0}$, a sequence of learning rates $(\gamma_n)_{n \in \mathbb{N}}$, and a sequence of squared-integrable $\mathbb{R}^d$-valued stochastic gradient estimates $(r_n)_{n \in \mathbb{N}}$.
Initialization:
  $x_{g,0} = \mathrm{prox}_{\gamma_0 g}(x_{f,0})$
  $u_{g,0} = \gamma_0^{-1}(x_{f,0} - x_{g,0})$
Main loop: for $n = 0, 1, 2, \ldots$ do
  $x_{g,n+1} = \mathrm{prox}_{\gamma_n g}(x_{f,n} + \gamma_n u_{g,n})$
  $u_{g,n+1} = \gamma_n^{-1}(x_{f,n} - x_{g,n+1}) + u_{g,n}$
  $x_{f,n+1} = \mathrm{prox}_{\gamma_{n+1} f}(x_{g,n+1} - \gamma_{n+1} u_{g,n+1} - \gamma_{n+1} r_{n+1})$
end for
Output: $x_{g,n}$ as an approximation of an optimal solution $x^\star$.

Theorem 1 Assume that $h$ is $\mu_h$-strongly convex and has an $L$-Lipschitz continuous gradient. Further assume that $g$ is $\mu_g$-strongly convex, where we allow $\mu_g = 0$. Consider the following update rule for the learning rate:

$$\gamma_{n+1} = \frac{-\gamma_n^2 \mu_h \eta + \sqrt{(\gamma_n^2 \mu_h \eta)^2 + (1 + 2\gamma_n \mu_g)\,\gamma_n^2}}{1 + 2\gamma_n \mu_g},$$

for some $\gamma_0 > 0$ and $\eta \in \left]0, 1\right[$. Define $\mathcal{F}_n = \sigma(x_{f,k})_{0 \leq k \leq n}$, and suppose that the following conditions hold for every $n \in \mathbb{N}$:

1. $\mathbb{E}[r_{n+1} \mid \mathcal{F}_n] = \nabla h(x_{g,n+1})$ almost surely;
2. there exist $c \in [0, +\infty[$ and $t \in \mathbb{R}$ such that $\sum_{k=0}^{n} \mathbb{E}[\|r_k - \nabla h(x_{g,k})\|^2] \leq c\,n^t$.

Then, the iterates of S3CM satisfy

$$\mathbb{E}[\|x_{g,n} - x^\star\|^2] = O(1/n^2) + O(1/n^{2-t}). \qquad (4)$$
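To make the updates concrete, the following is a minimal Python sketch of Algorithm 1 together with the learning-rate recursion of Theorem 1. It assumes the caller supplies the two proximal operators, an oracle for the stochastic gradient of $h$, and a learning-rate schedule; the function names and the memoized recursion helper are ours, for illustration, and do not appear in the paper.

```python
import numpy as np

def s3cm(prox_f, prox_g, grad_est, x_f0, gamma, n_iter):
    """Sketch of S3CM (Algorithm 1).

    prox_f, prox_g : (x, gam) -> prox_{gam*f}(x), prox_{gam*g}(x)
    grad_est       : (x, n)   -> unbiased estimate r_n of grad h(x)
    gamma          : n        -> learning rate gamma_n
    """
    x_f = np.asarray(x_f0, dtype=float)
    x_g = prox_g(x_f, gamma(0))                   # x_{g,0}
    u_g = (x_f - x_g) / gamma(0)                  # u_{g,0}
    for n in range(n_iter):
        gn, gn1 = gamma(n), gamma(n + 1)
        x_g = prox_g(x_f + gn * u_g, gn)          # x_{g,n+1}
        u_g = (x_f - x_g) / gn + u_g              # u_{g,n+1}
        r = grad_est(x_g, n + 1)                  # r_{n+1}: E[r | F_n] = grad h(x_{g,n+1})
        x_f = prox_f(x_g - gn1 * (u_g + r), gn1)  # x_{f,n+1}
    return x_g                                    # approximation of x*

def theorem1_gamma(gamma0, mu_h, mu_g, eta):
    """Memoized learning-rate recursion of Theorem 1."""
    seq = [gamma0]
    def gamma(n):
        while len(seq) <= n:
            g = seq[-1]
            a = g * g * mu_h * eta
            seq.append((-a + np.sqrt(a * a + (1 + 2 * g * mu_g) * g * g))
                       / (1 + 2 * g * mu_g))
        return seq[n]
    return gamma
```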
Remark 1 The variance condition on the stochastic gradient estimates in the theorems above is satisfied when $\mathbb{E}[\|r_n - \nabla h(x_{g,n})\|^2] \leq c$ for all $n \in \mathbb{N}$ and some constant $c \in [0, +\infty[$. See [15, 22, 26] for details.

Remark 2 When $r_n = \nabla h(x_n)$, S3CM reduces to the deterministic three-operator splitting scheme [11, Algorithm 2], and we recover the convergence rate $O(1/n^2)$ as in [11]. When $g$ is zero, S3CM reduces to the standard stochastic proximal gradient algorithm [2, 13, 26].

Remark 3 The learning rate sequence $(\gamma_n)_{n \in \mathbb{N}}$ in Theorem 1 depends on the strong convexity parameter $\mu_h$, which may not be available a priori. Our next result avoids explicit reliance on the strong convexity parameter, while providing essentially the same convergence rate.

Theorem 2 Assume that $h$ is $\mu_h$-strongly convex and has an $L$-Lipschitz continuous gradient. Consider a positive decreasing learning rate sequence $\gamma_n = \Theta(1/n^\alpha)$ for some $\alpha \in \left]0, 1\right]$, and denote $\beta = \lim_{n \to \infty} 2\mu_h n^\alpha \gamma_n$. Define $\mathcal{F}_n = \sigma(x_{f,k})_{0 \leq k \leq n}$, and suppose that the following conditions hold for every $n \in \mathbb{N}$:

1. $\mathbb{E}[r_{n+1} \mid \mathcal{F}_n] = \nabla h(x_{g,n+1})$ almost surely;
2. $\mathbb{E}[\|r_n - \nabla h(x_{g,n})\|^2]$ is uniformly bounded by some positive constant;
3. $\mathbb{E}[\|u_{g,n} - x^\star\|^2]$ is uniformly bounded by some positive constant.

Then, the iterates of S3CM satisfy

$$\mathbb{E}[\|x_{g,n} - x^\star\|^2] = \begin{cases} O(1/n^\alpha) & \text{if } 0 < \alpha < 1, \\ O(1/n^\beta) & \text{if } \alpha = 1 \text{ and } \beta < 1, \\ O((\log n)/n) & \text{if } \alpha = 1 \text{ and } \beta = 1, \\ O(1/n) & \text{if } \alpha = 1 \text{ and } \beta > 1. \end{cases}$$

Proof outline. We take the proof of the three-operator splitting method as a baseline, and we use stochastic fixed-point theory to derive the convergence of the iterates via a stochastic Fejér monotone sequence. See the supplement for the complete proof.

Remark 4 Note that $u_{g,n} \in \partial g(x_{g,n})$. Hence, we can replace condition 3 in Theorem 2 with the bounded subgradient assumption: $\|p\| \leq c$ for all $p \in \partial g(x_{g,n})$, for some positive constant $c$.

Remark 5 (Restricted strong convexity) Let $\mathcal{M}$ be a subset of $\mathbb{R}^d$ that contains $(x_{g,n})_{n \in \mathbb{N}}$ and $x^\star$. Suppose that $h$ has restricted strong convexity on $\mathcal{M}$ with parameter $\mu_h$. Then Theorems 1 and 2 still hold. An example role of the restricted strong convexity assumption in algorithmic convergence can be found in [1, 21].

Remark 6 (Extension to an arbitrary number of non-smooth terms) Using the product space technique [5, Section 6.1], S3CM can be applied to composite problems with an arbitrary number of non-smooth terms:

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad \sum_{i=1}^{m} f_i(x) + h(x),$$

where the $f_i: \mathbb{R}^d \to \mathbb{R}$ are proper, lower semicontinuous convex functions, and $h: \mathbb{R}^d \to \mathbb{R}$ is a smooth function with restricted strong convexity. We present this variant in Algorithm 2, with a sketch following it. Theorems 1 and 2 hold for this variant, replacing $x_{g,n}$ by $x_n$ and $u_{g,n}$ by $u_{i,n}$ for $i = 1, 2, \ldots, m$.

Algorithm 2 Stochastic m(ulti)-composite minimization algorithm (SmCM)
Input: Initial points $\{x_{f_1,0}, x_{f_2,0}, \ldots, x_{f_m,0}\}$, a sequence of learning rates $(\gamma_n)_{n \in \mathbb{N}}$, and a sequence of squared-integrable $\mathbb{R}^d$-valued stochastic gradient estimates $(r_n)_{n \in \mathbb{N}}$.
Initialization:
  $x_0 = m^{-1} \sum_{i=1}^{m} x_{f_i,0}$
  for $i = 1, 2, \ldots, m$ do $u_{i,0} = \gamma_0^{-1}(x_{f_i,0} - x_0)$ end for
Main loop: for $n = 0, 1, 2, \ldots$ do
  $x_{n+1} = m^{-1} \sum_{i=1}^{m}(x_{f_i,n} + \gamma_n u_{i,n})$
  for $i = 1, 2, \ldots, m$ do
    $u_{i,n+1} = \gamma_n^{-1}(x_{f_i,n} - x_{n+1}) + u_{i,n}$
    $x_{f_i,n+1} = \mathrm{prox}_{\gamma_{n+1} m f_i}(x_{n+1} - \gamma_{n+1} u_{i,n+1} - \gamma_{n+1} r_{n+1})$
  end for
end for
Output: $x_n$ as an approximation of an optimal solution $x^\star$.
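Below is a minimal sketch of the product-space variant, in the same hypothetical interface as the S3CM sketch above. Note that the prox of $\gamma_{n+1} m f_i$ is obtained here by passing the scaled parameter $m\gamma_{n+1}$ to the user-supplied prox.

```python
import numpy as np

def smcm(proxes, grad_est, x_f0s, gamma, n_iter):
    """Sketch of SmCM (Algorithm 2).

    proxes : list of m callables, proxes[i](x, gam) = prox_{gam * f_i}(x)
    x_f0s  : list of m initial points
    """
    m = len(proxes)
    x_f = [np.asarray(x, dtype=float) for x in x_f0s]
    x = sum(x_f) / m                                      # x_0
    u = [(x_f[i] - x) / gamma(0) for i in range(m)]       # u_{i,0}
    for n in range(n_iter):
        gn, gn1 = gamma(n), gamma(n + 1)
        x = sum(x_f[i] + gn * u[i] for i in range(m)) / m # x_{n+1}
        r = grad_est(x, n + 1)                            # r_{n+1}
        for i in range(m):
            u[i] = (x_f[i] - x) / gn + u[i]               # u_{i,n+1}
            # prox of gamma_{n+1} * m * f_i via the scaled parameter m * gamma_{n+1}
            x_f[i] = proxes[i](x - gn1 * (u[i] + r), m * gn1)
    return x
```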
Remark 7 With a proper learning rate, S3CM still converges under mild assumptions even if $h$ is not (restricted) strongly convex. Suppose that $h$ has an $L$-Lipschitz continuous gradient. Set the learning rate such that $\varepsilon \leq \gamma_n \leq \gamma \leq \alpha(2L^{-1} - \varepsilon)$, for some $\alpha$ and $\varepsilon$ in $\left]0, 1\right[$. Define $\mathcal{F}_n = \sigma(x_{f,k})_{0 \leq k \leq n}$, and suppose that the following conditions hold for every $n \in \mathbb{N}$:

1. $\mathbb{E}[r_{n+1} \mid \mathcal{F}_n] = \nabla h(x_{g,n+1})$ almost surely;
2. $\sum_{n \in \mathbb{N}} \mathbb{E}[\|r_{n+1} - \nabla h(x_{g,n+1})\|^2 \mid \mathcal{F}_n] < +\infty$ almost surely.

Then $(x_{g,n})_{n \in \mathbb{N}}$ converges to an $\mathcal{X}^\star$-valued random vector almost surely. See [7] for details.

Remark 8 All the results above hold in any separable Hilbert space, except that the strong convergence in Remark 7 is replaced by weak convergence. Note, however, that extending Remark 7 to the variable metric setting as in [10, 27] is an open problem.

4 Contributions in the light of prior work

Recent algorithms in operator splitting, such as generalized forward-backward splitting [24], forward-Douglas-Rachford splitting [5], and the three-operator splitting [11], apply to our problem template (1). These key results, however, are in the deterministic setting. Our basic framework can be viewed as a combination of the three-operator splitting method in [11] with stochastic analysis.

The idea of using unbiased estimates of the gradient dates back to [25]. Recent developments of this idea can be viewed as proximal-based methods for solving the generic composite convex minimization template with a single non-smooth term [2, 9, 12, 13, 15, 16, 19, 23, 26]. This generic form arises naturally in regularized or constrained composite problems [3, 13, 20], where the smooth term typically encodes the data fidelity. These methods require the evaluation of the joint prox of $f$ and $g$ when applied to the three-composite template (1). Unfortunately, evaluating the joint prox is arguably more expensive than evaluating the individual prox operators. To make the comparison stark, consider the simple example where $f$ and $g$ are indicator functions of two convex sets. Even if the projections onto the individual sets are easy to compute, projection onto the intersection of these sets can be challenging.

The related literature also contains algorithms that solve some specific instances of template (1). To point out a few, the random averaging projection method [28] handles multiple constraints simultaneously but cannot deal with regularizers. On the other hand, accelerated stochastic gradient descent with proximal average [29] can handle multiple regularizers simultaneously, but the algorithm imposes a Lipschitz condition on the regularizers and hence cannot deal with constraints.

To our knowledge, our method is the first operator splitting framework that can tackle the optimization template (1) using the stochastic gradient estimate of $h$ and the proximal operators of $f$ and $g$ separately, without any restriction on the non-smooth parts except that their subdifferentials are maximally monotone. When $h$ is strongly convex, under mild assumptions and with a proper learning rate, our algorithm converges with an $O(1/n)$ rate, which is optimal for stochastic methods under the strong convexity assumption for this problem class.

5 Numerical experiments

We present numerical evidence to assess the theoretical convergence guarantees of the proposed algorithm. We provide two numerical examples, from Markowitz portfolio optimization and support vector machines. As a baseline, we use the deterministic three-operator splitting method [11]. Even though the random averaging projection method proposed in [28] does not apply to our template (1) in its full generality, it does apply to the specific applications that we present below. In our numerical tests, however, we observed that this method exhibits essentially the same convergence behavior as ours when used with the same learning rate sequence. For clarity of presentation, we omit this method from our results.
5.1 Portfolio optimization

Traditional Markowitz portfolio optimization aims to reduce risk by minimizing the variance for a given expected return. Mathematically, we can formulate this as a convex optimization problem [6]:

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad \mathbb{E}\big[|a_i^T x - b|^2\big] \quad \text{subject to} \quad x \in \Delta, \ a_{\mathrm{av}}^T x \geq b,$$

where $\Delta$ is the standard simplex for portfolios with no short positions, or a simple sum constraint; $a_{\mathrm{av}} = \mathbb{E}[a_i]$ is the vector of average returns for each asset, which is assumed to be known (or estimated); and $b$ encodes a minimum desired return.

This problem has a streaming nature, where new data points arrive in time. Hence, we typically do not have access to the whole dataset, and the stochastic setting is more favorable. For implementation, we replace the expectation with the empirical sample average:

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad \frac{1}{p}\sum_{i=1}^{p}(a_i^T x - b)^2 \quad \text{subject to} \quad x \in \Delta, \ a_{\mathrm{av}}^T x \geq b. \qquad (5)$$

This problem fits into our optimization template (1) by setting

$$h(x) = \frac{1}{p}\sum_{i=1}^{p}(a_i^T x - b)^2, \qquad g(x) = \iota_\Delta(x), \qquad f(x) = \iota_{\{x \,\mid\, a_{\mathrm{av}}^T x \geq b\}}(x).$$

We compute unbiased estimates of the gradient by $r_n = 2(a_{i_n}^T x - b)a_{i_n}$, where the index $i_n$ is chosen uniformly at random.

We use 5 different real portfolio datasets: Dow Jones industrial average (DJIA, with 30 stocks for 507 days), New York stock exchange (NYSE, with 36 stocks for 5651 days), Standard & Poor's 500 (SP500, with 25 stocks for 1276 days), and Toronto stock exchange (TSE, with 88 stocks for 1258 days), which are also considered in [4]; and one dataset by Fama and French (FF100, 100 portfolios formed on size and book-to-market, 23,647 days) that is commonly used in the financial literature, e.g., [6, 14]. We impute the missing data in FF100 using the nearest-neighbor method with Euclidean distance.

Figure 1: Comparison of the deterministic three-operator splitting method [11, Algorithm 2] and our stochastic three-composite minimization method (S3CM) for Markowitz portfolio optimization (5). Results are averaged over 100 Monte-Carlo simulations, and the boundaries of the shaded area are the best and worst instances.

For the deterministic algorithm, we set $\eta = 0.1$, and we evaluate the Lipschitz constant $L$ and the strong convexity parameter $\mu_h$ to determine the step size. For the stochastic algorithm, we do not have access to the whole dataset, so we cannot compute these parameters. Hence, we adopt the learning rate sequence defined in Theorem 2. We simply use $\gamma_n = \gamma_0/(n+1)$ with $\gamma_0 = 1$ for FF100, and $\gamma_0 = 10^3$ for the others.¹ We start both algorithms from the zero vector.

¹Note that a fine-tuned learning rate with a more complex definition can improve the empirical performance, e.g., $\gamma_n = \gamma_0/(n + \zeta)$ for some positive constants $\gamma_0$ and $\zeta$.

We split all the datasets randomly into test (10%) and train (90%) partitions. We set the desired return as the average return over all assets in the training set, $b = \mathrm{mean}(a_{\mathrm{av}})$. Other values of $b$ exhibit qualitatively similar behavior.
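For concreteness, the following sketch instantiates S3CM on problem (5), using the hypothetical `s3cm` routine sketched in Section 3. The simplex projection uses the standard sort-based algorithm, the halfspace projection is in closed form, and the data matrix `A`, target `b`, and learning rate below are illustrative placeholders rather than the real datasets.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection onto the standard simplex (sort-based method)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(x) + 1)
    rho = np.nonzero(u - (css - 1) / idx > 0)[0][-1]
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(x - theta, 0)

def project_halfspace(x, a, b):
    """Euclidean projection onto {x : a^T x >= b}."""
    gap = b - a @ x
    return x + (gap / (a @ a)) * a if gap > 0 else x

# Synthetic stand-in for a returns dataset (the experiments use DJIA, NYSE, ...).
rng = np.random.default_rng(0)
p, d = 2000, 30
A = 0.01 * rng.standard_normal((p, d)) + 0.001  # p days of returns for d assets
a_av = A.mean(axis=0)
b = a_av.mean()                                 # desired return, as in the experiments

prox_g = lambda x, gam: project_simplex(x)      # gam is irrelevant for indicators
prox_f = lambda x, gam: project_halfspace(x, a_av, b)

def grad_est(x, n):                             # r_n = 2 (a_i^T x - b) a_i, i uniform
    a_i = A[rng.integers(p)]
    return 2.0 * (a_i @ x - b) * a_i

x = s3cm(prox_f, prox_g, grad_est, np.zeros(d), lambda n: 1.0 / (n + 1), 20000)
```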
The results of this experiment are compiled in Figure 1. We compute the objective function over the data points in the test partition, $h_{\mathrm{test}}$. We compare our algorithm against the deterministic three-operator splitting method [11, Algorithm 2]. Since we seek statistical solutions, we compare the algorithms at low to medium accuracy. [11] provides other variants of the deterministic algorithm, including two ergodic averaging schemes that feature an improved theoretical rate of convergence. However, these variants performed worse in practice than the original method, and we omit them.

Solid lines in Figure 1 present the average results over 100 Monte-Carlo simulations, and the boundaries of the shaded area are the best and worst instances. We also assess empirical evidence of the $O(1/n)$ convergence rate guaranteed in Theorem 2, by presenting the squared relative distance to the optimal solution for the FF100 dataset. Here, we approximate the ground truth by solving the problem to high accuracy with the deterministic algorithm, running it for $10^5$ iterations.

5.2 Nonlinear support vector machines classification

This section demonstrates S3CM on a support vector machines (SVM) binary classification problem. We are given a training set $A = \{a_1, a_2, \ldots, a_d\}$ and the corresponding class labels $\{b_1, b_2, \ldots, b_d\}$, where $a_i \in \mathbb{R}^p$ and $b_i \in \{-1, +1\}$. The goal is to build a model that correctly assigns new examples to one class or the other. As is common in practice, we solve the dual soft-margin SVM formulation:

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad \frac{1}{2}\sum_{i=1}^{d}\sum_{j=1}^{d} K(a_i, a_j)b_i b_j x_i x_j - \sum_{i=1}^{d} x_i \quad \text{subject to} \quad x \in [0, C]^d, \ b^T x = 0,$$

where $C \in [0, +\infty[$ is the penalty parameter and $K: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is a kernel function. In our example we use the Gaussian kernel, given by $K_\sigma(a_i, a_j) = \exp(-\sigma\|a_i - a_j\|^2)$ for some $\sigma > 0$. Define the symmetric positive semidefinite matrix $M \in \mathbb{R}^{d \times d}$ with entries $M_{ij} = K_\sigma(a_i, a_j)b_i b_j$. Then the problem takes the form

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad \frac{1}{2}x^T M x - \sum_{i=1}^{d} x_i \quad \text{subject to} \quad x \in [0, C]^d, \ b^T x = 0. \qquad (6)$$

This problem fits into the three-composite optimization template (1) with

$$h(x) = \frac{1}{2}x^T M x - \sum_{i=1}^{d} x_i, \qquad g(x) = \iota_{[0,C]^d}(x), \qquad f(x) = \iota_{\{x \,\mid\, b^T x = 0\}}(x).$$

One can solve this problem using the three-operator splitting method [11, Algorithm 1]. Note that $\mathrm{prox}_f$ and $\mathrm{prox}_g$, which are projections onto the corresponding constraint sets, incur $O(d)$ computational cost, whereas the cost of computing the full gradient is $O(d^2)$. To compute an unbiased gradient estimate, we choose an index $i_n$ uniformly at random, and we form

$$r_n = d\, M_{i_n} x_{i_n} - \mathbf{1},$$

where $M_{i_n}$ denotes the $i_n$-th column of the matrix $M$, and $\mathbf{1}$ represents the vector of ones. We can compute $r_n$ in $O(d)$ operations; hence each iteration of S3CM is an order of magnitude cheaper than an iteration of the deterministic algorithm.
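To illustrate the $O(d)$ cost per iteration, here is a sketch of the gradient estimator and the two projections for problem (6). The kernel matrix construction mirrors the setup above, but the code and all names in it are ours, for illustration.

```python
import numpy as np

def build_M(A, b_labels, sigma):
    """M_ij = K_sigma(a_i, a_j) b_i b_j with the Gaussian kernel; A is d x p."""
    sq = np.sum(A * A, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * A @ A.T   # pairwise squared distances
    return np.exp(-sigma * D) * np.outer(b_labels, b_labels)

def svm_grad_est(M, x, rng):
    """Unbiased O(d) estimate of grad h(x) = M x - 1: pick i uniformly at
    random and return d * M[:, i] * x_i - 1, whose expectation is M x - 1."""
    d = M.shape[0]
    i = rng.integers(d)
    return d * M[:, i] * x[i] - 1.0

# Projections used as prox_g and prox_f (both O(d)):
project_box   = lambda x, C: np.clip(x, 0.0, C)                # onto [0, C]^d
project_hyper = lambda x, bl: x - ((bl @ x) / (bl @ bl)) * bl  # onto {x : b^T x = 0}
```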
We use the UCI machine learning dataset "a1a", with $d = 1605$ data points and $p = 123$ features [8, 18]. Note that our goal here is to demonstrate the optimization performance of our algorithm on a real-world problem, rather than to compete with the prediction quality of the best engineered solvers. Hence, to keep the experiments simple, we fix the problem parameters $C = 1$ and $\sigma = 2^{-2}$, and we focus on the effects of the algorithmic parameters on the convergence behavior.

Since $p < d$, $M$ is rank deficient and $h$ is not strongly convex. Nevertheless, we use S3CM with the learning rate $\gamma_n = \gamma_0/(n+1)$ for various values of $\gamma_0$. We observe an $O(1/n)$ empirical convergence rate in the squared relative error for large enough $\gamma_0$, which is guaranteed under the restricted strong convexity assumption. See Figure 2 for the results.

Figure 2: [Left] Convergence of S3CM in the squared relative error with learning rate $\gamma_n = \gamma_0/(n+1)$. [Right] Comparison of the deterministic three-operator splitting method [11, Algorithm 1] and S3CM with $\gamma_0 = 1$ for the SVM classification problem. Results are averaged over 100 Monte-Carlo simulations. Boundaries of the shaded area are the best and worst instances.

Acknowledgments

This work was supported in part by ERC Future Proof, SNF 200021-146750, SNF CRSII2-147633, and NCCR-Marvel.

References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat., 40(5):2452–2482, 2012.
[2] Y. F. Atchadé, G. Fort, and E. Moulines. On stochastic proximal gradient algorithms. arXiv:1402.2365v2, 2014.
[3] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer-Verlag, 2011.
[4] A. Borodin, R. El-Yaniv, and V. Gogan. Can we learn to beat the best stock. In Advances in Neural Information Processing Systems 16, pages 345–352, 2004.
[5] L. M. Briceño-Arias. Forward-Douglas-Rachford splitting and forward-partial inverse method for solving monotone inclusions. Optimization, 64(5):1239–1261, 2015.
[6] J. Brodie, I. Daubechies, C. de Mol, D. Giannone, and I. Loris. Sparse and stable Markowitz portfolios. Proc. Natl. Acad. Sci., 106:12267–12272, 2009.
[7] V. Cevher, B. C. Vũ, and A. Yurtsever. Stochastic forward Douglas-Rachford splitting for monotone inclusions. EPFL-Report-215759, 2016.
[8] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, 2011.
[9] P. L. Combettes and J.-C. Pesquet. Stochastic approximations and perturbations in forward-backward splitting for monotone operators. arXiv:1507.07095v1, 2015.
[10] P. L. Combettes and B. C. Vũ. Variable metric forward-backward splitting with applications to monotone inclusions in duality. Optimization, 63(9):1289–1318, 2014.
[11] D. Davis and W. Yin. A three-operator splitting scheme and its optimization applications. arXiv:1504.01032v1, 2015.
[12] O. Devolder. Stochastic first order methods in smooth convex optimization. Technical report, Center for Operations Research and Econometrics, 2011.
[13] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res., 10:2899–2934, 2009.
[14] E. F. Fama and K. R. French. Multifactor explanations of asset pricing anomalies. Journal of Finance, 51:55–84, 1996.
[15] C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems 22, pages 781–789, 2009.
[16] G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1):365–397, 2012.
[17] M. Ledoux and M. Talagrand. Probability in Banach spaces: Isoperimetry and processes. Springer-Verlag, 1991.
[18] M. Lichman. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, 2013.
[19] Q. Lin, X. Chen, and J. Peña. A smoothing stochastic gradient method for composite optimization. Optimization Methods and Software, 29(6):1281–1301, 2014.
[20] S. Mosci, L. Rosasco, M. Santoro, A. Verri, and S. Villa. Solving structured sparsity regularization with proximal methods. In European Conf. Machine Learning and Principles and Practice of Knowledge Discovery, pages 418–433, 2010.
[21] S. Negahban, B. Yu, M. J. Wainwright, and P. K. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems 22, pages 1348–1356, 2009.
[22] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. on Optimization, 15(1):229–251, 2005.
[23] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems 27, pages 1574–1582, 2014.
[24] H. Raguet, J. Fadili, and G. Peyré. A generalized forward-backward splitting. SIAM Journal on Imaging Sciences, 6(3):1199–1226, 2013.
[25] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951.
[26] L. Rosasco, S. Villa, and B. C. Vũ. Convergence of stochastic proximal gradient algorithm. arXiv:1403.5074v3, 2014.
[27] B. C. Vũ. Almost sure convergence of the forward-backward-forward splitting algorithm. Optimization Letters, 10(4):781–803, 2016.
[28] M. Wang, Y. Chen, J. Liu, and Y. Gu. Random multi-constraint projection: Stochastic gradient methods for convex optimization with many constraints. arXiv:1511.03760v1, 2015.
[29] W. Zhong and J. Kwok. Accelerated stochastic gradient method for composite regularization. J. Mach. Learn. Res., 33:1086–1094, 2014.