# composite_selfconcordant_minimization__6cdb8995.pdf

Journal of Machine Learning Research 16 (2015) 371-416 Submitted 8/13; Revised 4/14; Published 3/15

Composite Self-Concordant Minimization

Quoc Tran-Dinh quoc.trandinh@epfl.ch Anastasios Kyrillidis anastasios.kyrillidis@epfl.ch Volkan Cevher volkan.cevher@epfl.ch Laboratory for Information and Inference Systems (LIONS) Ecole Polytechnique F ed erale de Lausanne (EPFL) CH1015-Lausanne, Switzerland

Editor: Benjamin Recht

We propose a variable metric framework for minimizing the sum of a self-concordant function and a possibly non-smooth convex function, endowed with an easily computable proximal operator. We theoretically establish the convergence of our framework without relying on the usual Lipschitz gradient assumption on the smooth part. An important highlight of our work is a new set of analytic step-size selection and correction procedures based on the structure of the problem. We describe concrete algorithmic instances of our framework for several interesting applications and demonstrate them numerically on both synthetic and real data. Keywords: proximal-gradient/Newton method, composite minimization, self-concordance, sparse convex optimization, graph learning

1. Introduction

The literature on the formulation, analysis, and applications of composite convex minimization is ever expanding due to its broad applications in machine learning, signal processing, and statistics. By composite minimization, we refer to the following optimization problem:

F := min x Rn {F(x) | F(x) := f(x) + g(x)} , (1)

where f and g are both closed and convex, and n is the problem dimension. In the canonical setting of the composite minimization problem (1), the functions f and g are assumed to be smooth and non-smooth, respectively (Nesterov, 2007). Such composite objectives naturally arise, for instance, in maximum a posteriori model estimation, where we regularize a model likelihood function as measured by a data-driven smooth term f with a non-smooth model prior g, which carries some notion of model complexity (e.g., sparsity, low-rankness, etc.). In theory, many convex problem instances of the form (1) have a well-understood structure, and hence high accuracy solutions can be eﬃciently obtained with polynomial time methods, such as interior point methods (IPM) after transforming them into conic quadratic programming or semideﬁnite programming formulations (Ben-Tal and Nemirovski, 2001; Grant et al., 2006; Nesterov and Nemirovski, 1994). In practice, however, the curse-ofdimensionality renders these methods impractical for large-scale problems. Moreover, the

c 2015 Quoc Tran-Dinh, Anastasios Kyrillidis, Volkan Cevher.

Tran-Dinh, Kyrillidis, and Cevher

F: smooth Class Property x, y dom(f), v Rn, 0 µ L < +

FL f(x) f(y) L x y

µ 2 x y 2 + f(x) + f(x)T (y x) f(y)

F2 |ϕ (t)| 2ϕ (t)3/2: ϕ(t) = f(x + tv), t R

F2,ν F2 and supv Rn 2 f(x)T v v 2 x ν

Figure 1: Common structural assumptions on the smooth function f.

presence of a non-smooth term g prevents direct applications of scalable smooth optimization techniques, such as sequential linear or quadratic programming. Fortunately, we can provably trade-oﬀaccuracy with computation by further exploiting the individual structures of f and g. Existing methods invariably rely on two structural assumptions that particularly stand out among many others. First, we often assume that f has Lipschitz continuous gradient (i.e., f FL: cf., Figure 1). Second, we assume that the proximal operator of g (prox H g (y) := arg min x Rn g(x) + (1/2) x y 2 H ) is, in a

user-deﬁned sense, easy to compute for some H 0 (e.g., H is diagonal); i.e., we can computationally aﬀord to apply the proximal operator in an iterative fashion. In this case, g is said to be tractably proximal . On the basis of these structures, we can design algorithms featuring a full spectrum of (nearly) dimension-independent, global convergence rates with well-understood analytical complexity (see Table 1).

Order Method example Main oracle Analytical complexity

1-st [Accelerated]a-[proximal]-gradientb f, prox LIn g [O(ϵ 1/2)] O(ϵ 1)

1+-th Proximal-quasi-Newtonc Hk, f, prox Hk g O(log ϵ 1) or faster

2-nd Proximal-Newtond 2f, f, prox 2f g O(log log ϵ 1)[local]

See (Beck and Teboulle, 2009a)a,b,(Becker and Fadili, 2012)c,(Lee et al., 2012)d,(Nesterov, 2004, 2007)a,b.

Table 1: Taxonomy of [accelerated] [proximal]-gradient methods when f FL or proximal- [quasi]-Newton methods when f FL Fµ to reach an ε-solution (e.g., F(xk) F ϵ).

Unfortunately, existing algorithms have become inseparable with the Lipschitz gradient assumption on f and are still being applied to solve (1) in applications where this assumption does not hold. For instance, when prox H g (y) is not easy to compute, it is still possible to establish convergence albeit slower with smoothing, splitting or primal-dual decomposition techniques (Chambolle and Pock, 2011; Eckstein and Bertsekas, 1992; Nesterov, 2005a,b; Tran-Dinh et al., 2013c). However, when f / FL, the composite problems of the form (1) are not within the full theoretical grasp. In particular, there is no known global convergence rate. One kludge to handle f / FL is to use sequential quadratic approximation of f to reduce the subproblems to the Lipschitz gradient case. For local convergence of these methods, we need strong regularity assumptions on f (i.e., µI 2f(x) LI) near

Composite Self-concordant Minimization

the optimal solution. Attempts at global convergence require a globalization strategy such as line search procedures (cf., Section 1.2). However, neither the strong regularity nor the line search assumptions can be certiﬁed a priori. To this end, we address the following question in this paper: Is it possible to efﬁciently solve non-trivial instances of (1) for non-global Lipschitz continuous gradient f with rigorous global convergence guarantees? The answer is positive (at least for a broad class of functions): We can still cover a full spectrum of global convergence rates with well-characterizable computation and accuracy trade-oﬀs (akin to Table 1 for f FL) for self-concordant f (in particular, self-concordant barriers) (Nemirovskii and Todd, 2008; Nesterov and Nemirovski, 1994):

Deﬁnition 1 (Self-concordant (barrier) functions) A convex function f : Rn R is said to be self-concordant (i.e., f FM ) with parameter M 0, if |ϕ (t)| Mϕ (t)3/2, where ϕ(t) := f(x + tv) for all t R, x dom(f) and v Rn such that x + tv dom(f). When M = 2, the function f is said to be a standard self-concordant, i.e., f F2.1 A standard self-concordant function f F2 is a ν-self-concordant barrier of a given convex set Ωwith parameter ν > 0, i.e., f F2,ν, when ϕ also satisﬁes |ϕ (t)| νϕ (t)1/2 and f(x) + as x Ω, the boundary of Ω.

While there are other deﬁnitions of self-concordant functions and self-concordant barriers (Boyd and Vandenberghe, 2004; Nemirovskii and Todd, 2008; Nesterov and Nemirovski, 1994; Nesterov, 2004), we use Deﬁnition 1 in the sequel, unless otherwise stated.

1.1 Why is the Assumption f F2 Interesting for Composite Minimization?

The assumption f F2 in (1) is quite natural for two reasons. First, several important applications directly feature a self-concordant f, which does not have global Lipschitz continuous gradient. Second, self-concordant composite problems can enable approximate solutions of general constrained convex problems where the constraint set is endowed with a ν-selfconcordant barrier function.2 Both settings clearly beneﬁt from scalable algorithms. Hence, we now highlight three examples below, based on compositions with the log-functions. Keep in mind that this list of examples is not meant to be exhaustive. Log-determinant: The matrix variable function f(Θ) := log det Θ is self-concordant with dom(f) := {Θ Sp | Θ 0}, where Sp is the set of p p symmetric matrices. As a stylized application, consider learning a Gaussian Markov random ﬁeld (GMRF) of p nodes/variables from a data set D := {φ1, φ2, . . . , φm}, where φj D is a p-dimensional random vector with Gaussian distribution N(µ, Σ). Let Θ := Σ 1 be the inverse covariance (or the precision) matrix for the model. To satisfy the conditional dependencies with respect to the GMRF, Θ must have zero in (Θ)ij corresponding to the absence of an edge between node i and node j; cf., (Dempster, 1972).

1. We use this constant for convenience in the derivations since if f FM, then (M 2/4)f F2. 2. Let us consider a constrained convex minimization x C := arg minx C g(x), where the feasible convex set C is endowed with a ν-self-concordant barrier ΨC(x). If we let f(x) := ϵ

ν ΨC(x), then the solution x of the composite minimization problem (1) well-approximates x C as g(x ) g(x C) + ( f(x ) + g(x ))T (x x C) + ϵ. The middle term can be controlled by accuracy at which we solve the composite minimization problem (Nesterov, 2007, 2011).

Tran-Dinh, Kyrillidis, and Cevher

We can learn GMRFs with theoretical guarantees from as few as O(d2 log p) data samples, where d is the graph node degree, via ℓ1-norm regularization formulation (see Ravikumar et al. 2011):

Θ := arg min Θ 0

n log det(Θ) + tr(bΣΘ) | {z } =:f(Θ)

+ ρ vec(Θ) 1 | {z } =:g(Θ)

where ρ > 0 parameter balances a Gaussian model likelihood and the sparsity of the solution, bΣ is the empirical covariance estimate, and vec is the vectorization operator. The formulation also applies for learning models beyond GMRFs, such as the Ising model, since f(Θ) acts also as a Bregman distance (Banerjee et al., 2008). Numerical solution methods for solving problem (2) have been extensively studied, e.g. in (Banerjee et al., 2008; Hsieh et al., 2011; Lee et al., 2012; Lu, 2010; Olsen et al., 2012; Rolfs et al., 2012; Scheinberg and Rish, 2009; Scheinberg et al., 2010; Yuan, 2012). However, none so far exploits f F2,ν and feature global convergence guarantees: cf., Sect. 1.2. Log-barrier for linear inequalities: The function f(x) := log(a T x b) is a selfconcordant barrier with dom(f) := x Rn | a T x > b . As a stylized application, consider the low-light imaging problem in signal processing (Harmany et al., 2012), where the imaging data is collected by counting photons hitting a detector over the time. In this setting, we wish to accurately reconstruct an image in low-light, which leads to noisy measurements due to low photon count levels. We can express our observation model using the Poisson distribution as

P(y|A(x)) =

(a T i x)yi

yi! e a T i x,

where x is the true image, A is a linear operator that projects the scene onto the set of observations, ai is the i-th row of A, and y Zm + is a vector of observed photon counts. Via the log-likelihood formulation, we stumble upon a composite minimization problem:

x := arg min x Rn

i=1 a T i x

i=1 yi log(a T i x)

| {z } =:f(x)

+g(x) o , (3)

where f(x) is self-concordant (but not standard). In the above formulation, the typical image priors g(x) include the ℓ1-norm for sparsity in a known basis, total variation seminorm of the image, and the positivity of the image pixels. While the formulation (3) seems speciﬁc to imaging, it is also common in sparse regression with unknown noise variance (St adler et al., 2012), heteroscedastic LASSO (Dalalyan et al., 2013), barrier approximations of, e.g., the Dantzig selector (Candes and Tao, 2007) and quantum tomography (Banaszek et al., 1999) as well. The current state of the art solver is called SPIRAL-TAP (Harmany et al., 2012), which biases the logarithmic term (i.e., log(a T i x + ε) log(a T i x), where ε 1) and then applies non-monotone composite gradient descent algorithms for FL with a Barzilai-Borwein stepsize as well as other line-search strategies. Logarithm of concave quadratic functions: The function f(x) := log σ2 Ax y 2 2

is self-concordant with dom(f) := x Rn | Ax y 2 2 < σ2 . As a stylized application,

Composite Self-concordant Minimization

we consider the basis pursuit denoising (BPDN) formulation (van den Berg and Friedlander, 2008) as x := arg min x Rn

n g(x) | Ax y 2 2 σ2o . (4)

The BPDN criteria is commonly used in magnetic resonance imaging (MRI) where A is a subsampled Fourier operator, y is the MRI scan data, and σ2 is a known machine noise level (i.e., obtained during a pre-scan). In (4), g is an image prior, e.g., similar to the Poisson imaging problem. Approximate solutions to (4) can be obtained via a barrier formulation

x t := arg min x Rn

n t log σ2 Ax y 2 2

| {z } =:f(x)

+ g(x) o , (5)

where t > 0 is a penalty parameter which controls the quality of the approximation. The BPDN formulation is quite generic and has several other applications in statistical regression, geophysics, and signal processing. Several diﬀerent approaches solve the BPDN problem (4), some of which require projections onto the constraint set, including Douglas-Rachford splitting, proximal methods, and the SPGL1 method (van den Berg and Friedlander, 2008; Combettes and Wajs, 2005).

1.2 Related Work

Our attempt is to brieﬂy describe the work that revolves around (1) with the main assumptions of f FL and the proximal operator of g being computationally tractable. In fact, Douglas-Rachford splitting methods can obtain numerical solutions to (1) when the self-concordant functions are endowed with tractable proximal maps. However, it is computationally easier to calculate the gradient of f F2 than their proximal maps. One of the main approaches in this setting is based on operator splitting. By presenting the optimality condition of problem (1) as an inclusion of two monotone operators, one can apply splitting techniques, such as forward-backward or Douglas-Rachford methods, to solve the resulting monotone inclusion (Briceno-Arias and Combettes, 2011; Facchinei and Pang, 2003; Goldstein and Osher, 2009). In our context, several variants of this approach have been studied. For example, projected gradient or proximal-gradient methods and fast proximal-gradient methods have been considered, see, e.g., (Beck and Teboulle, 2009a; Mine and Fukushima, 1981; Nesterov, 2007). In all these methods, the main assumption required to prove the convergence is the global Lipschitz continuity of the gradient of the smooth function f. Unfortunately, when f / FL but f F2, these theoretical results on the global convergence and the global convergence rates are no longer applicable. Other mainstream approaches for (1) include augmented Lagrangian and alternating techniques: cf., (Boyd et al., 2011; Goldfarb and Ma, 2012). These methods have empirically proven to be quite powerful in speciﬁc applications. The main disadvantage of these methods is the manual tuning of the penalty parameter in the augmented Lagrangian function, which is not yet well-understood for general problems. Consequently, the analysis of global convergence as well as the convergence rate is an issue since the performance of the algorithms strongly depends on the choice of this penalty parameter in practice. Moreover, as indicated in a recent work (Goldstein et al., 2012), alternating direction methods of multipliers as well as alternating linearization methods can be viewed as splitting methods in

Tran-Dinh, Kyrillidis, and Cevher

the convex optimization context. Hence, it is unclear if this line of work is likely to lead to any rigorous guarantees when f F2. An emerging direction for solving composite minimization problems (1) is based on the proximal-Newton method. The origins of this method can be traced back to the work of (Bonnans, 1994), which relies on the concept of strong regularity introduced by (Robinson, 1980) for generalized equations. In the convex case, this method has been studied by several authors such as (Becker and Fadili, 2012; Lee et al., 2012; Schmidt et al., 2011). So far, methods along this line are applied to solve a generic problem of the form (1) even when f F2. The convergence analysis of these methods is encouraged by standard Newton methods and requires the strong regularity of the Hessian of f near the optimal solution (i.e., µI 2f(x) LI). This assumption used in (Lee et al., 2012) is stronger than assuming 2f(x ) to be positive deﬁnite at the solution x as in our approach below. Moreover, the global convergence can only be proved by applying a certain globalization strategy such as line-search (Lee et al., 2012) or trust-region. Unfortunately, none of these assumptions can be veriﬁed before the algorithm execution for the intended applications. By exploiting the self-concordance concept, we can show the global convergence of proximal-Newton methods without any globalization strategy (e.g., line search or trust-region approach).

1.3 Our Contributions

Interior point methods are always an option while solving the self-concordant composite problems (1) numerically by means of disciplined convex programming (Grant et al., 2006; L ofberg, 2004). More concretely, in the IPM setting, we set up an equivalent problem to (1) that typically avoids the non-smooth term g(x) in the objective by lifting the problem dimensions with slack variables and introducing additional constraints. The new constraints may then be embedded into the objective through a barrier function. We then solve a sequence of smooth problems (e.g., with Newton methods) and path-follow 3 to obtain an accurate solution (Nemirovskii and Todd, 2008; Nesterov, 2004). In this loop, many of the underlying structures within the original problem, such as sparsity, can be lost due to pre-conditioning or Newton direction scaling (e.g., Nesterov-Todd scaling, Nesterov and Todd 1997). The eﬃciency and the memory bottlenecks of the overall scheme then heavily depends on the workhorse algorithm that solves the smooth problems. In stark contrast, we introduce an algorithmic framework that directly handles the composite minimization problem (1) without increasing the original problem dimensions. For problems of larger dimensions, this is the main argument in favor of our approach. Instead of solving a sequence of smooth problems, we solve a sequence of non-smooth proximal problems with a variable metric (i.e., our workhorse). Fortunately, these proximal problems feature the composite form (1) with a Lipschitz gradient (and oft-times strongly convex) smooth term. Hence, we leverage the tremendous amount of research (cf., Table 1) done over the last decades. Surprisingly, we can even retain the original problem structures that lead to computational ease in many cases (e.g., see Section 4.1). Our speciﬁc contributions can be summarized as follows:

1. We develop a variable metric framework for minimizing the sum f + g of a selfconcordant function f and a convex, possibly nonsmooth function g. Our approach

3. It is also referred to as a homotopy method.

Composite Self-concordant Minimization

relies on the solution of a convex subproblem obtained by linearizing and regularizing the ﬁrst term f. To achieve monotonic descent, we develop a new set of analytic step-size selection and correction procedures based on the structure of the problem.

2. We establish both the global and the local convergence of diﬀerent variable metric strategies. We ﬁrst derive an expected result: when the variable metric is the Hessian 2f(xk) of f at iteration k, the resulting algorithm locally exhibits quadratic convergence rate within an explicit region. We then show that variable metrics satisfying the Dennis-Mor e-type condition (Dennis and Mor e, 1974) exhibit superlinear convergence.

3. We pay particular attention to diagonal variable metrics as many of the proximal subproblems can be solved exactly (i.e., in closed form). We derive conditions on when these variants achieve locally linear convergence.

4. We apply our algorithms to the aforementioned real-world and synthetic problems to highlight the strengths and the weaknesses of our scheme. For instance, in the graph learning problem (2), our framework can avoid matrix inversions as well as Cholesky decompositions in learning graphs. In Poisson intensity reconstruction (3), up to around 80 acceleration is possible over the state-of-the-art solver.

We highlight three key practical contributions to numerical optimization. First, in the proximal-Newton method, our analytical step-size procedures allow us to do away with any globalization strategy (e.g., line-search). This has a signiﬁcant practical impact when the evaluation of the functions is expensive. We show how to combine the analytical step-size selection with the standard backtracking or forward line-search procedures to enhance the global convergence of our method. Our analytical quadratic convergence characterization helps us adaptively switch from damped step-size to a full step-size. Second, in the proximalgradient method setting, we establish a step-size selection and correction mechanism. The step-size selection procedure can be considered as a predictor, where existing step-size rules that leverage local information can be used. The step-size corrector then adapts the local information of the function to achieve the best theoretical decrease in the objective function. While our procedure does not require any function evaluations, we can further enhance convergence whenever we are allowed function evaluations. Finally, our framework, as we demonstrate in (Tran-Dinh et al., 2014a), accommodates a path-following strategy, which enable us to approximately solve constrained non-smooth convex minimization problems with rigorous guarantees.

Paper outline. In Section 2, we ﬁrst recall some fundamental concepts of convex optimization and self-concordant functions used in this paper. Section 3 presents our algorithmic framework using three diﬀerent instances with convergence results, complexity estimates and modiﬁcations. Section 4 deals with three concrete instances of our algorithmic framework. Section 5 provides numerical experiments to illustrate the impact of the proposed methods. Section 6 concludes the paper.

Tran-Dinh, Kyrillidis, and Cevher

2. Preliminaries

Notation: We reserve lower-case and bold lower-case letters for scalar and vector representation, respectively. Upper-case bold letters denote matrices. We denote Sp + (reps., Sp ++) for the set of symmetric positive deﬁnite (reps., positive semideﬁnite) matrices of size p p. For a proper, lower semicontinuous convex function f from Rn to R {+ }, we denote its domain by dom(f), i.e., dom(f) := {x Rn | f(x) < + } (see, e.g., Rockafellar 1970). Weighted norm and local norm: Given a matrix H Sn ++, we deﬁne the weighted norm x H :=

x T Hx, x Rn; its dual norm is deﬁned as x H := max y H 1 y T x =

x T H 1x. If H is only positive semideﬁnite (i.e., H Sn +), then x H reduces to a semi-norm. Let f F2 and x dom(f) so that 2f(x) is positive deﬁnite. For a given vector v Rn, the local norm around x dom(f) with respect to f is deﬁned as v x := v T 2f(x)v 1/2, while the corresponding dual norm is given by v x = v T 2f(x) 1v 1/2. Subdiﬀerential and subgradient: Given a proper, lower semicontinuous convex function, we deﬁne the subdiﬀerential of g at x dom(g) as

g(x) := v Rn | g(y) g(x) v T (y x), y dom(g) .

If g(x) = then each element in g(x) is called a subgradient of g at x. In particular, if g is diﬀerentiable, we use g(x) to denote its derivative at x dom(g), and g(x) { f(x)}. Proximity operator: A basic tool to handle the nonsmoothness of a convex function g is its proximity operator (or proximal operator) prox H g , whose deﬁnition is given in Section 1. For notational convenience in our derivations, we alter this deﬁnition in the sequel as follows: Let g be a proper lower semicontinuous and convex in Rn and H Sn +. We deﬁne

P g H(u) := arg min x Rn g(x) + (1/2)x T Hx u T x , u Rn, (6)

as the proximity operator for the nonsmooth g, which has the following properties Hiriart Urruty and Lemar echal (2001).

Lemma 2 Assume that H Sn ++. Then, the operator P g H in (6) is single-valued and satisﬁes the following property:

(P g H(u) P g H(v))T (u v) P g H(u) P g H(v) 2 H , (7)

for all u, v Rn. Consequently, P g H is a nonexpansive mapping, i.e., P g H(u) P g H(v) H u v H . (8)

Proof This lemma is already known in the literature, see, e.g., (Rockafellar, 1976). For the sake of completeness, we give a short proof here. The single-valuedness of P g H is obvious due to the strong convexity of the objective function in (6). Let ξu := P g H(u) and ξv := P g H(v). By the deﬁnition of P g H, we have u Hξu g(ξu) and v Hξu g(ξv). Since g is convex, we have (u Hξu (v Hξv))T(ξu ξv) 0. This inequality leads to (u v)T (ξu ξv) (ξu ξv)T H(ξu ξv) = ξu ξv 2 H which is indeed (7). Via the generalized Cauchy-Schwarz inequality, (7) leads to (8).

Composite Self-concordant Minimization

Key self-concordant bounds: Based on (Nesterov, 2004, Theorems 4.1.7 and 4.1.8), for a given standard self-concordant function f, we recall the following inequalities

ω( y x x) + f(x)T (y x) + f(x) f(y), (9)

f(y) f(x) + f(x)T (y x) + ω ( y x x), (10)

where ω : R R+ is deﬁned as ω(t) := t ln(1 + t) and ω : [0, 1] R+ is deﬁned as ω (t) := t ln(1 t). These functions are both nonnegative, strictly convex and increasing. Hence, (9) holds for all x, y dom(f), and (10) holds for all x, y dom(f) such that y x x < 1. In contrast to the global inequalities for the function classes FL and Fµ (cf., Figure 1), the self-concordant inequalities are based on local quantities. Moreover, these bounds are no longer quadratic which prevents naive applications of the methods from FL,µ.

Remark 3 The proof of (9)-(10) is based on the condition 2f(x) 0 for all x dom(f), see (Nesterov, 2004). In this paper, we work with the function f deﬁned by f(x) := ϕ(Ax+ b), where ϕ is a standard self-concordant function such that 2ϕ(u) 0 for all u dom(ϕ). Therefore, we have 2f(x) = AT 2ϕ(Ax + b)A, which is possibly singular without further conditions on matrix A. Consequently, the local norm x deﬁned via 2f(x) reduces to a semi-norm. However, the inequalities (9)-(10) still hold w.r.t.this seminorm. Indeed, since ϕ is standard self-concordant with 2ϕ(u) 0 for all u dom(ϕ), we have ϕ(ˆu) ϕ(u) + ϕ(u)T (ˆu u) + ω ˆu u u . By substituting u = Ax + b dom(ϕ) and ˆu = Aˆx + b dom(ϕ), (x, ˆx dom(f)) into this inequality we obtain f(ˆx) f(x) + f(x)T (ˆx x) + ω( ˆx x x), which is indeed (9). The inequality (10) is proved similarly.

3. Composite Self-Concordant Optimization

In this section, we propose a variable metric optimization framework that rigorously trades oﬀcomputation and accuracy of solutions without transforming (1) into a higher dimension smooth convex optimization problem. We assume theoretically that the proximal subproblems can be solved exactly. However, our theory can be analyze for the inexact case, when we solve these problems up to a suﬃciently high accuracy (typically, it is at least higher than (e.g., 0.1ε) the desired accuracy ε of (1) at the few last iterations), see, e.g., (Tran-Dinh et al., 2013b, 2014a). In our theoretical characterizations, we only rely on the following assumption:

Assumption A.1 The function f is convex and standard self-concordant (see Deﬁnition 1). The function g : Rn R {+ } is proper, closed and convex.

Under Assumption A.1, we have dom(F) = dom(f) dom(g). Unique solvability of (1) and its optimality condition: First, we show that problem (1) is uniquely solvable. The proof of this lemma can be done similarly as (Nesterov, 2004, Theorem 4.1.11) and is provided in Appendix A.1.

Lemma 4 Suppose that the functions f and g of problem (1) satisfy Assumption A.1. If λ(x) := f(x) + v x < 1, for some x dom(F) and v g(x) such that 2f(x) 0, then the solution x of (1) exists and is unique.

Tran-Dinh, Kyrillidis, and Cevher

Since this problem is convex, the following optimality condition is necessary and suﬃcient:

0 f(x ) + g(x ). (11)

The solution x is called strongly regular if 2f(x ) 0. In this case, > σ max σ min > 0, where σ min and σ max are the smallest and the largest eigenvalue of 2f(x ), respectively. Fixed-point characterization: Let H Sn +. We deﬁne SH(x) := Hx f(x). Then, from (11), we have SH(x ) Hx f(x ) Hx + g(x ).

By using the deﬁnition of P g H( ) in (6), one can easily derive the ﬁxed-point expression:

x = P g H (SH(x )) , (12)

that is, x is the ﬁxed-point of the mapping Rg H( ), where Rg H( ) := P g H(SH( )). The formula in (12) suggests that we can generate an iterative sequence based on the ﬁxedpoint principle, i.e., xk+1 := Rg H(xk) starting from x0 dom(F) for k 0. Theoretically, under certain assumptions, one can ensure that the mapping Rg H is contractive and the sequence generated by this scheme is convergent. We note that if g 0 and H Sn ++, then P g H deﬁned by (6) reduces to P g H( ) = H 1( ). Consequently, the ﬁxed-point formula (12) becomes x = x H 1 f(x ), which is equivalent to f(x ) = 0. Our variable metric framework: Given a point xk dom(F) and a symmetric positive semideﬁnite matrix Hk, we consider the function

Q(x; xk, Hk) := f(xk) + f(xk)T (x xk) + 1

2(x xk)T Hk(x xk), (13)

for x dom(F). The function Q( ; xk, Hk) is seemingly a quadratic approximation of f around xk. Now, we study the following scheme to generate a sequence xk

xk+1 := xk + αkdk, (14)

where αk (0, 1] is a step size and dk is a search direction. Let sk be a solution of the following problem:

sk S(xk, Hk) := arg min x dom(F)

n Q(x; xk, Hk) + g(x) o = P g Hk

Hkxk f(xk) . (15)

Since we do not assume that Hk to be positive deﬁnite, the solution sk may not exist. We require the following assumption:

Assumption A.2 The subproblem (15) has at least one solution sk, i.e., S(xk, Hk) = .

In particular, if Hk Sn ++, then the solution sk of (15) exists and is unique, i.e., S(xk, Hk) = sk = . Up to now, we have not required the uniqueness of sk. This assumption will be speciﬁed later in the next sections. Throughout this paper, we assume that both Assumptions A.1 and A.2 are satisﬁed without referring to them speciﬁcally.

Composite Self-concordant Minimization

Now, given sk, the direction dk is computed as

dk := sk xk. (16)

If we deﬁne Gk := Hkdk, then Gk is called the gradient mapping of (1) (Nesterov, 2004), which behaves similarly as gradient vectors in non-composite minimization. Since problem (15) is solvable due to Assumption A.2, we can write its optimality condition as

0 f(xk) + Hk(sk xk) + g(sk). (17)

It is easy to see that if dk = 0, i.e., sk xk, then (17) reduces to 0 f(xk) + g(xk), which is exactly (11). Hence, xk is a solution of (1). In the variable metric framework, depending on the choice of Hk, the iteration scheme (14) leads to diﬀerent methods for solving (1). For instance:

1. If Hk := 2f(xk), then the method (14) is a proximal-Newton method.

2. If Hk is a symmetric positive deﬁnite matrix approximation of 2f(xk), then the method (14) is a proximal-quasi Newton method.

3. If Hk := Lk I, where Lk is, say, an approximation for the local Lipschitz constant of f and I is the identity matrix, then the method (14) is a proximal-gradient method.

Many of these above methods have been studied for (1) when f FL: cf., (Beck and Teboulle, 2009a; Becker and Fadili, 2012; Chouzenoux et al., 2013; Lee et al., 2012). Note however that, since the self-concordant part f of F is not (necessarily) globally Lipschitz continuously diﬀerentiable, these approaches are generally not applicable in theory. Given the search direction dk deﬁned by (16), we deﬁne the following proximal-Newton decrement4 λk and the weighted [semi-]norm βk

λk := dk xk = (dk)T 2f(xk)dk 1/2 and βk := dk Hk. (18)

In the sequel, we study three diﬀerent instances of the variable metric strategy in detail. Since we do not assume 2f(xk) 0, λk = 0 may not imply dk = 0.

Remark 5 If g 0 and 2f(xk) Sn ++, then dk = 2f(xk) 1 f(xk) is the standard Newton direction. In this case, λk deﬁned by (18) reduces to λk f(xk) xk, the Newton decrement deﬁned in (Nesterov, 2004, Chapter 4). Moreover, we have λk λ(xk), as deﬁned in Lemma 4.

3.1 A Proximal-Newton Method

If we choose Hk := 2f(xk), then the method described in (14) is called the proximal Newton algorithm. For notational ease, we redeﬁne sk n := sk and dk n := dk, where the subscript n is used to distinguish proximal Newton related quantities from the other variable

4. This notion is borrowed from standard the Newton decrement deﬁned in (Nesterov, 2004, Chapter 4).

Tran-Dinh, Kyrillidis, and Cevher

metric strategies. Moreover, we use the shorthand notation P g x := P g 2f( x), whenever x

dom(f). Using (15) and (16), sk n and dk n are given by

sk n := P g xk 2f(xk)xk f(xk) , dk n := sk n xk. (19)

Then, the proximal-Newton method generates a sequence xk

k 0 starting from x0 dom(F) according to xk+1 := xk + αkdk n, (20)

where αk (0, 1] is a step size. If αk < 1, then the iteration (20) is called the damped proximal-Newton iteration. If αk = 1, then it is called the full-step proximal-Newton iteration. Global convergence: We ﬁrst show that with an appropriate choice of the step-size αk (0, 1], the iterative sequence xk

k 0 generated by the damped-step proximal Newton scheme (20) is a decreasing sequence; i.e., F(xk+1) F(xk) ω(σ) whenever λk σ, where σ > 0 is ﬁxed. The following theorem provides an explicit formula for the step size αk whose proof can be found in Appendix A.2.

Theorem 6 If αk := (1 + λk) 1 (0, 1], then the scheme in (20) generates xk+1 satisﬁes

F(xk+1) F(xk) ω(λk). (21)

Moreover, the step αk is optimal. The number of iterations to reach the point xk such that λk < σ for some σ (0, 1) is kmax := j F(x0) F(x )

ω(σ) k + 1.

Local quadratic convergence rate: For any x dom(f) such that 2f(x) 0, we deﬁne the Dikin ellipsoid W0(x, r) as W0(x, r) := y dom(f) : y x x < r , see (Nesterov, 2004). We now establish the local quadratic convergence of the scheme (20). A complete proof of this theorem can be found in Appendix A.3.

Theorem 7 Suppose that x is the unique solution of (1) and is strongly regular. Suppose further that 2f(x) 0 for all x W0(x , 1). Let xk

k 0 be a sequence generated by the proximal Newton scheme (20) with αk (0, 1]. Then:

a) If αkλk < 1 1

2, then it holds that

λk+1 1 αk + (2α2 k αk)λk 1 4αkλk + 2α2 kλ2 k

b) If the sequence xk

k 0 is generated by the damped proximal-Newton scheme (20),

starting from x0 such that λ0 σ :=

5 2 0.236068 and αk := (1 + λk) 1, then {λk}k locally converges to 0+ at a quadratic rate.

c) Alternatively, if the sequence xk

k 0 is generated by the full-step proximal-Newton

scheme (20) starting from x0 such that λ0 σ := 0.25(5

17) 0.219224 and αk = 1, then {λk}k locally converges to 0+ at a quadratic rate.

Composite Self-concordant Minimization

Consequently, the sequence xk

k 0 also locally converges to x at a quadratic rate in both cases b) and c), i.e., xk x x

k 0 locally converges to 0+ at a quadratic rate.

A two-phase algorithm for solving (1): Now, by the virtue of the above analysis, we can propose a two-phase proximal-Newton algorithm for solving (1). Initially, we perform the damped-step proximal-Newton iterations until we reach the quadratic convergence region (Phase 1). Then, we perform full-step proximal-Newton iterations, until we reach the desired accuracy (Phase 2). The pseudocode of the algorithm is presented in Algorithm 1.

Algorithm 1 (Proximal-Newton algorithm)

Inputs: x0 dom(F), tolerance ε > 0.

Initialization: Select a constant σ (0, (5

17) 4 ], e.g., σ := 0.2.

for k = 0 to Kmax do

1. Compute the proximal-Newton search direction dk n as in (19). 2. Compute λk := dk n xk. 3. if λk > σ then xk+1 := xk + αkdk n, where αk := (1 + λk) 1. 4. elseif λk > ε then xk+1 := xk + dk n. 5. else terminate. end for

The radius σ of the quadratic convergence region in Algorithm 1 can be ﬁxed at any value in (0, σ], e.g., at its upper bound σ. An upper bound Kmax of the iterations can also be speciﬁed, if necessary. The computational bottleneck in Algorithm 1 is typically incurred Step 1 in Phase 1 and Phase 2, where we need to solve the subproblem (15) to obtain a search direction dk n. When problem (15) is strongly convex, i.e., 2f(xk) Sn ++, one can apply ﬁrst order methods to eﬃciently solve this problem with a linear convergence rate (see, e.g., Beck and Teboulle 2009a; Nesterov 2004, 2007) and make use of a warm-start strategy by employing the information of the previous iterations.

Remark 8 From Remark 3 we see that if f(xk) 0, then λk = 0 may not imply dk = 0. Therefore, we can add an auxiliary stopping criterion βk := dk 2 ε to Algorithm 1 so that we can avoid the termination of Algorithm 1 at a non-optimal point xk.

Iteration-complexity analysis.The choice of σ in Algorithm 1 can trade-oﬀthe number of iterations between the damped-step and full-step iterations. If we ﬁx σ = 0.2, then the complexity of the full-step Newton phase becomes O ln ln 0.28

ε . The following theorem summarizes the complexity of the proposed algorithm.

Theorem 9 The maximum number of iterations required in Algorithm 1 does not exceed Kmax := j F(x0) F(x )

0.017 k + 1.5 ln ln 0.28

ε + 2 provided that σ = 0.2 to obtain λk ε.

Consequently, xk x x 2ε, where x is the unique solution of (1).

Proof Let σ = 0.2. From the estimate (22) of Theorem 7 and αk 1 = 1 we have λk (1 4λk 1 + 2λ2 k 1) 1λ2 k 1 for k 1. Since λ0 σ, by induction, we can easily show

Tran-Dinh, Kyrillidis, and Cevher

that λk (1 4σ + 2σ2) 1λ2 k 1 cλ2 k 1, where c := 3.57. This implies λk c2k 1λ2k 0 c2k 1σ2k. The stopping criterion λk ε in Algorithm 1 is ensured if (cσ)2k cε. Since cσ 0.71 < 1, the last condition leads to k (ln 2) 1 ln ln(cσ)

ln(cε) . By using c = 3.57,

σ = 0.2 and the fact that ln(2) 1 < 1.5, we can show that the last requirement is fulﬁlled if k 1.5 ln ln 0.28

ε + 1. Now, combining the last conclusion and Theorem 6 with noting that ω(σ) > 0.017 we obtain Kmax as in Theorem 9.

Finally, we prove xk x x 2ε. Indeed, we have rk := xk x x xk+1 xk xk 1 xk x x +

xk+1 xk x = λk 1 rk + rk+1, whenever rk < 1. Next, using (84) with αk = 1, we have

rk+1 (3 rk)r2 k 1 4rk+2r2 k . Combining these inequalities, we obtain (1 rk)(1 7rk+3r2 k)rk 1 4rk+2r2 k λk ε.

Since the function s(r) := (1 r)(1 7r+3r2)r

1 4r+2r2 attains a maximum at r 0.08763 and it

is increasing on [0, r ]. Moreover, (1 rk)(1 7rk+3r2 k) 1 4rk+2r2 k 0.5 for rk [0, r ], which leas to

0.5rk (1 rk)(1 7rk+3r2 k)rk 1 4rk+2r2 k ε. Hence, rk 2ε provided that rk r0 r 0.08763.

Remark 10 When g 0, we can modify the proof of estimate (22) to obtain a tighter

bound λk+1 λ2 k (1 λk)2 . This estimate is exactly (Nesterov, 2004), which implies that the

radius of the quadratic convergence region is σ := (3

A modiﬁcation of the proximal-Newton method: In Algorithm 1, if we remove Step 4 and replace analytic step-size selection calculation in Step 3 with a backtracking line-search, then we reach the proximal Newton method of (Lee et al., 2012). Hence, this approach in practice might lead to reduced overall computation since our step-size αk is selected optimally with respect to the worst case problem structures as opposed to the particular instance of the problem. Since the backtracking approach always starts with the full-step, we also do not need to know whether we are within the quadratic convergence region. Moreover, the cost of evaluating the objective at the full-step in certain applications may not be signiﬁcantly worse than the cost of calculating αk or may be dominated by the cost of calculating the Newton direction. In stark contrast to backtracking, our new theory behooves us to propose a new forward line-search procedure as illustrated by Figure 2. The idea is quite simple: we start with the

0 1 ppppppppppppppp

Enhanced backtracking @@ R

Standard backtracking

Forward line-search

Figure 2: Illustration of step-size selection procedures.

optimal step-size αk and increase it towards full-step with a stopping condition based on the objective evaluations. Interestingly, when we analytically calculate the step, we also have

Composite Self-concordant Minimization

access to the side information on whether or not we are within the quadratic convergence region, and hence, we can automatically switch to Step 4 in Algorithm 1. Alternatively, calculation of the analytic step-size can enhance backtracking since the knowledge of αk reduces the backtracking range from (0, 1] to (αk, 1] with the side-information as to when to automatically take the full-step without function evaluation.

3.2 A Proximal Quasi-Newton Scheme

Even if the function f is self-concordant, the numerical evaluation of 2f(x) can be expensive in many applications (e.g., f(x) := Pp j=1 fj(Ajx), with p n). Hence, it is interesting to study proximal quasi-Newton method for solving (1). Our interest in the quasi-Newton methods in this paper is for completeness; we do not provide any algorithmic details or implementations on our quasi-Newton variant. To this end, we need a symmetric positive deﬁnite matrix Hk that approximates 2f(xk) at the iteration k. As a result, our main assumption here is that matrix Hk+1 at the next iteration k + 1 satisﬁes the secant equation:

Hk+1(xk+1 xk) = f(xk+1) f(xk). (23)

For instance, it is well-known that the sequence of matrices {Hk}k 0 updated by the following BFGS formula satisﬁes the secant equation (23) (Nocedal and Wright, 2006):

Hk+1 := Hk + 1 (yk)T zk yk(yk)T 1 (zk)T Hkzk Hkzk(Hkzk)T , (24)

where zk := xk+1 xk and yk := f(xk+1) f(xk). Other methods for updating matrix Hk can be found in (Nocedal and Wright, 2006), which are not listed here. In this subsection, we only analyze the full-step proximal quasi-Newton scheme based on the BFGS updates. The global convergence characterization of the BFGS quasi-Newton method can be obtained using our analysis in the next subsection. To this end, we have the following update equation, where the subscript q is used to distinguish the quasi-Newton method: xk+1 := xk + dk q. (25)

Here we use dk q to stand for the proximal quasi-Newton search direction. Under certain assumptions, one can prove that the sequence xk

k 0 generated by (25) converges to x the unique solution of (1). One of the common assumptions used in quasi Newton methods is the Dennis-Mor e condition, see (Dennis and Mor e, 1974). Adopting the Dennis-Mor e criterion, we impose the following condition in our context:

Hk 2f(x ) (xk+1 xk) x xk+1 xk x = 0. (26)

The Dennis-Mor e condition becomes standard in smooth optimization. Examples can be found, e.g., in (Byrd and Nocedal, 1989; Nocedal and Wright, 2006). Now, we establish the superlinear convergence of the sequence xk

k 0 generated by (25) as follows.

Tran-Dinh, Kyrillidis, and Cevher

Theorem 11 Assume that x is the unique solution of (1) and is strongly regular. Let matrix Hk maintains the secant equation (23) and let xk

k 0 be a sequence generated by scheme (25). Then the following statements hold:

(a) Suppose, in addition, that the sequence of matrices {Hk}k 0 satisﬁes the Dennis-Mor e condition (26) for suﬃciently large k. Then the sequence xk

k 0 converges to the solution x of (1) at a superlinear rate provided that x0 x x < 1.

(b) Suppose that a matrix H0 0 is chosen. Then (yk)T zk > 0 for all k 0 and hence the sequence {Hk}k 0 generated by (24) is symmetric positive deﬁnite and satisﬁes the secant equation (23). Moreover, if the sequence xk

k 0 generated by (25) satisﬁes P k=0 xk x x < + , then this sequence converges to x at a superlinear rate.

The proof of this theorem can be found in Appendix A.3. We note that if the sequence xk

k 0 locally converges to x at a linear rate w.r.t. the local norm at x , i.e. xk+1 x x κ xk x x for some κ (0, 1) and k 0, then the condition P k=0 xk x x < + automatically holds. From (26) we also observe that the matrix Hk is required to well approximate 2f(x ) along the direction dk q, which is not in the whole space.

3.3 A Proximal-Gradient Method

If we choose matrix Hk := Dk, where Dk is a positive diagonal matrix, then the iterative scheme (14) is called the proximal-gradient scheme. In this case, we can write (14) as

xk+1 := xk + αkdk g = (1 αk)xk + αksk g, (27)

where αk (0, 1] is an appropriate step size, dk g is the proximal-gradient search direction and sk g sk as in (15). The following lemma shows how we can choose the step size αk corresponding to Dk such that we obtain a descent direction in the proximal-gradient scheme (27). The proof of this lemma can be found in Appendix A.2.

Lemma 12 Let xk

k 0 be a sequence generated by (27). Suppose that the matrix Dk 0

is chosen such that the step size αk satisﬁes αk := β2 k λk(λk+β2 k) (0, 1] (see below), where

βk := dk g Dk and λk := dk g xk. Then xk

k 0 dom(F) and the following estimate holds F(xk+1) F(xk) ω β2 k/λk , (28)

where ω(τ) := τ ln(1 + τ) 0.

From Lemma 12, we observe that αk 1 if λ2 k β2 k + λk 1. It is obvious that if λk 1 then the last condition is automatically satisﬁed. We only consider the case λk < 1. In fact, since λk 0, we relax actually the condition λ2 k β2 k + λk 1 to a simpler condition λk βk.

Composite Self-concordant Minimization

Algorithm 2 (Proximal-gradient method)

Inputs: x0 dom(F), tolerance ε > 0.

for k = 0 to kmax do

1. Choose an appropriate Dk 0 based on (30). 2. Compute dk g := Pg Dk Dkxk f(xk) xk due to (15). 3. Compute βk := dk g Dk and λk := dk g xk. 4. If ek := dk g 2 ε then terminate.

5. Update xk+1 := xk + αkdk g, where αk := β2 k λk(λk+β2 k) (0, 1]. end for

We now study the case Dk := Lk I, where Lk L > 0 is a positive constant and I is the identity matrix with dimensions apparent from the context. Hence, β2 k = Lk dk g 2 2 and

λ2 k β2 k = (dk g)T 2f(xk)dk g Lk dkg 2 2 .

However, since

σmin( 2f(xk)) σk := (dk g)T 2f(xk)dk g dkg 2 2 σmax( 2f(xk)), (29)

the condition λk βk is equivalent to

Lk σk, (30)

where σk min := σmin( 2f(xk)) and σk max := σmax( 2f(xk)) are the smallest and largest eigenvalue of 2f(xk), respectively. Under the assumption that dom(f) contains no straightline, then we have the Hessian 2f(xk) 0 by (Nesterov, 2004, Theorem 4.1.3), which implies that σk min > 0. Therefore, in the worst-case, we can choose Lk := σk min. However, this lower bound may be too conservative. In practice, we can apply a bisection procedure to meet the condition (30). It is not diﬃcult to prove via contradiction that the number of bisection steps is upper bounded by a constant. We note that if g is separable, i.e., g(x) := Pn i=1 gi(xi) (e.g., g(x) := ρ x 1), then we can compute sk Dk in (15) in a component-wise fashion as

(sk Lk)i := Pgi τ k i

xk i τ k i ( f(xk))i , i = 1, . . . , n, (31)

where τ k i := 1/(Dk)ii and Pgi τi ( ) is the proximity operator of gi function, with parameter τi. The computation of λk only requires one matrix-vector multiplication and one vector inner-product; but it can be reduced by exploiting concrete structure of the smooth part f. Based on Lemma 12, we describe the proximal-gradient scheme (27) in Algorithm 2. The main computation cost of Algorithm 2 is incurred at Step 2 and in calculating λk. If g is separable, then the computation of Step 2 can be done in a closed form. One main

Tran-Dinh, Kyrillidis, and Cevher

step of Algorithm 2 is Step 2, which depends on the cost of prox-operator Pg Dk. In practice, Dk is determined by a bisection procedure whenever λk < 1, which requires additional computational cost. If we choose Dk := Lk I, then in order to fulﬁll (30), we can perform a back-tracking line search procedure on Lk. This line search procedure does not require the evaluations of the objective function. We modify Steps 1-3 of Algorithm 2 as

1. Initialize Lk := L0 k > 0, e.g., by a Barzilai-Borwein step.

2. Compute dk g := Pg Lk Ik Lkxk f(xk) xk due to (15).

3a. Compute βk := dk g Lk I and λk := dk g xk.

3b. If λ2 k/β2 k + λk < 1, then set Lk := Lk/2 and go back to Step 2.

We note that computing λk at Step 3 does not need to form the full Hessian 2f(xk), it only requires a directional derivative, which is relatively cheap in applications (Nocedal and Wright, 2006, Chapter 7). Global and local convergence. The global and local convergence of Algorithm 2 is stated in the following theorems, whose proof can be found in Appendix A.2.

Theorem 13 Assume that there exists L > 0 such that Dk LI for k 0, and the solution x of (1) is unique. Let the sublevel set

LF (F(x0)) := x dom(F) | F(x) F(x0)

be bounded. Then, the sequence xk

k 0, generated by Algorithm 2, converges to the unique solution x of (1).

Theorem 14 Assume that x is the unique solution of (1) and is strongly regular. Let xk

k 0 be the sequence generated by Algorithm 2. Then, for k suﬃciently large, if

[Dk 2f(x )]dk g x dkg x < 1

k 0 locally converges to x at a linear rate. In particular, if Dk := Lk I and

γ := max n 1 Lk σ min

, 1 Lk σ max

2, then the condition (32) holds.

We note that x is unknown; thus, evaluating γ a priori is infeasible in reality. In implementation, one can choose an appropriate value Lk L > 0 and then adaptively update Lk based on the knowledge of the eigenvalues of 2f(xk) near to the solution x . The condition (32) can be expressed as (dk g)T [L2 k 2f(x ) 1+ 2f(x ) 2Lk I]dk g (1/4) dk g 2 x , which leads to

(3/4) dk g 2 x + L2[ dk g x ]2 < 2Lk dk g 2 2. (33)

We note that to ﬁnd Lk such that (33) holds, we require dk g x dk g x < q

4 3 dk g 2 2. If the last condition in Theorem 14 is satisﬁed then the condition (33) also holds. While the

Composite Self-concordant Minimization

last condition in Theorem 14 seems too imposing, we claim that, for most f and g, we only require (33) to be satisﬁed (see also the empirical evidence in Subsection 5.2.1). The condition (32) (or (33)) can be referred to as a restricted approximation gap between Dk and the true Hessian 2f(x ) along the direction dk g for k suﬃciently large. For instance, when g is based on the ℓ1-norm/the nuclear norm, the search direction dk g have at most twice the sparsity/rank of x near the convergence region.

Remark 15 From the scheme (27) we observe that the step size αk < 1 may not preserve some of the desiderata on xk+1 due to the closed form solution of the prox-operator Pg Dk. For instance, when g is based on the ℓ1-norm, αk < 1, might increase the sparsity level of the solution as opposed to monotonically increasing it. However, in practice, the numerical values of αk are often 1 near the convergence, which maintain properties, such as sparsity, low-rankedness, etc.

Global convergence rate: In proximal gradient methods, proving global convergence rate guarantees requires a global constant to be known a priori such as the Lipschitz constant. However such an assumption does not apply for the class of just self-concordant functions that we consider in this paper. We only characterize the following property in an ergodic sense. Let xk

k 0 be the sequence generated by (2). We deﬁne

xk := S 1 k

j=0 αjxj, where Sk :=

j=0 αj > 0. (34)

Then we can show that F( xk) F Lk 2Sk x0 x 2 2, where Lk := max 0 j k Lj. If αj α > 0

for 0 j k, then Sk α(k + 1), which leads to F( xk) F Lk 2(k+1)α

x0 x 2 2. The proof of this statement can be found in (Tran-Dinh et al., 2014b), which we omit here. A modiﬁcation of the proximal-gradient method: If the point sk g generated by (15) belongs to dom(F), then F(sk g) < + . Similarly to the deﬁnition of xk+1 in (27), we can deﬁne a new trial point: ˆxk := (1 αk)xk + αksk g. (35)

If F(sk g) F(xk), then, by the convexity of F, it is easy to show that:

F(ˆxk) = F (1 αk)xk + αksk g (1 αk)F(xk) + αk F(sk g) F(sk g) F(xk) F(xk).

In this case, based on the function values F(sk g), F(ˆxk) and F(xk) we can eventually choose the next iteration xk+1 as follows:

xk+1 := sk g if sk dom(F) and F(sk g) < F(ˆxk) (Case 1), ˆxk otherwise (Case 2). (36)

The idea of this greedy modiﬁcation is illustrated in Figure 3. We note that here we need to check sk g dom(F) such that F(sk g) < F(xk) and additional function evaluations F(sk g) and F(ˆxk). However, careful implementations can recycle quantities that enable us to evaluate the objective at sk g and at xk+1 with very little overhead over the calculation of αk (see Section 4). By using (36), we can specify a modiﬁed proximal gradient algorithm for solving (1), whose details we omit here since it is quite similar to Algorithm 2.

Tran-Dinh, Kyrillidis, and Cevher

0 x xk sk g sk g x ˆxk

F(sk g) F(ˆxk)

F(x) Q( ; xk, Hk)

ˆxk :=(1 αk)xk+αksk g

ˆxk := sk g Case 1

Figure 3: Illustration of the modiﬁed proximal-gradient method

4. Concrete Instances of our Optimization Framework

We illustrate three instances of our framework for some of the applications described in Section 1. For concreteness, we describe only the ﬁrst and second order methods. Quasi Newton methods based on (L-)BFGS updates or other adaptive variable metrics can be similarly derived in a straightforward fashion.

4.1 Graphical Model Selection

We customize our optimization framework to solve the graph selection problem (2). For notational convenience, we maintain a matrix variable Θ instead of vectorizing it. We observe that f(Θ) := log(det(Θ)) + tr( ˆΣΘ) is a standard self-concordant function, while g(Θ) := ρ vec(Θ) 1 is convex and nonsmooth. The gradient and the Hessian of f can be computed explicitly as f(Θ) := ˆΣ Θ 1 and 2f(Θ) := Θ 1 Θ 1, respectively. Next, we formulate our proposed framework to construct two algorithmic variants for (2).

4.1.1 Dual Proximal-Newton Algorithm

We consider a second order algorithm via a dual solution approach for (15). This approach is ﬁrst introduced in our earlier work (Tran-Dinh et al., 2013a), which did not consider the new modiﬁcations we propose in Section 3.1. We begin by deriving the following dual formulation of the convex subproblem (15). Let pk := f(xk), the convex subproblem (15) can then be written equivalently as

n (1/2)x T Hkx + (pk Hkxk)T x + g(x) o . (37)

By using the min-max principle, we can write (37) as

max u Rn min x Rn

n (1/2)x T Hkx + (pk Hkxk)T x + u T x g (u) o , (38)

where g is the Fenchel conjugate function of g, i.e., g (u) := sup x

u T x g(x) . Solving

the inner minimization in (38) we obtain

min u Rn (1/2)u T H 1 k u + p T k u + g (u) , (39)

Composite Self-concordant Minimization

where pk := H 1 k pk xk. Note that the objective function ϕ(u) := g (u)+(1/2)u T H 1 k u+ p T k u of (39) is strongly convex, one can apply the fast projected gradient methods with a linear convergence rate for solving this problem, see (Nesterov, 2007; Beck and Teboulle, 2009a). In order to recover the solution of the primal subproblem (15), we note that the solution of the parametric minimization problem in (38) is given by x (u) := xk H 1 k (pk +u). Let u xk be the optimal solution of (39). We can recover the primal proximal-Newton search direction dk deﬁned in (16) as

dk n = 2f(xk) 1 f(xk) + u xk . (40)

To compute the quantity λk deﬁned by (18) in Algorithm 1, we use (40) such that:

λk = dk n xk = f(xk) + u xk xk. (41)

Note that computing λk by (41) requires the inverse of the Hessian matrix 2f(xk). Surprisingly, this dual approach allows us to avoid matrix inversion as well as Cholesky decomposition in computing the gradient f(Θi) and the Hessian 2f(Θi) of f in graph selection. An alternative is of course to solve (15) in its primal form. Though, in such case, we need to compute Θ 1 i at each iteration i (say, via Cholesky decompositions). The dual subproblem (39) becomes as

U = arg min vec(U) 1

n (1/2)tr((Θi U)2) + tr( e QU) o , (42)

for the graph selection, where e Q := ρ 1[Θi bΣΘi 2Θi]. Given the dual solution U of (42), the primal proximal-Newton search direction (i.e. the solution of (15)) is computed as

i := (Θi bΣ I)Θi + ρΘi U Θi . (43)

The quantity λi deﬁned in (41) can be computed as follows, where Wi := Θi(bΣ + ρU ):

λi := p 2 tr (Wi) + tr W2 i 1/2 . (44)

Algorithm 3 summarizes the description above. Overall, this proximal-Newton (PN) algorithm does not require any matrix inversions or Cholesky decompositions. It only needs matrix-vector and matrix-matrix calculations, which might be attractive for diﬀerent computational platforms (such as GPUs or simple parallel implementations). Note however that as we work through the dual problem, the primal solution can be dense even if majority of the entries are rather small (e.g., smaller than 10 6).5

We now explain the underlying costs of each step in Algorithm 3, which is useful when we consider diﬀerent strategies for the selection of the step size αk. The computation of e Q and i require basic matrix multiplications. For the computation of λi, we require two trace operations: tr(Wi) in O(p) time-complexity and tr(W2 i ) in O(p2) complexity. We note here

5. In our MATLAB implementation below, we have not exploited the fact that the primal solutions are sparse. The overall eﬃciency can be improved via thresholding tricks, both in terms of time-complexity (e.g., less number of iterations) and matrix estimation quality.

Tran-Dinh, Kyrillidis, and Cevher

Algorithm 3 (Dual PN for graph selection (DPNGS))

Input: Matrix bΣ 0 and a given tolerance ε > 0. Set σ := 0.25(5

17). Initialization: Find a starting point Θ0 0. for i = 0 to imax do

1. Set e Q := ρ 1 Θi bΣΘi 2Θi . 2. Compute U in (42). 3. Compute λi by (44), where Wi :=Θi(bΣ+ρU ). 4. If λi ε terminate. 5. Compute i := (Θi bΣ I)Θi + ρΘi U Θi .

6. If λi > σ, then set αi := (1 + λi) 1. Otherwise, set αi = 1. 7. Update Θi+1 := Θi + αi i. end for

that, while Wi is a dense matrix, the trace operation in the latter case requires only the computation of the diagonal elements of W2 i . Given Θi, αi and i, the calculation of Θi+1 has O(p2) complexity. In contrast, evaluation of the objective can be achieved through Cholesky decompositions, which has O(p3) time complexity. To compute (42), we can use the fast proximal-gradient method (FPGM) (Nesterov, 2007; Beck and Teboulle, 2009a) with step size 1/L where L is the Lipschitz constant of the gradient of the objective function in (42). It is easy to observe that L := γ2 max(Θi) where γmax(Θi) is the largest eigenvalue of Θi. For sparse Θi, we can approximately compute γmax(Θi) is O(p2) by using iterative power methods (typically, 10 iterations suﬃce). The projection onto vec(U) 1 clips the elements by unity in O(p2) time. Since FPGM requires a constant number of iterations kmax (independent of p) to achieve an εin solution accuracy, the time-complexity for the solution in (42) is O(kmax M), where M is the cost of matrix multiplication. We have also implemented block coordinate descent and active set methods which scale O(p2) in practice when the solution is quite sparse. Overall, the major operation with general proximal maps in the algorithm is typically the matrix-matrix multiplications of the form Θi UΘi, where Θi and U are symmetric positive deﬁnite. This operation can naturally be computed (e.g., in a GPU) in a parallel or distributed manner. For more details of such computations we refer the reader to (Bertsekas and Tsitsiklis, 1989). It is important to note that without Cholesky decompositions used in objective evaluations, the basic DPNGS approach theoretically scales with the cost of matrix-matrix multiplications.

4.1.2 Proximal-Gradient Algorithm

Since g(Θ) := ρ vec(Θ) 1 and f(Θi) = vec(bΣ Θ 1 i ), the subproblem (15) becomes:

i+1 := Tτiρ Θi τi(bΣ Θ 1 i ) Θi, (45)

where Tτ : Rp p Rp p is the component-wise matrix thresholding operator which is deﬁned as Tτ(Θ) := max {0, |Θ| τ}. We also note that the computation of i+1 requires a matrix inversion Θ 1 i . Since Θi is positive deﬁnite, one can apply Cholesky decompositions to compute Θ 1 i in O(p3) operations. To compute the quantity λi, we have λi := i Θi =

Composite Self-concordant Minimization

Θ 1 i i 2. We also choose Li := 0.5 2f(Θi) 2 = 0.5 Θ 1 i 2 2. The above are summarized in Algorithm 4.

Algorithm 4 (Proximal-gradient method for graph selection (Prox Grad1))

Initialization: Choose a starting point Θ0 0 . for i = 0 to imax do

1. Compute Θ 1 i via Cholesky decomposition. 2. Choose Li satisfying (30) and set τi := L 1 i . 3. Compute the search direction i as (45). 4. Compute βi := Li vec( i) 2 and λi := Θ 1 i i 2. 5. Determine the step size αi := βi λi(λi+βi). 6. Update Θi+1 := Θi + αi i. end for

The per iteration complexity is dominated by matrix-matrix multiplications and Cholesky decompositions for matrix inversion calculations. In particular, Step 1 requires a Cholesky decomposition with O(p3) time-complexity. Step 2 requires to compute ℓ2-norm of a symmetric positive matrix, which can be done by a power-method in O(p2) time-complexity. The complexity of Steps 3, 4 and 6 requires O(p2) operations. Step 2 may require additional bisection steps as mentioned in Algorithm 2 whenever λk < 1.

4.2 Poisson Intensity Reconstruction

We now describe a variant of Algorithm 2; a similar instance based on Algorithm 1 can be easily devised and we omit the details here. First, we can easily check that the function f(x) := Pm i=1 a T i x yi log(a T i x) in (3) is convex and self-concordant with parameter

M f := 2 max n 1 yi | yi > 0, i = 1, . . . , m o , see (Nesterov, 2004, Theorem 4.1.1). We deﬁne the functions f and g as

f(x) := M2 f 4 f(x), g(x) := M2 f 4 ρφ(x) + δ{u | u 0}(x) , (46)

where f and g satisfy Assumption 1 and δC is the indicator function of C. Thus, the problem in (3) can be equivalently transformed into (1). Here, the gradient and the Hessian of f satisfy:

f(x) = M2 f 4

1 yi a T i x

ai and 2f(x) = M2 f 4

yi (a T i x)2 aia T i , (47)

respectively. For a given vector d Rn, the local norm d x can then be written as

d x := d T 2f(x)d 1/2 = M f

yi(a T i d)2

!1/2 . (48)

Computing this quantity requires one matrix-vector multiplication and O(m) operations.

Tran-Dinh, Kyrillidis, and Cevher

For the Poisson model, the subproblem (15) is expressed as follows:

n (1/2) x wk 2 2 + ρkφ(x) o , (49)

where wk := xk L 1 k f(xk) and ρk := ρM2 f 4Lk . As a penalty function φ in the Poisson intensity reconstruction, we use the Total Variation norm (TV-norm), deﬁned as φ(x) := Dx 1 (isotropic) or φ(x) := Dx 1,2 (anti-isotropic), where D is a forward linear operator (Chambolle and Pock, 2011; Beck and Teboulle, 2009b). For both TV-norm regularizers, the method proposed in (Beck and Teboulle, 2009b) can solve (49) eﬃciently. The above discussion leads to Algorithm 5. We note that the constant Lk at Step 2 of this algorithm can be estimated based on diﬀerent rules. In our implementation below, we initialize Lk at a Barzilai-Borwein step size, i.e., Lk := ( f(xk) f(xk 1))T (xk xk 1)

xk xk 1 2 2 and may perform a few backtracking iterations on Lk to ensure the condition (30) whenever λk < 1.

Algorithm 5 (Prox Grad for Poisson intensity reconstruction (Prox Grad2))

Inputs: x0 0, ε > 0 and ρ > 0. Compute M f := 2 max n 1 yi | yi > 0, i = 1, . . . , m o . for k = 0 to kmax do

1. Evaluate the gradient of f as (47). 2. Compute an appropriate value Lk > 0 that satisﬁes (30). 3. Compute ρk := 0.25ρM2 f L 1 k and wk := xk L 1 k f(xk).

4. Compute sk g by solving (49) and then compute dk g := sk g xk. 5. Compute βk := Lk dk g 2 2 and λk := dk g xk as (48). 6. If ek := L 1 k βk ε then terminate. 7. Determine the step size αk := βk λk(λk+βk).

8. Update xk+1 := xk + αkdk g. end for

Note that we can modify Step 8 in Algorithm 5 by using the update scheme (36) to obtain a new variant of this algorithm. We omit the details here.

4.3 Heteroscedastic LASSO

We focus on a convex formulation of the unconstrained LASSO problem with unknown variance studied in (St adler et al., 2012) as

(β , σ ) := arg min β Rp,σ R++

n log(σ) + (1/(2n)) Xβ σy 2 2 + ρ β 1 o . (50)

However, our algorithm can be applied to solve the multiple unknown variance case considered in (Dalalyan et al., 2013). By letting x := (βT , σ)T Rp+1, f(x) := log(σ) + (1/(2n)) Xβ σy 2 2. Then, it is easy to see that the function f is standard self-concordant. Hence, we can apply Algorithm 2 to solve this problem. To highlight the salient diﬀerences in the code, we note the following:

Composite Self-concordant Minimization

Deﬁne z := Xβ σy, then the gradient vector of function f can be computed as

f(x) := n 1z T X, σ 1 n 1y T z T .

This computation requires two matrix-vector multiplications and one inner product.

The quantity λk can be explicitly computed as

λk := σ 2 k + n 1y T y (dk σ)2 + n 1z T k zk 2n 1dk σy T zk 1/2 ,

where zk := Xdk β and dk g := ((dk β)T , dk σ)T is the search direction. This quantity requires one matrix-vector multiplication and two inner products. Moreover, this matrix-vector product can be reused to compute the gradient for the next iteration.

The ﬁnal algorithm is very similar to Algorithm 5 and hence we omit the details.

5. Numerical Experiments

In this section, we illustrate our optimization framework via numerical experiments on the variants discussed in Section 4. We only focus on proximal gradient and Newton variants and encourage the interested reader to try out the quasi-Newton variants for their own applications. All the tests are performed in MATLAB 2011b running on a PC Intel Xeon X5690 at 3.47GHz per core with 94Gb RAM.6

5.1 Proximal-Newton Method in Action

By using the graph selection problem, we ﬁrst show that the modiﬁcations on the proximal Newton method provides advantages in practical convergence as compared to state-of-theart strategies and provides a safeguard for line-search procedures in optimization routines. We then highlight the impact of diﬀerent subsolvers for (37) in the practical convergence of the algorithms.

5.1.1 Comparison of Different Step-Size Selection Procedures

We apply four diﬀerent step-size selection procedures in our proximal-Newton framework to solve problem (2). Speciﬁcally, we test the algorithm based on the following conﬁguration:

(i) We implement Algorithm 3 in MATLAB using FISTA (Beck and Teboulle, 2009a) to solve the dual subproblem with the following stopping criterion: Θi+1 Θi F 10 8 max { Θi+1 F , 1}.

(ii) We consider four diﬀerent globalization procedures, whose details can be found in Section 3.1: a) No LS which uses the analytic step size α k = (1+λk) 1, b) Btk LS which is an instance of the proximal-Newton framework of (Lee et al., 2012) and uses the standard backtracking line-search based on Armijo s rule, c) E-Btk LS which is based on the standard backtracking line-search enhanced by the lower bound α k and, d)

6. We also provide MATLAB implementations of the examples in this section as a software package (SCOPT) at http://lions.epﬂ.ch/software.

Tran-Dinh, Kyrillidis, and Cevher

Fw LS as the forward line-search by starting from α k and increasing the step size until either αk = 1, infeasibility or the objective value does not improve.

(iii) We test our implementation on four problem cases: The ﬁrst problem is a synthetic examples of size p = 10, where the data is generated as in (Kyrillidis and Cevher, 2013). We run this test for 10 times and report computational primitives in average. Three remaining problems are based on real data from http://ima.umn.edu/~maxxa007/send_SICS/, where the regularization parameters are chosen as the standard values (cf., Tran-Dinh et al. (2013a); Lee et al. (2012); Hsieh et al. (2011)). We terminate the proximal Newton scheme if λk 10 6.

The numerical results are summarized in Table 2. Here, #iter denotes the (average) number of iterations, #chol represents the (average) number of Cholesky decompositions and #Mm is the (average) number of matrix-matrix multiplications.

Synthetic (ρ = 0.01) Arabidopsis (ρ = 0.5) Leukemia (ρ = 0.1) Hereditary (ρ = 0.1)

LS Scheme #iter #chol #Mm #iter #chol #Mm #iter #chol #Mm #iter #chol #Mm

No LS 25.4 - 3400 18 - 1810 44 - 9842 72 - 20960 Btk LS 25.5 37.0 2436 11 25 718 15 50 1282 19 63 2006 E-Btk LS 25.5 36.2 2436 11 24 718 15 49 1282 15 51 1282 Fw LS 18.1 26.2 1632 10 17 612 12 34 844 14 44 1126

Table 2: Metadata for the line search strategy comparison

We can see that our new step-size selection procedure Fw LS shows superior empirical performance as compared to the rest: The standard approach No LS usually starts with pessimistic step-sizes which are designed for worst-case problem structures. Therefore, we ﬁnd it advantageous to continue with a forward line-search procedure. Whenever it reaches the quadratic convergence, no Cholesky decompositions are required. This makes a diﬀerence, compared to standard backtracking line-search Btk LS where we need to evaluate the objective value at every iteration. While there is no free lunch, the cost of computing λk is O(p2) in Fw LS, which turns out to be quite cheap in this application. The E-Btk LS combines both backtrack line-search and our analytic step-size α k := (1 + λk) 1, which outperforms Btk LS as the regularization parameter becomes smaller. Finally, we note that the No LS variant needs more iterations but it does not require any Cholesky decompositions, which might be advantageous in homogeneous computational platforms.

5.1.2 Impact of Different Solvers for the Subproblems

As mentioned in the introduction, an important step in our second order algorithmic framework is the solution of the subproblem (15). If the variable matrix Hk is not diagonal, computing sk Hk corresponds to solving a convex subproblem. For a given regularization term g, we can exploit diﬀerent existing approaches to tackle this problem. We illustrate that the overall framework to be quite robust against the solution accuracy of the individual subsolver. In this test, we consider the broad used ℓ1-norm function as the regularizer. Hence, (15) collapses to an unconstrained LASSO problem; cf. (Wright et al., 2009). To this end, we implement the proximal-Newton algorithm to solve the graph learning problem (2) where

Composite Self-concordant Minimization

Estrogen (p = 692) Arabidopsis (p = 834) Leukemia (p = 1255) Hereditary(p = 1869)

Sub-solvers #iter #chol time[s] #iter #chol time[s] #iter #chol time[s] #iter #chol time[s]

#nnz = 0.022p2 #nnz = 0.030p2 #nnz = 0.022p2 #nnz = 0.020p2

p FISTA 9 29 13.10 10 35 24.76 9 31 286.57 17 80 1608.66 p FISTA[gpu] 9 29 10.70 10 35 16.81 9 31 231.97 17 80 1265.97 d FISTA 8 16 4.66 10 17 10.92 14 22 50.19 14 27 147.86 d FISTA[gpu] 8 16 4.16 10 17 7.89 14 22 43.53 14 27 120.16 Fast AS 7 24 28.69 8 27 96.93 9 31 532.11 11 40 1682.28 BCDC 8 25 90.35 9 28 227.27 9 31 549.80 12 47 3452.82 Mat QUIC 11 29 21.61 10 35 50.67 10 35 119.06 14 44 891.29 Prox Grad1 175 175 8.82 226 226 17.78 230 230 44.06 660 660 350.52

#nnz = 0.072p2 ( 6%) #nnz = 0.074p2 #nnz = 0.065p2 #nnz = 0.063p2

p FISTA 34 101 357.25 57 148 1056.90 143 242 7490.27 - - - p FISTA[gpu] 34 101 300.90 57 148 730.07 143 242 6083.06 - - - d FISTA 14 32 12.51 12 35 15.53 12 34 38.73 14 44 150.03 d FISTA[gpu] 14 32 11.18 12 35 11.18 12 34 33.45 14 44 121.37 Fast AS - - - - - - - - - - - - BCDC 13 48 1839.17 15 50 4806.62 - - - - - - Mat QUIC 30 88 573.87 36 95 1255.13 36 95 4260.97 - - - Prox Grad1 4345 4345 224.95 6640 6640 532.77 9225 9225 1797.49 - - -

Table 3: Metadata for the subsolver eﬃciency comparison

g(x) := ρ x 1. To show the impact of the subsolver in (2), we implement the following methods, which are all available in our software package SCOPT:

(i) p FISTA and d FISTA: in these cases, we use the FISTA algorithm (Beck and Teboulle, 2009a) for solving the primal (37) and the dual subproblem (39). Moreover, to speedup the computations, we further run these methods on the GPU [NVIDIA Quadro 4000].

(ii) Fast AS: this method corresponds to the exact implementation of the fast active-set method proposed in (Kim and Park, 2010) for solving the primal-dual (37).

(iii) BCDC: here, we consider the block-coordinate descent method implemented in (Hsieh et al., 2011) for solving the primal subproblem (37).

We also compare the above variants of the proximal-Newton approach with (i) the proximalgradient method (Algorithm 4) denoted by Prox Grad1 and (ii) a precise MATLAB implementation of QUIC (Mat QUIC), as described in (Hsieh et al., 2011). For the proximal-Newton and Mat QUIC approaches, we terminate the execution if the maximum number of iterations exceeds 200 or the total execution time exceeds the 5 hours. The maximum number of iterations in Prox Grad1 is set to 104. The results are reported in Table 3. Overall, we observe that d FISTA shows superior performance across the board in terms of computational time and the total number of Cholesky decompositions required. Here, #nnz represents the number of nonzero entries in the ﬁnal solution. The notation indicates that the algorithms exceed either the maximum number of iterations or the time limit (5 hours).

Tran-Dinh, Kyrillidis, and Cevher

If the parameter ρ is relatively large (i.e., the solution is expected to be quite sparse), Fast AS, BCDC and Mat QUIC perform well and converge in a reasonable time. This is expected since all three approaches vastly rely on the sparsity of the solution: the sparser the solution is, the faster their computations are performed, as restricted on the active set of variables. However, when ρ is small, the performance of these methods signiﬁcantly degrade due to the increased number of active (non-zero) entries. Aside from the above, Prox Grad1 performs well in terms of computational time, as compared to the rest of the methods. Unfortunately, the number of Cholesky decompositions in this method can become as many as the number of iterations, which indicates a computational bottleneck in high-dimensional problem cases. Moreover, when ρ is small, this method also slows down and requires more iterations to converge. On the other hand, we also note that p FISTA is rather sensitive to the accuracy of the subsolver within the quadratic convergence region. In fact, while p FISTA reaches medium scale accuracies in a manner similar to d FISTA, it spends most of its iterations trying to achieve the higher accuracy values.

5.2 Proximal-Gradient Algorithm in Action

In this subsection, we illustrate the performance of proximal gradient algorithm in practice on various problems with diﬀerent regularizers.

5.2.1 Linear Convergence

To show the linear convergence of Prox Grad1 (Algorithm 2) in practice, we consider the following numerical test. Our experiment is based on the Lymph and Estrogen problems downloaded from http://ima.umn.edu/~maxxa007/send_SICS/. For both problem cases, we use diﬀerent values for ρ as ρ = [0.1 : 0.05 : 0.6] in MATLAB notation. For each conﬁguration, we measure the quantity:

(Dk 2f(x ))dk g x dkg x , (51)

for few last iterations. This quantity can be referred to as the restricted approximation gap of Dk to 2f(x ) along the proximal-gradient direction dk g. We ﬁrst run the proximal Newton method up to 10 16 accuracy to obtain the solution x and then run the proximalgradient algorithm up to 10 8 accuracy to compute ck res and the norm xk x x . From the proof of Theorem 14, we can show that if ck res < 0.5 for suﬃciently large k, then the sequence xk

k 0 locally converges to x at a linear rate. We note that this condition is much weaker than the last condition given in Theorem 14 but more diﬃcult to interpret. Note that the requirement in Theorem 14 leads to a restriction on the condition number of 2f(x ) to be less than 3. We perform this test on two problem instances with 11 diﬀerent values of the regularization parameter and then compute the median of ck res for each problem. Figure 4 shows the median of the restricted approximation gap ck res and the real condition number of 2f(x ), respectively. As expected, we observe that the real condition number of 2f(x ) increases as the regularization parameter decreases. Moreover, the last condition given in Theorem 14 does not hold in this example. However, if we look at the restricted condition number computed

Composite Self-concordant Minimization

0.2 0.4 0.6

ρ 0.2 0.4 0.6

cond( 2f (x ))

(a) Lymph data set (p = 578)

0.2 0.4 0.6

ρ 0.2 0.4 0.6

cond( 2f (x ))

(b) Estrogen data set (p = 692)

Figure 4: For each test case: (Left) Restricted approximation gap ck res (Right) The actual condition number of 2f(x ).

by (51), we can observe that for ρ 0.3, this value is strictly smaller than 0.5. In this case, the local linear convergence is actually observed in practice. While ck res < 0.5 is only a suﬃcient condition and can possibly be improved, we ﬁnd it to be a good indicator of the convergence behavior. Figure 5 shows the last 100 iterations of our gradient method for the Lymph problem with ρ = 0.15 and ρ = 0.55. The number of iterations needed to achieve the ﬁnal solution in these cases is 1525 and 140, respectively. In the former case, the calculated restricted condition number is above 0.5 and the ﬁnal convergence rate suﬀers. For instance, the contraction factor κ in the estimate xk+1 x x κ xk x x is close to 1 when ρ = 0.15, while it is smaller when ρ = 0.55. We can observe from Figure 5 (left) that the error xk x x drops rapidly at the last few iterations due to the aﬀect of the bisection procedure, where we check the condition (30) for λk < 1.

0 20 40 60 80 100

100 last iterations

xk x x in log-scale

0 20 40 60 80 100

100 last iterations

xk x x in log-scale based line with κ = 0.9935

xk x x in log-scale based line with κ = 0.9462

Figure 5: Linear convergence of Prox Grad1 for Lymph: Left: ρ = 0.15 and Right: ρ = 0.55.

5.2.2 TVℓ1-regularizer

In this experiment, we consider the Poisson intensity reconstruction problem, where the regularizer g, the TVℓ1-norm which is called the anisotropic-TV; as an example, cf. (Beck and Teboulle, 2009b). Hence, we implement Algorithm 5 (Prox Grad2) to solve (3), improve it using the greedy step-size modiﬁcation as described in Section 3.3 (Prox Grad2g), and

Tran-Dinh, Kyrillidis, and Cevher

compare its performance with the state-of-the-art Sparse Poisson Intensity Reconstruction Algorithms (SPIRAL-TAP) toolbox (Harmany et al., 2012). As a termination criterion, we have dk g 2 10 5 max 1, xk 2 or when the objective value does not signiﬁcantly change after 5 successive iterations, i.e., for each k, f(xk+j) f(xk) 10 8 max 1, f(xk) for j = 1, . . . , 5. We ﬁrst illustrate the convergence behavior of the three algorithms under comparison. We consider two image test cases: house and cameraman, and we set the regularization parameter of the TVℓ1-norm to ρ = 2.5 10 5. Figure 9 illustrate the convergence of the algorithms both in iteration count and the timing.

100 200 300 400 500 10 8

F ( xk) F ( x )

|F ( x ) | in log-scale

# of iterations

Prox Grad2 Prox Grad2g SPIRAL TAP

10 4 10 2 10 0 10 2 10 4 10 8

F ( xk) F ( x )

|F ( x ) | in log-scale

Time (sec.) in log-scale

Prox Grad2 Prox Grad2g SPIRAL TAP

100 200 300 400 500

F ( xk) F ( x )

|F ( x ) | in log-scale

# of iterations

Prox Grad2 Prox Grad2g SPIRAL TAP

10 4 10 2 10 0 10 2 10 4

F ( xk) F ( x )

|F ( x ) | in log-scale

Time (sec.) in log-scale

Prox Grad2 Prox Grad2g SPIRAL TAP

Figure 6: Convergence of three algorithms for house (top) and cameraman (bottom). Left: in iteration scale Right: in time log-scale. Overall, Prox Grad2g exhibits the best convergence behavior in terms of iterations and time. Due to the inaccurate solutions of the subproblem (49), the methods might exhibit oscillations. Since SPIRAL-TAP employs a Barzilai-Borwein step-size and performs a linesearch procedure up to very small step-size, the objective value is not suﬃciently decreased; as a result of this, we observe more oscillations in the objective value. In stark contrast, Prox Grad2 and Prox Grad2g use the Barzilai-Borwein step-size as an initial-guess for computing a search direction and then use the step-size correction procedure to ensure that the objective function decreases a certain amount at each iteration. This strategy turns out to be more eﬀective since milder oscillations in the objective values are observed in practice (which are due to the inaccuracy of the TV-proximal operator). Finally, we test the performance of Prox Grad2, Prox Grad2g and SPIRAL-TAP on 4 different image cases: barbara, cameraman, house and lena. We set ρ to two diﬀerent values: ρ {10 5, 2.5 10 5}. These values are chosen in order to obtain the best visual

Composite Self-concordant Minimization

Orginal image

Poisson noise image

Reconstructed image [Prox Grad2]

Reconstructed image [SPRIAL TAP]

Figure 7: The reconstructed images for barbara (ρ = 2.5 10 5)

reconstructions (e.g., see Figure 7) and are previously used in (Harmany et al., 2012). The summary results reported in Table 4. Here, AC denotes the multiplicative factor in time acceleration of Prox Grad2 as compared to SPIRAL-TAP, and F is the diﬀerence between the corresponding obtained objective values between Prox Grad2 and SPIRAL-TAP (a positive F means that SPIRAL-TAP obtains a higher objective value at termination).

Prox Grad2g / Prox Grad2 / SPIRAL-TAP

Image ρ 10 5 #iteration CPU time [s] AC F k min F

house 1.0 116 256 500 27.45 56.95 1658.00 60 29 -10718352.93 0.31 0.70 (256 256) 2.5 92 244 500 18.18 50.26 1431.94 79 28 -10711758.80 3.20 3.32

barbara 1.0 200 324 500 46.92 77.77 1204.36 26 15 -7388497.47 0.05 0.30 (256 256) 2.5 164 268 500 36.45 67.98 1620.95 44 24 -7377424.50 1.90 2.02

cameraman 1.0 396 516 500 99.56 117.75 389.79 4 3 -9186631.65 0.19 0.07 (256 256) 2.5 256 368 500 59.75 85.25 1460.62 24 17 -9175307.33 2.29 2.31

lena 1.0 152 220 500 27.43 41.31 1212.69 44 29 -5797053.79 0.10 0.10 (204 204) 2.5 304 184 500 59.20 36.77 1132.04 19 31 -5789554.53 1.52 1.25

Table 4: The results and performance of three algorithms

From Table 4 we observe that Prox Grad2 and Prox Grad2g are superior to SPIRAL-TAP, both in terms of CPU time and the ﬁnal objective value in majority of problems. As the table shows, Prox Grad2g can be 4 to 79 times faster than SPIRAL-TAP. Moreover, it reports a better objective values in all cases.

5.2.3 A Comparison to Standard Gradient Methods Based on FL Assumption

In this subsection, we use the LASSO problem (50) with unknown variance as a simple test case to illustrate the improvements over the standard methods. Note that the standard Lipschitz gradient assumption no longer holds in this example due to the log-term log(σ). For this comparison, we dub our algorithm as Prox Grad3(g) and compare it against a stateof-the-art TFOCS software package (Becker et al., 2011). The input data is synthetically generated based on the linear model y = Xβ + s, where β is the true sparse parameter vector; X is a Gaussian n p matrix and s N(0, σ2), where σ = 0.01. In TFOCS, we conﬁgure the Nesterov s accelerated algorithm with two proximal operations (TFOCS-N07) and adaptive restart as well as the standard gradient method (TFOCS-GRA). Both options

Tran-Dinh, Kyrillidis, and Cevher

use a backtracking step-size selection procedure due to the presence of the logarithmic term in the objective. As we can see in Figure 9 and Table 5 that Prox Grad3g performs the best and manages to converge to a high accuracy solution at a linear rate in both examples. Interestingly, we ﬁnd the per iteration complexity of Prox Grad3g is similar to Prox Grad3 and TFOCS-GRA. In terms of per iteration cost, TFOCS-N07 is the most expensive one as it uses dual prox operations and adaptive restart, and requires more backtracking operations. Hence, while it takes less iterations as compared to the TFOCS-GRA, it performs worse in terms of timing. For illustration purposes, we ran the algorithms to high accuracy. However, if a typical stopping criteria such as 10 6 is used, our algorithm Prox Grad3g obtains 3 to 8 speed-ups over the standard gradient algorithm with backtracking enhancements.

20 40 60 80

F ( xk) F ( x )

|F ( x ) | in log-scale

# of iterations

Prox Grad3 Prox Grad3g TFOCS N07 TFOCS GRA

20 40 60 80 100

F ( xk) F ( x )

|F ( x ) | in log-scale

# of iterations

Prox Grad3 Prox Grad3g TFOCS N07 TFOCS GRA

F ( xk) F ( x )

|F ( x ) | in log-scale

# of iterations

Prox Grad3 Prox Grad3g TFOCS N07 TFOCS GRA

Figure 8: Convergence plots of algorithms under comparison for n = 3000 and p = 10000. From left to right, ρ = 10 3, 2

3 10 4, 5 10 4.

20 40 60 80 100

F ( xk) F ( x )

|F ( x ) | in log-scale

# of iterations

Prox Grad3 Prox Grad3g TFOCS N07 TFOCS GRA

20 40 60 80 100 120

F ( xk) F ( x )

|F ( x ) | in log-scale

# of iterations

Prox Grad3 Prox Grad3g TFOCS N07 TFOCS GRA

F ( xk) F ( x )

|F ( x ) | in log-scale

# of iterations

Prox Grad3 Prox Grad3g TFOCS N07 TFOCS GRA

Figure 9: Convergence plots of algorithms under comparison for n = 15000 and p = 50000. From left to right, ρ = 2 10 4, 4

3 10 4, 10 4.

6. Conclusions

We propose a variable metric method for minimizing convex functions that are compositions of proximity functions with self-concordant smooth functions. Our framework does not rely on the usual Lipschitz gradient assumption on the smooth part for its convergence theory. A highlight of this work is the new set of analytic step-size selection and correction procedures, which are best matched to the underlying problem structures. Our empirical results illustrate that the new theory leads to signiﬁcant improvements in the practical performance of the algorithmic instances when tested on a variety of diﬀerent applications. In this work, we present a convergence proof for composite minimization problems under the assumption of exact algorithmic calculations at each step of the methods. As future research direction, an interesting problem to pursue is the extension of this analysis to

Composite Self-concordant Minimization

Problem Prox Grad3 / Prox Grad3g / TFOCS-N07 / TFOCS-GRA

(3000, 10000) #iteration CPU time [s] β 0 bβ 0 Overlap (%)

ρ = 10 3 36 24 79 88 1.0096 0.7862 3.2759 1.7648

3 10 4 54 54 94 119 1.2974 1.2918 3.6499 2.4002 378 92.22

ρ = 5 10 4 78 78 97 166 1.7420 1.7513 3.7794 3.3416 412 100

(15000, 50000) #iteration CPU time [s] β 0 bβ 0 Overlap (%)

ρ = 2 10 4 36 30 99 110 21.7937 19.3241 82.3298 46.0475

3 10 4 60 54 108 136 31.7884 29.1194 89.4279 57.9088 1886 87.91

ρ = 10 4 90 90 113 166 44.2692 44.0611 95.3060 70.0946 2201 100

Table 5: Metadata on the Lasso problem with unknown variance

include inexact calculations and study how these errors propagate into the convergence and convergence rate guarantees (Kyrillidis et al., 2014). We hope this paper triggers future eﬀorts along this direction.

Acknowledgments

This work is supported in part by the European Commission under Grant MIRG-268398, ERC Future Proof and SNF 200021-132548, SNF 200021-146750 and SNF CRSII2-147633. The authors are also grateful to three anonymous reviewers as well as to the action editor for their thorough reviews of this work, comments and suggestions on improving the content and the presentation of this paper.

Appendix A. Technical proofs

In this appendix, we provide the detailed proofs of the theoretical results in the main text. It consists of global convergence and local convergence rate of our algorithms and other technical proofs.

A.1 Proof of Lemma 4

Since g is convex, we have g(y) g(x) + v T (y x) for all v g(x). By adding this inequality to (9) and noting that F(x) := f(x) + g(x), we obtain

F(y) F(x) + ( f(x) + v)T (y x) + ω( y x x) (52) F(x) λ(x) y x x + ω( y x x).

Here, the last inequality is due to the generalized Cauchy-Schwartz inequality and λ(x) := f(x) + v x. Let LF (F(x)) := {y dom(F) | F(y) F(x)} be a sublevel set of F. Then, for any y LF (F(x)), we have F(y) F(x) which leads to

λ(x) y x x ω( y x x),

due to (52). Let s(t) := ω(t)

t = 1 ln(1+t)

t . The last inequality leads to s( y x x) λ(x). Since the equation ln(1 + t) = (1 λ(x)) has unique solution t > 0 if λ(x) < 1. Moreover,

Tran-Dinh, Kyrillidis, and Cevher

the function s is strictly increasing and s(t) < 1 for t 0, which leads to 0 t t . Since s( y x x) λ(x), we have y x x t . Thus, LF (F(x)) is bounded. Hence, x exists due to the well-known Weierstrass theorem. The uniqueness of x follows from the strict increase of ω( ). Indeed, for any x dom(F), by the convexity of g we have g(x) g(x ) v T (x x ), where v g(x ). By the selfconcordant property of f, we also have f(x) f(x ) f(x )T (x x ) + ω( x x x ). Adding these inequalities and using the optimality condition (11), i.e., 0 = v + f(x ), we have F(x) F(x ) ω( x x x ). Now, let ˆx = x is also an optimal solution of (1). We have 0 = F(ˆx ) F(x ) ω( x x x ) > 0, which leads to a contradiction. This implies that x ˆx .

A.2 Proofs of Global Convergence: Theorem 6, Lemma 12 and Theorem 13

In this subsection, we provide the proofs of Theorem 6, Lemma 12 and Theorem 13 in a uniﬁed fashion. We ﬁrst provide a key result quantifying the improvement of the objective as a function of the step-size αk. Maximum decrease of the objective function: Let βk := dk Hk, λk := dk xk and:

xk+1 := xk + αkdk = (1 αk)xk + αksk,

where αk := β2 k λk(λk+β2 k) (0, 1]. We will prove below that the following holds at each iteration of the algorithms:

F(xk+1) F(xk) ω β2 k λk

Moreover, the choice of αk is optimal (in the analytical worst-case sense). Proof Indeed, since g is convex and αk (0, 1], we have g(xk+1) = g (1 αk)xk + αksk (1 αk)g(xk) + αkg(sk), which leads to

g(xk+1) g(xk) αk(g(sk) g(xk)). (54)

For xk+1 dom(F) so that xk+1 xk xk < 1, the bound (10) holds. Combining (54) with the self-concordant property (10) of f, we obtain

F(xk+1) F(xk) + f(xk)T (xk+1 xk) + ω xk+1 xk xk + αk g(sk) g(xk)

(55) (16) F(xk) + αk f(xk)T dk + ω αk dk xk + αk g(sk) g(xk) .

Since sk is the unique solution of (15), by using the optimality condition (17), we get

f(xk) Hk(sk xk) g(sk)

(56) f(xk)T (sk xk) sk xk 2 Hk (sk xk)T g(sk).

Combining (56) with g(xk) g(sk) v T (xk sk), v (sk), due to the convexity of g( ), we have g(sk) g(xk) f(xk)T (sk xk) sk xk 2 Hk. (57)

Composite Self-concordant Minimization

Using (57) in (55) together with the deﬁnitions of βk and λk, we obtain

F(xk+1) (16) F(xk) αkβ2 k + ω (αkλk) . (58)

Let us consider the function ϕ(α) := αβ2 k ω (αλk). By the deﬁnition of ω ( ), we can easily show that ϕ(α) attains the maximum at:

αk := β2 k λk(λk + β2 k), (59)

provided that αk (0, 1]. We note that the choice of αk as (59) preserves the condition xk+1 xk xk < 1. Moreover, ϕ(αk) = ω(β2 k/λk), which proves (53).

Proof of Theorem 6: Since Hk := 2f(xk), we observe βk := dk Hk dk xk =: λk, where dk dk n. In this case, the step size αk in (59) becomes αk = 1 1+λk which is in (0, 1). Moreover, (53) reduces to:

F(xk+1) F(xk) ω(λk),

which is indeed (21). Finally, we assume that, for a given σ (0, 1), we have λk σ for 0 k kmax 1. Since ω strictly increases, it follows from (21) by induction that:

F(x ) F(xk) F(x0)

j=0 ω(λj) F(x0) kω(σ).

This estimate shows that the number of iterations to reach λk < σ is at most kmax = j F(x0) F(x )

ω(σ) k + 1.

Proof of Lemma 12: Proof of Lemma 12 immediately follows from (53) by taking Hk Dk and dk dk g. Proof of Theorem 13: We consider the sequence F(xk)

k 0. By Lemma 12, this sequence is nonincreasing. Moreover, F(x0) F(xk) F(x ) for all k 0. As a result, the sequence F(xk)

k 0 converges to a ﬁnite value F . By Lemma 12, we can derive

dj g 2 Dj dj g xj

F(x0) F < + .

Since the function ω(τ) = τ ln(1 + τ) τ 2

4 for τ (0, 1] is increasing, this implies that limj dj g 2 2/ dj g xj = 0 due to the fact that Dk LI 0. Since LF (F(x0)) is bounded, by applying Zangwill s convergence theorem in (Zangwill, 1969), we can show that every limit point x of the sequence xk

k 0 is the stationary point of (11). Since x is unique, the whole sequence xk

k 0 converges to x .

Tran-Dinh, Kyrillidis, and Cevher

A.3 Proofs of Local Convergence: Theorem 7, Theorem 11 and Theorem 14

We ﬁrst provide a ﬁxed-point representation of the optimality conditions and prove some key estimates used in the sequel. Optimality conditions as ﬁxed-point formulations: Let f be a given standard selfconcordant function, g be a given proper, lower semicontinuous and convex function, and Hk be a given symmetric positive deﬁnite matrix. Besides the two key inequalities (9) and (10), we also need the following inequality (Nesterov and Nemirovski, 1994; Nesterov, 2004, Theorem 4.1.6) in the proofs below:

(1 y x x)2 2f(x) 2f(y) (1 y x x) 2 2f(x), (60)

for any x, y dom(f) such that y x x < 1. Let x be the unique solution of (1) and x be strongly regular, i.e., 2f(x ) 0. Then the Dikin ellipsoid W(x , 1) := {x Rn | x x x < 1} also belongs to dom(f). Moreover, 2f(x) 0 for all x W(x , 1) due to (Nesterov, 2004, Theorem 4.1.5). Hence, the strong regularity assumption is suﬃcient to ensure that 2f is positive deﬁnite in the neighborhood W(x , 1). For a ﬁxed x dom(F), where F := f + g, we redeﬁned the following operators, based on the ﬁxed-point characterization and (15):

P g x(z) := P g 2f( x)(z), S x(z) := 2f( x)z f(z), (61)

and e x(Hk, z) := 2f( x) Hk (z xk). (62)

Here, P g x and S x can be considered as a generalized proximal operator of g and the gradient step of f, respectively. While e x(Hk, ) measures the error between 2f( x) and Hk along the direction z xk. Next, given sk is the unique solution of (15), we characterize the optimality condition of the original problem (1) and the subproblem (15) based on the P g x, S x and e x(Hk, ) operators. From (17), we have

S x(xk) + e x(Hk, sk) 2f( x)sk + g(sk).

By the deﬁnition of P g x in (61), the above expression leads to

sk = P g x S x(xk) + e x(Hk, sk) . (63)

By replacing x with x , i.e., the unique solution of (1), into (63) we obtain

sk = P g x Sx (xk) + ex (Hk, sk) . (64)

Moreover, if we replace Hk by 2f(x ) (which is assumed to be positive deﬁnite) in the ﬁxed-point expression (12), we ﬁnally have

x = P g x (Sx (x )) . (65)

Formulas (63) to (65) represent the ﬁxed-point formulation of the optimality conditions.

Composite Self-concordant Minimization

Key estimates: Let rk := xk x x and λk be deﬁned by (18). For any αk (0, 1], we prove the following estimates:

sk+1 n sk n xk α2 kλ2 k 1 αkλk + 2αkλk α2 kλ2 k (1 αkλk)2 dk+1 xk, (66)

sk x x r2 k 1 rk + (Hk 2f(x ))dk x , (67)

provided that αkλk < 1, rk < 1 and the ﬁrst estimate (66) requires Hk = 2f(xk). Proof First, by using the nonexpansiveness of P g xk in Lemma 2, it follows from (63) that:

sk+1 sk xk = P g xk(Sxk(xk+1) + exk(Hk+1, sk+1)) P g xk(Sxk(xk) + exk(Hk, sk)) xk

(8) Sxk(xk+1) + exk(Hk, sk) Sx (x )

f(xk+1) f(xk) 2f(xk)(xk+1 xk)

xk (68) + exk(Hk+1, sk+1) exk(Hk, sk)

2f(xk + τ(xk+1 xk)) 2f(xk) (xk+1 xk)dτ

+ exk(Hk+1, sk+1) exk(Hk, sk)

where (i) is due to the mean-value theorem, respectively. Second, we estimate the ﬁrst term in (68). For this purpose, we deﬁne

Σk := R 1 0 2f(xk + τ(xk+1 xk)) 2f(xk) dτ, Mk := 2f(xk) 1/2Σk 2f(xk) 1/2. (69)

Based on the proof of (Nesterov, 2004, Theorem 4.1.14), we can show that:

Mk 2 xk+1 xk xk 1 xk+1 xk xk .

Using this estimate, the deﬁnition (69) and noting that xk+1 = xk + αkdk, we obtain

Σk(xk+1 xk) xk = (xk+1 xk)T Σk 2f(xk) 1Σk(xk+1 xk) 1/2

= (xk+1 xk)T 2f(xk)1/2MT k Mk 2f(xk)1/2(xk+1 xk) 1/2

= Mk 2f(xk)1/2(xk+1 xk) 2

(i) Mk 2 (xk+1 xk)T 2f(xk)(xk+1 xk) 1/2 (70)

= Mk 2 xk+1 xk xk

xk+1 xk 2 xk 1 xk+1 xk xk

= α2 k dk 2 xk 1 αk dk xk ,

Tran-Dinh, Kyrillidis, and Cevher

where (i) is due to the Cauchy-Schwartz inequality. Third, we consider the second term in (68) for Hk 2f(xk). By the deﬁnition of e x, it is obvious that exk( 2f(xk), sk) = 0. Hence, we have

T2 := exk( 2f(xk+1), sk+1) exk( 2f(xk), sk) xk = exk( 2f(xk+1), sk+1) xk (71)

= 2f(xk+1) 2f(xk) dk+1

We now deﬁne the following quantity, whose spectral norm we bound below

Nk := 2f(xk) 1/2 2f(xk+1) 2f(xk) 2f(xk) 1/2. (72)

By applying (60) with x = xk and y = xk+1, we can bound the spectral norm of Nk as follows:

Nk 2 max n 1 1 xk+1 xk xk 2, 1 xk+1 xk xk 2 1 o

= 2 xk+1 xk xk xk+1 xk 2 xk (1 xk+1 xk xk)2 .

Therefore, from (71) we can obtain the following estimate:

(T2)2 = exk( 2f(xk+1), sk+1)T 2f(xk) 1exk( 2f(xk+1), sk+1)

= (dk+1)T 2f(xk)1/2 N2 k 2f(xk)1/2 dk+1 (74)

Nk 2 2 dk+1 2 xk.

By substituting (73) into (74) and noting that αkdk = xk+1 xk, we obtain

T2 2αk dk xk α2 k dk 2 xk (1 αk dk xk)2 dk+1 xk. (75)

Now, by substituting (70) and (75) into (68) and noting that Hk 2f(xk), sk sk n, dk dk n and λk dk n xk, we obtain

sk+1 n sk n xk α2 k dk n 2 xk 1 αk dkn xk + 2αk dk n xk α2 k dk n 2 xk (1 αk dkn xk)2 dk+1 n xk.

which is indeed (66). Similarly to the proof of (68) and (70), we have

sk x x (65) = P g x (Sx (xk) + ex (Hk, sk)) P g x (Sx (x )) x

(8) Sx (xk) + ex (Hk, sk) Sx (x ) x

f(xk) f(x ) 2f(x )(xk x )

x + ex (Hk, sk) x (76)

2f(x + τ(xk x )) 2f(x ) (xk x )dτ

x + ex (Hk, sk) x

(70) xk x 2 x 1 xk x x + Hk 2f(x ) dk x ,

Composite Self-concordant Minimization

which is indeed (67) since rk = xk x x .

Proof of Theorem 7: Since xk = sk n dk n due to (20), we have xk+1 = xk + αkdk n = sk n (1 αk)dk n, which leads to

dk+1 n = sk+1 n xk+1 = sk+1 n sk n + (1 αk)dk n.

By applying the triangle inequality to the above expression, we have

dk+1 n xk = sk+1 n sk n + (1 αk)dk n xk sk+1 n sk n xk + (1 αk) dk n xk. (77)

Substituting (66) into (77) we obtain

dk+1 n xk α2 kλ2 k 1 αkλk + 2αkλk α2 kλ2 k (1 αkλk)2 dk+1 xk + (1 αk)λk.

Rearranging this inequality we get

(1 αkλk) 1 αk + (2α2 k αk)λk

1 4αkλk + 2α2 kλ2 k

provided that 1 4αkλk + 2α2 kλ2 k > 0. Now, by applying (60) with x = xk and y = xk+1, one can show that

dk+1 n xk+1 dk+1 n xk 1 αk dkn xk . (79)

We note that 1 4αkλk + 2α2 kλ2 k > 0 if αkλk < 1 1/

2. By combining (78) and (79) we obtain

dk+1 n xk+1 1 αk + (2α2 k αk)λk 1 4αkλk + 2α2 kλ2 k

which is (22). Next, we consider the sequence xk

k 0 generated by damped step proximal Newton method (20) with the step size αk = (1 + λk) 1. It is clear that (22) is transformed into:

λk+1 2λk 1 2λk λ2 k λk. (80)

Assuming λk σ :=

5 2, we can easily deduce that 2λk 1 2λk λ2 k 1 and thus, λk+1 λk. By induction, if λ0 σ then, λk+1 λk for all k 0. Moreover, we have λk+1 2 1 2 σ σ2 λ2 k, which shows that the sequence {λk}k 0 converges to zero at a quadratic rate, which completes the proof of part b).

Now, since αk = 1, the estimate (22) reduces to λk+1 λ2 k 1 4λk+2λ2 k . By the same

argument as in the proof of part b), we can show that the sequence {λk}k 0 converges to zero at a quadratic rate.

Tran-Dinh, Kyrillidis, and Cevher

Finally, we prove the last statement in Theorem 7. By substituting Hk := 2f(xk) into (66), we obtain

sk x x r2 k 1 rk + ( 2f(xk) 2f(x ))dk x . (81)

Let T3 := ( 2f(xk) 2f(x ))dk x . Similarly to the proof of (75), we can show that:

T3 2 xk x x xk x 2 x (1 xk x x )2

dk x αk (2 rk)rk

(1 rk)2 (rk+1 + rk). (82)

Here the second inequality follows from the fact that dk x = αk xk+1 xk x αk[ xk+1 x x + xk x x ] = αk(rk+1 + rk). We also have rk+1 = xk+1 x x = (1 αk)xk + αksk x x (1 αk)rk + αk sk x x . Using these inequalities, (82) and (81) we get

rk+1 (1 αk)rk + αk r2 k 1 rk + α2 k (2 rk)rk

(1 rk)2 (rk+1 + rk). (83)

Rearranging this inequality to obtain

rk+1 1 αk + (2α2 k + 3αk 2)rk + (1 αk α2 k)r2 k 1 2(1 + α2 k)rk + (1 + α2 k)r2 k

We consider two cases: Case 1: αk = 1: We have rk+1 3 rk 1 4rk+2r2 k r2 k. Hence, if rk < 1 1/

2 then 1 4rk+2r2 k > 0.

Moreover, rk+1 rk if 3rk r2 k < 1 4rk + 2r2 k, which is satisﬁed if rk < (7

37)/6 0.152873. Now, if we assume that r0 σ (0, (7

37)/6), then, by induction, we have rk+1 3 σ 1 4σ+2σ2 r2 k. This shows that {rk}k 0 locally converges to 0+ at a quadratic rate. Since rk := xk x x , we can conclude that xk x at a quadratic rate as k .

Case 2: αk = (1 + λk) 1: Since λk = xk+1 xk xk xk+1 x x + xk x x

1 xk x x = rk+1+rk

We have 1 αk rk+1+rk (1+λk)(1 rk) rk+1+rk

1 rk . Substituting this into (83) and using the fact that αk 1, we have

rk+1 (rk+1 + rk)rk

1 rk + r2 k 1 rk + (2 rk)rk

(1 rk)2 (rk+1 + rk).

Rearranging this inequality, we ﬁnally get

rk+1 4 3rk 1 5rk + 3r2 k r2 k. (85)

Since 1 5rk + 3r2 k > 0 if rk < (5

13)/6, we can see from (85) that rk < (9

57)/12 0.120847 then rk+1 rk. By induction, if we choose r0 σ (0, (9

57)/12) then rk+1 4 3 σ 1 5 σ+3 σ2 r2 k, which shows that {rk}k 0 converges to 0+ at a quadratic rate. Consequently, the sequence xk

k 0 locally converges to x at a quadratic rate.

Composite Self-concordant Minimization

Proof of Theorem 11: We ﬁrst prove the statement (a). Since xk+1 sk q due to (25), from (67) we have

rk+1 r2 k 1 rk + Hk 2f(x ) (xk+1 xk) x . (86)

Now, by using the condition (26), we can easily show that the sequence xk

k 0 converges super-linearly to x provided that x0 x x ρ0 < 1. Next, we prove the statement (b). It is well-known (see, e.g., Nocedal and Wright (2006)) that if matrix Hk is positive deﬁnite and (yk)T (zk) > 0 then the matrix Hk+1 updated by (24) is also positive deﬁnite. Indeed, we have (yk)T (zk) = R 1 0 (zk)T 2f(xk + tzk)zkdt. Therefore, under the condition zk xk < 1, we can show that (yk)T (zk)

(zk)T 2f(xk)zk = zk 2 xk > 0. By multiplying (24) by zk we can easily see that Hk+1 satisﬁes the secant equation (23). Finally, we estimate yk 2f(x )zk x as follows:

yk 2f(x )zk x rk + rk+1 (1 rk)(1 rk+1) zk x . (87)

Now, by assumption that P k=0 rk < + , we obtain from (87) that P k=0 εk < + , where εk := rk+rk+1 (1 rk)(1 rk+1). By applying (Byrd and Nocedal, 1989, Theorem 3.2.), we can show

that the Dennis-Mor e condition (26) is satisﬁed. This implies that the sequence xk

k 0 generated by scheme (25) converges super-linearly to x . Proof of Theorem 14: For xk x x < 1, from (67), we have

sk g x x xk x 2 x 1 xk x x + Dk 2f(x ) dk x . (88)

Now, using the condition Dk 2f(x ) dk x (1/2) dk g x , (88) implies:

sk g x x xk x 2 x 1 xk x x + γ dk g x

xk x 2 x 1 xk x x + γ sk g x x + γ xk x x ,

where γ (0, 1/2). Rearranging this inequality, we obtain

sk g x x 1 1 γ

γ + xk x x 1 xk x x

xk x x . (89)

Now, since xk+1 = xk + αkdk g = (1 αk)xk + αksk g, we can further estimate from (89) as

xk+1 x x (1 αk) xk x x + αk sk g x x

1 αk + αk 1 γ

γ + xk x x 1 xk x x

Tran-Dinh, Kyrillidis, and Cevher

Let us deﬁne ψk := (1 αk) + αk 1 γ

γ + xk x x 1 xk x x

. Then, for γ < 1/2, ψk < 1 if xk x x < 1 2γ 2(1 γ). Therefore, by induction, if we choose x0 x x < 1 2γ 2(1 γ), then xk x x < 1 2γ 2(1 γ) for all k 0. Moreover, xk+1 x x ψk xk x x for k 0

and ψk [0, 1). This implies that xk x x

k 0 linearly converges to zero with the

factor ψk. Finally, we assume that Dk := Lk I, the quantity in (72) satisﬁes

N := 2f(x ) 1/2 2f(x ) Hk 2f(x ) 1/2 = I Lk 2f(x ) 1.

Then, we can easily observe that:

N 2 = I Lk 2f(x ) 1 2 max n 1 Lk

o := γ , (91)

where σ min (respectively, σ max) is the smallest (respectively, largest) eigenvalue of 2f(x ). Using the estimate (91), we can derive

Dk 2f(x ) dk g x

(91) N 2 sk xk x γ dk g x ,

which proves the last conclusion of Theorem 14.

K. Banaszek, G. M. D Ariano, M. G. A. Paris, and M. F. Sacchi. Maximum-likelihood estimation of the density matrix. Phys. Rev. A., 61(010304):1 4, 1999.

O. Banerjee, L. El Ghaoui, and A. d Aspremont. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research, 9:485 516, 2008.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183 202, 2009a.

A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans Image Process., 18(11):2419 2434, 2009b.

S. Becker and M.J. Fadili. A quasi-Newton proximal splitting method. In Proceedings of Neutral Information Processing Systems Foundation, 2012.

S. Becker, E. Cand es, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3:165 218, 2011. ISSN 1867-2949.

A. Ben-Tal and A.K. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. SIAM, 2001.

Composite Self-concordant Minimization

D.P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989.

J.F. Bonnans. Local analysis of Newton-type methods for variational inequalities and nonlinear programming. Appl. Math. Optim, 29:161 186, 1994.

S. Boyd and L. Vandenberghe. Convex Optimization. University Press, Cambridge, 2004.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1 122, 2011.

L.M. Briceno-Arias and P.L. Combettes. A monotone + skew splitting model for composite monotone inclusions in duality. SIAM J. Optim., 21(4):1230 1250, 2011.

R. H. Byrd and J. Nocedal. A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM J. Numer. Anal., 26(3):727 739, 1989.

E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313 2351, 2007.

A. Chambolle and T. Pock. A ﬁrst-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120 145, 2011.

E. Chouzenoux, J.-C. Pesquet, and A. Repetti. Variable metric forward-backward algorithm for minimizing the sum of a diﬀerentiable function and a convex function. J. Optim. Theory Appl., DOI 10.1007/s10957-013-0465-7:1 22, 2013.

P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul., 4:1168 1200, 2005.

A. S. Dalalyan, M. Hebiri, K. Meziani, and J. Salmon. Learning heteroscedastic models by convex programming under group sparsity. Proc. of the International conference on Machine Learning, pages 1 8, 2013.

A. P. Dempster. Covariance selection. Biometrics, 28:157 175, 1972.

J.E. Dennis and J. J. Mor e. A characterisation of superlinear convergence and its application to quasi Newton methods. Mathemathics of Computation, 28:549 560, 1974.

J. Eckstein and D. Bertsekas. On the Douglas - Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program., 55:293 318, 1992.

F. Facchinei and J.-S. Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems, volume 1-2. Springer-Verlag, 2003.

D. Goldfarb and S. Ma. Fast alternating linearization methods of minimization of the sum of two convex functions. Math. Program., Ser. A, pages 1 34, 2012.

Tran-Dinh, Kyrillidis, and Cevher

T. Goldstein and S. Osher. The split Bregman method for ℓ1-pegularized problems. SIAM J. Imaging Sciences, 2(2):323 343, 2009.

T. Goldstein, B. ODonoghue, and S. Setzer. Fast alternating direction optimization methods. Tech. report., Department of Mathematics, University of California, Los Angeles, USA, May 2012.

M. Grant, S. Boyd, and Y. Ye. Disciplined convex programming. In L. Liberti and N. Maculan, editors, Global Optimization: From Theory to Implementation, Nonconvex Optimization and its Applications, pages 155 210. Springer, 2006.

Z.T. Harmany, R.F. Marcia, and R. M. Willett. This is SPIRAL-TAP: Sparse poisson intensity reconstruction algorithms - theory and practice,. IEEE Transactions on Image Processing, 21(3):1084 1096, 2012.

Jean-Baptiste Hiriart-Urruty and Claude Lemar echal. Fundamentals of Convex Analysis. Springer, 2001.

C. J. Hsieh, M.A. Sustik, I.S. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. Advances in Neutral Information Processing Systems (NIPS), 24:1 18, 2011.

J. Kim and H. Park. Fast active-set-type algorithms for ℓ1-regularized linear regression. In Proceedings of the 13th International Conference on Artiﬁcial Intelligience and Statistics (AISTATS), volume 9, pages 397 404, Sardinia, Italy, 2010.

A. Kyrillidis and V. Cevher. Fast proximal algorithms for self-concordant function minimization with application to sparse graph selection. Proc. of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1 5, 2013.

A. Kyrillidis, R. Karimi Mahabadi, Q. Tran-Dinh, and V. Cevher. Scalable sparse covariance estimation via self-concordance. In Proc. of the 28th International Conference on Artiﬁcial Intelligence (AAAI-14), pages 1 9. 2014.

J.D. Lee, Y. Sun, and M.A. Saunders. Proximal Newton-type methods for convex optimization. Advances in Neural Information Processing Systems (NIPS), 25:827 835, 2012.

J. L ofberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.

Z. Lu. Adaptive ﬁrst-order methods for general sparse inverse covariance selection. SIAM Journal on Matrix Analysis and Applications, 31(4):2000 2016, 2010.

H. Mine and M. Fukushima. A minimization method for the sum of a convex function and a continuously diﬀerentiable function. J. Optim. Theory Appl., 33:9 23, 1981.

A.S. Nemirovskii and M.J. Todd. Interior-point methods for optimization. Acta Numerica, pages 191 234, 2008.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.

Composite Self-concordant Minimization

Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1):127 152, 2005a.

Y. Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM J. Optimization, 16(1):235 249, 2005b.

Y. Nesterov. Gradient methods for minimizing composite objective function. Math. Program., 140(1):125 161, 2007.

Y. Nesterov. Barrier subgradient method. Math. Program., Ser. B, 127:31 56, 2011.

Y. Nesterov and A. Nemirovski. Interior-point Polynomial Algorithms in Convex Programming. Society for Industrial Mathematics, 1994.

Y. Nesterov and M.J. Todd. Self-scaled barriers and interior-point methods for convex programming. Math. Oper. Research, 22(1):1 42, 1997.

J. Nocedal and S.J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2 edition, 2006.

P.A. Olsen, F. Oztoprak, J. Nocedal, and S.J. Rennie. Newton-like methods for sparse inverse covariance estimation. Advances in Neural Information Processing Systems (NIPS), pages 1 9, 2012.

P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Statist., 5:935 988, 2011.

S. M. Robinson. Strongly Regular Generalized Equations. Mathematics of Operations Research, Vol. 5, No. 1 (Feb., 1980), pp. 43-62, 5:43 62, 1980.

R. T. Rockafellar. Convex Analysis, volume 28 of Princeton Mathematics Series. Princeton University Press, 1970.

R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. Control Opt., 14:877 898, 1976.

B. Rolfs, B. Rajaratnam, D. Guillot, I. Wong, and A. Maleki. Iterative thresholding algorithm for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems 25, pages 1583 1591, 2012.

K. Scheinberg and I. Rish. SINCO-a greedy coordinate ascent method for sparse inverse covariance selection problem. Tech. Report, IBM RC24837:1 21, 2009.

K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. Neural Information Processing Systems (NIPS), pages 1 9, 2010.

M. Schmidt, N.L. Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. Neural Information Processing Systems (NIPS), 2011.

Tran-Dinh, Kyrillidis, and Cevher

N. St adler, P. B ulmann, and S. Van de Geer. ℓ1-Penalization for Mixture Regression Models. Tech. Report., pages 1 35, 2012.

Q. Tran-Dinh, A. Kyrillidis, and V. Cevher. A proximal newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions. International Conference on Machine Learning (ICML), 28(2):271 279, 2013a.

Q. Tran-Dinh, I. Necoara, C. Savorgnan, and M. Diehl. An inexact perturbed path-following method for Lagrangian decomposition in large-scale separable convex optimization. SIAM J. Optim., 23(1):95 125, 2013b.

Q. Tran-Dinh, C. Savorgnan, and M. Diehl. Combining Lagrangian decomposition and excessive gap smoothing technique for solving large-scale separable convex optimization problems. Compt. Optim. Appl., 55(1):75 111, 2013c.

Q. Tran-Dinh, A. Kyrillidis, and V. Cevher. An inexact proximal path-following algorithm for constrained convex minimization. SIAM J. Optimization (accepted), 2014a.

Q. Tran-Dinh, Y. H. Li, and V. Cevher. Barrier smoothing for nonsmooth convex minimization. In Proc. of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1 4, 2014b.

E. van den Berg and M. P. Friedlander. Probing the pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput., 31(2):890 912, 2008.

S. J. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Trans. Signal Processing, 57:2479 2493, 2009.

X. Yuan. Alternating direction method for covariance selection models. Journal of Scientiﬁc Computing, 51(2):261 273, 2012.

W.I. Zangwill. Nonlinear Programming. Prentice Hall, 1969.