# adaptive_three_operator_splitting__efd83cc5.pdf

Adaptive Three Operator Splitting

Fabian Pedregosa 1 2 Gauthier Gidel 3

We propose and analyze an adaptive step-size variant of the Davis-Yin three operator splitting. This method can solve optimization problems composed of a sum of a smooth term for which we have access to its gradient and an arbitrary number of potentially non-smooth terms for which we have access to their proximal operator. The proposed method sets the step-size based on local information of the objective hence allowing for larger step-sizes , only requires two extra function evaluations per iteration and does not depend on any step-size hyperparameter besides an initial estimate. We provide an iteration complexity analysis that matches the best known results for the non-adaptive variant: sublinear convergence for general convex functions and linear convergence under strong convexity of the smooth term and smoothness of one of the proximal terms. Finally, an empirical comparison with related methods on 6 different problems illustrates the computational advantage of the proposed method.

1 Introduction

Minimizing the sum of a smooth and a non-smooth term is at the core of many optimization problems that arise in machine learning and signal processing (Rudin et al., 1992; Cand es et al., 2006; Chambolle & Pock, 2016). In a few but important cases, such as 1 or group lasso regularization, the non-smooth term is simple enough so that its proximal operator is available in closed form or at least fast to compute. In this case, highly scalable methods such as proximal gradient descent (Beck & Teboulle, 2009; Nesterov et al., 2013) or proximal coordinate descent (Richt arik & Tak aˇc, 2014) have shown state of the art performance. However,

1University of California at Berkeley, USA 2Department of Computer Science, ETH Zurich, Switzerland 3Mila - Universit e de Montr eal, Canada. Correspondence to: Fabian Pedregosa <f@bianp.net>.

Proceedings of the 35 th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

the desire to model increasingly complex phenomena has led to the development of a ﬂurry of penalties with costly to compute proximal operator. Examples are the overlapping group lasso (Jacob et al., 2009), multidimensional total variation (Barbero & Sra, 2014) or trend ﬁltering (Kim et al., 2009), to name a few.

A key observation is that, despite the difﬁculty in computing its proximal operator, many of these penalties can be decomposed as a sum of terms for which we have access to their proximal operator. Proximal splitting methods like the three operator splitting (Davis & Yin, 2017) offer a principled way to incorporate these penalties into the optimizer. In this work we will describe a method to solve optimization problems of the form

minimize x Rp f(x) + g(x) + h(x) , (OPT)

where f is convex and Lf-smooth (i.e., differentiable with Lf-Lipschitz gradient) and g, h are both convex but potentially non-smooth. We further assume g and h are proximal, i.e., we have access to the proximal operator.

This formulation allows to express a broad range of problems arising in machine learning and signal processing: the smooth term includes the least squares or logistic loss functions; the two proximal terms can be extended to an arbitrary number via a product space formulation and as we will see in 4.1 include many important penalties such as the group lasso with overlap, total variation, 1 trend ﬁltering, etc. Furthermore, the penalties can be extendedvalued, thus allowing an intersection for convex constraints through the use of the indicator function.

The three operator splitting (TOS) method (Davis & Yin, 2017) is a recently proposed method for problems of the form (OPT). At each iteration, it only requires to evaluate once the gradient of f and the proximal operator of g and h. It also relies on one step-size parameter, and while it can be set based on the Lipschitz constant of the gradient of f, this is not entirely satisfactory for two reasons. First, this constant is often costly to compute. Second, this constant is a global upper bound on the Lipschitz constant, while locally the Lipschitz constant might be smaller, allowing for larger step-sizes.

Adaptive Three Operator Splitting

Adaptive step-size methods, also known as inexact and backtracking line search, instead choose the step-size by verifying a sufﬁcient decrease condition at each iteration. This allows to take larger step-sizes and has proven to be an important ingredient in the practical implementation of ﬁrst and second-order methods (Nocedal & Wright, 2006).

Outline and main contributions. Our main contribution is the development and analysis of an adaptive variant of the TOS algorithm. The proposed algorithm does not depend on any step-size hyperparameter (besides an initial estimate) and enjoys similar convergence guarantees as the non adaptive variant. The paper is organized as follows:

Methods. 2 describes the proposed algorithm, extended in 2.1 to an arbitrary number of proximal terms.

Analysis. 3 provides a convergence analysis based on an interpretation of the algorithm as a saddle-point optimization method. This signiﬁcantly departs from the analysis of Davis & Yin (2017) for the non adaptive variant and results in improved and more general rates.

Applications. 4 discusses the application to different penalties and presents an empirical comparison on 6 different problems and 5 different penalties.

Notation. We denote vectors with boldface lower case letters (i.e., x), and matrices and vector-valued functions in boldface upper case (i.e., X, T ( )). denotes the euclidean vector norm. Given a matrix X Rn p, we denote by X the average along rows, that is, X = 1/n n i=1 Xi. We make extensive use of the proximal operator, deﬁned for a convex function ϕ and γ > 0 as

proxγϕ(x) def = arg min z Rp

2γ x z 2 . (1)

The domain of a function f : Rp ] , ] is dom f def = {x Rp | f(x) < }. The indicator function is denoted ı{condition}, which is 0 if condition is veriﬁed and + otherwise. Basic properties and deﬁnitions of convex functions are provided for convenience in Appendix A.

1.1 Related work

Proximal splitting methods that can solve problems involving a sum of terms by accessing the proximal operators of their constituents can be traced back to the 1970s in the works of Glowinski & Marroco (1975); Gabay & Mercier (1976); Lions & Mercier (1979). There has been a surge in interest in these methods in the last years due to their applicability in machine learning (Parikh & Boyd, 2013), signal processing (Combettes & Pesquet, 2011) and parallel optimization (Boyd et al., 2011).

Algorithms to solve problems of the form (OPT) with two or more proximal terms and a smooth term accessed via its gradient have recently been proposed. Examples are the generalized forward-backward splitting (Raguet et al., 2013), the three operator splitting (TOS) (Davis & Yin, 2017), the primal-dual hybrid gradient (PDHG) method, proposed in (Condat, 2013b; V u, 2013) and analyzed by Chambolle & Pock (2015) and the very recent primal-dual three operator splitting (Yan, 2018). We note that the last two methods can optimize a more general objective function in which h(x) is replaced with h(Kx) for an arbitrary matrix K. The original formulation of these methods requires to set the step-size based on criteria such as the Lipschitz constant of the gradient of the smooth term, but variants with adaptive step-size have recently emerged.

An adaptive step-size variant of the PDHG algorithm has recently been proposed by Malitsky & Pock (2018, 5). Compared to the proposed method, it requires one less function evaluation per iteration but since the original algorithm has two step-sizes, it still relies on one step-size hyperparameter. Convergence rates are not derived.

A different adaptive step-size strategy was proposed by Giselsson et al. (2016) as a general scheme for averaged operators. TOS is averaged for step-sizes < 2/Lf, and we denote the combination of both methods LSAO-TOS. An O(1/

t) convergence rate in terms of the operator residual norm is derived. Unfortunately, this quantity is difﬁcult to relate to the more common objective function suboptimality used in the other contributions.

Another adaptive step-size variant of TOS was proposed without proof in the technical report Davis & Yin (2015, Algorithm 3). It uses the same sufﬁcient decrease inequality as our method, although the iterates are deﬁned differently. We found the algorithm sometimes non-convergent and did not consider it further.

In contrast, we provide a convergence analysis for our method that achieves a O(1/t) convergence rate for the ergodic (i.e., averaged) iterate, and linear convergence under stronger assumptions, matching and in some cases even improving the best known rates of the non adaptive variant.

Method Adaptive Sublinear rate Linear rate Adaptive TOS

(this work)

TOS (Davis & Yin, 2017)

LSAO-TOS (Giselsson et al., 2016) 1

PDHG (Condat, 2013b; V u, 2013)

PDHG-LS (Malitsky & Pock, 2018)

1Convergence rate in terms of operator residuals.

Adaptive Three Operator Splitting

In this section we present our main contribution, a three operator splitting method with adaptive step-size. The method is detailed in Algorithm 1 and requires at each iteration to evaluate once the gradient of f and the proximal operators of g and h, and perform two function evaluations of f. At iteration t the candidate step-size γt is chosen as to verify the following sufﬁcient decrease condition between the iterates zt and xt+1 (Line 4):

f(xt+1) Qt(xt+1, γt) , with Qt deﬁned as

Qt(x, γ) def = f(zt) + f(zt), x zt + 1

This inequality can be interpreted as a quadratic upper bound condition on f at xt+1: the right-hand side is a quadratic Qt which is tangent to f at zt with amplitude (2γt) 1, and both sides are evaluated at xt+1, deﬁned in Line 3. The underx zt xt+1

lying principle of choosing the step-size based on the minimization of a quadratic upper bound has already been successful for the proximal-gradient method, where it is also referred to as backtracking (Beck & Teboulle, 2009) or full relaxation (Nesterov et al., 2013). In fact, the proposed method coincides with the aforementioned when one of the proximal terms is constant.

By the properties of Lf-smooth functions, the sufﬁcient decrease condition is veriﬁed for any γt 1/Lf. Hence the step-size search loop always has a ﬁnite terminationand the step-size is lower bounded by γt min{τ/Lf, γ0}. The practical advantage of this strategy is that it allows to consider a step-size potentially larger than 1/Lf and verify whether the above is veriﬁed at each iteration. If it is, then the algorithm uses the current step-size, and if not, it decreases the step-size by a factor which we denote τ.

Growing step-size strategies. We consider two different strategies to initialize next iterate step-size. The ﬁrst strategy (Variant 1) is the simplest and consists in initializing the next step-size with the current one (Line 12). In this variant, the step-size is only allowed to decrease.

The second strategy (Variant 2) allows the step-size to increase but in exchange requires the proximal term h to be Lipschitz continuous (note, not smooth as f but only Lipschitz). This is the case of most penalties (i.e., 1, group lasso, total variation, etc.) but not of indicator functions and so is less general than the ﬁrst variant. As we will see in the applications section, the ability to grow the step-size has an important effect on its empirical performance.

Algorithm 1: Adaptive Three Operator Splitting Input: z0 Rp, u0 Rp, γ0 > 0, τ (0, 1)

1 for t = 0, 1, 2, . . . do

2 repeat step-size search loop

3 xt+1 = proxγtg(zt γtut γt f(zt))

4 if f(xt+1) Qt(xt+1, γt) then

5 break sufﬁcient decrease veriﬁed

7 γt = τγt decrease step-size

8 zt+1 = proxγth(xt+1 + γtut)

9 ut+1 = ut + (xt+1 zt+1)/γt

10 choose step-size for next iteration, two variants

11 Variant 1

12 γt+1 = γt

13 Variant 2 only if h is βh-Lipschitz

14 δt = Qt(xt+1, γt) f(xt+1)

15 Choose any γt+1 [γt,

γ2 t + γtδt(2βh) 2]

16 return xt+1, ut+1

Initial and default values. The proposed method takes as input 4 parameters, which we brieﬂy discuss, together with a growing step-size heuristic for Variant 2:

Initial guess z0 and u0. z0 is an initial guess of the primal problem (OPT), while u0 is an initial guess for a minimizer of a (yet to be deﬁned) dual function (11). In practice, we initialize both variables to zero.

Initial step-size γ0. To estimate a starting value for the step-size, we start with ε = 10 3, z = z0 ε f(z0) and divide ε by 10 until f( z) f(z0). Then we solve f( z) = Q0( z) for γ0 and double that estimate, giving

γ0 = 4(f(z0) f( z)) f(z0) 2 . (2)

The line search decrease parameter τ regulates the factor by which the step-size is decreased each time the line search condition is unsuccessful. This is a parameter that is common to all line search methods and can be set to any value τ (0, 1). Following (Malitsky & Pock, 2018) we set it to τ = 0.7.

step-size growth. Variant 2 allows the step-size to grow by an amount that depends on β 2 h . This quantity can be arbitrarily large (e.g., vanishing regularization), and so choosing the largest admissible step-size might result in too many decrease corrections. This can be avoided e.g. by limiting its growth to double every 20 iterations. Line 15 then becomes:

γt+1 = min{γt20.05,

γ2 t + γtδt(2βh) 2} . (3)

Adaptive Three Operator Splitting

Upon termination, the algorithm returns two vectors. The ﬁrst vector is an approximate solution to (OPT), while the second vector is an approximate minimizer of a dual objective which we will detail in 3.

Special cases and related methods. We mention two notable special cases of this algorithm. First, for any step-size γt 1/Lf, the line search condition will always succeed by the properties of Lf-smooth functions and so the stepsize in Variant 1 is constant. Deﬁning yt = xt +γtut 1, it is easy to verify that Algorithm 1 (Variant 1) can be written with a constant step-size γ = γt as an iteration of the form

zt = proxγh(yt)

xt+1 =proxγg(2zt yt γ f(zt))

yt+1 = yt zt + xt+1 ,

which is the standard (non-overrelaxed) form of the three operator splitting (Davis & Yin, 2017, Algorithm 1). The adaptive variant requires two extra function evaluations f(zt) and f(xt+1) for the line search condition in Line 4, but as we will see in the experimental section, most often the ability to take larger step outweighs this extra cost.

Second, for h = 0, we have from lines 8 and 9 that ut = 0 and in this case (ignoring growing step-size strategies), this algorithm simpliﬁes to the proximal gradient descent with line search of (Beck & Teboulle, 2009).

Algorithm 1 can be written equivalently in a way that highlights similarities and differences with the PDHG method. Using Moreau s decomposition proxγh(x) = x γ proxγh (x/γ) yields the following recurrence

ut+1 = proxh /γ(ut + xt/γ) , (5)

xt+2 = proxγg(xt+1 γ( f(zt+1) + 2ut+1 ut)) .

This form is almost identical to Algorithm 3.2 in (Condat, 2013b), but with a different step-size and the gradient evaluated at the extrapolated zt+1 = xt+1 γ(ut+1 ut) instead of the previous iterate xt+1 in PDHG.

2.1 Extension to multiple proximal terms

We now consider the problem of minimizing an objective of the form:

minimize x Rp ϕ(x) + k j=1 hj(x) , (OPT-k)

where ϕ is Lϕ-smooth and each hj is proximal. The adaptive three operator splitting can be used to solve problems of this form by reducing them to a problem of the form (OPT) in an enlarged space. Consider consider the

following problem in Rk p,

minimize X Rk p ϕ(X) =f(X)

+ k j=1 hj(Xj) =h(X)

+ ı{X1 = =Xk} =g(X)

It is easy to see that this problem shares the same set of solutions as (OPT-k) with the correspondence x = X, as the last term forces all the Xi terms to be equal. In this formulation the ﬁrst term is smooth, the second term is proximal (variables in hi are separated) and the proximal operator of the last term is given by X 1T . Hence Algorithm 1 can be applied to this problem. Deriving the complete algorithm is now merely a matter of replacing f, g, h by its appropriate values in Algorithm 1 and is speciﬁed in Appendix B. The resulting adaptive algorithm seems to be new also in this extended formulation.

It is also possible to swap the deﬁnitions of g and h, which results in a different algorithm that can be seen as an adaptive variant of the Generalized Forward-Backward splitting of Raguet et al. (2013). However, this formulation is less convenient for our purpose, since in this case the h term is always an indicator function and so it would not be possible to apply variant 2 of our algorithm.

In this section we provide a convergence rate analysis of the proposed method. We start by a characterization the set of ﬁxed points of the algorithm, followed by a discussion on the gap function used to measure suboptimality. Finally, we present convergence rates for two different function classes. All proofs can be found in Appendix C.

Assumption 1: Regularity. We assume that f is convex and Lf-smooth in Rp and that g and h are proper (i.e., have nonempty domain), lower semicontinuous (i.e., its sublevel sets are closed) convex functions. We note that lower semicontinuity is a weak form of continuity that allows extended-valued functions (such as the indicator function) over a closed domain.

Assumption 2: Qualiﬁcation conditions. We assume the relative interior of dom g and dom h have a non-empty intersection. This is a weak and standard assumption that, together with the regularity assumption, guarantees strong (also known as total) duality (Bertsekas, 2015, Prop. 5.3.8).

Using the deﬁnition of Fenchel conjugate, we can can reformulate (OPT) as a saddle-point problem:

min x Rd f(x) + g(x) + h(x) (6)

= min x Rd f(x) + g(x) + max u Rd x, u h (u) (7)

= min x Rd max u Rd f(x) + g(x) + x, u h (u)

def =L(x,u)

Adaptive Three Operator Splitting

We recall that a saddle point of L is a pair (x , u ) such that the following is veriﬁed for any (x, u) in the domain (Hiriart-Urruty & Lemar echal, 1993, 4.1):

L(x , u) L(x, u ) . (9)

A consequence of strong duality is the equivalence between the saddle points of L and the minimizers of the primal and dual objectives. Let P and D denote these primal and dual objectives:

P(x) def = f(x) + g(x) + h(x) (10)

D(u) def = (f + g) ( u) + h (u). (11)

Then if (x , u ) is a saddle point of L, x is a minimizer of P and u is a minimizer of D. Likewise, a pair of minimizers of P and D form a saddle point of L.

3.1 Fixed point characterization

A common ﬁrst step in the analysis of optimization methods is the study of its set of ﬁxed or stationary points. While this does not necessarily imply convergence, knowing which elements will be left invariant by the method improves our understanding and is a stepping stone for further analysis. We will show that the set of ﬁxed points of the algorithm has a particularly simple and elegant structure: the Cartesian product of primal and dual solutions.

For the purpose of analysis it will be useful to express Algorithm 1 as an iteration of the form, (zt+1, ut+1) = T γt(zt, ut), where the operator T γ is deﬁned as

T γ(z, u) def = (z+, u+), with (12)

z+ = proxγh(x(z, u) + γu) u+ = u + (x(z, u) z+)/γ x(z, u) = proxγg(z γ(u + f(z))) .

The following theorem characterizes the set of ﬁxed points of this operator, denoted Fix(T γ).

Theorem 1. Let P denote the set of minimizers of the primal objective P and D the set of minimizers of the dual objective D . Then the ﬁxed points of T γ are given by

Fix(T γ) = P D . (13)

3.2 Gap function

The progress of optimization methods is commonly measured in terms of a gap or merit function that is zero at optimum and nonzero otherwise. An appropriate gap function for many ﬁrst-order methods is the suboptimality of the objective function. However, this is not an appropriate suboptimality measure for this algorithm, as the objective

function might be + at an iterate, for example when the two proximal terms are an indicator function.

Davis & Yin (2015) avoid the issue by either evaluating h and g at different iterates (Davis & Yin, 2015, Corollary D.5.1) or assuming Lipschitz continuity of one of the proximal terms (Davis & Yin, 2015, Corollary D.5.2).

In this work we take an alternative approach, and instead use the following saddle point suboptimality criterion to measure the progress of our algorithm:

L(xt+1, u) L(x, ut+1) . (14)

From the deﬁnition of saddle point in Eq. (9), this criterion is non-positive for all (x, u) if and only if (xt+1, ut+1) is a saddle point, and is so an appropriate suboptimality criterion. Furthermore, contrary to the primal objective function, this is deﬁned for all iterates without further assumptions. Finally, we mention that this criteria has been previously used in the analysis of primal-dual methods, see e.g., Chambolle & Pock (2016; 2015) and Gidel et al. (2017) for a discussion of saddle point gap functions.

This suboptimality criteria can also be related to the primal and dual gap, as minimizing (14) over x and maximizing over u one recovers the primal-dual gap P(xt) D(ut) by deﬁnition of Fenchel conjugate.

3.3 Sublinear convergence

The following theorem gives a sublinear convergence rate for Algorithm 1. This convergence will be given in terms of the weighted ergodic (i.e., averaged) sequence. Denoting by st the sum of all step-sizes up to iteration t, i.e., st def = t 1 i=0 γt, the ergodic iterates xt and ut are deﬁned as

xt def = t 1

i=0 γixi+1 /st , ut def = t 1

i=0 γiui+1 /st . (15)

While results in this subsection will be stated in terms of this ergodic sequence, in practice the last iterate gives most often a better empirical convergence, see e.g., (Chambolle & Pock, 2015, 7.2.1) for a discussion of this phenomenon. For a more theoretically-sound algorithm, one can compare the objective at the ergodic and last iterate, and return the one with smallest objective.

Theorem 2 (sublinear convergence rate). For every t 0 and any (x, u) in the domain of L we have the following convergence rate for Algorithm 1 (both variants):

L(xt+1, u) L(x, ut+1) z0 x 2 + γ2 0 u0 u 2

Convergence in terms of function value suboptimality. The previous result gives an O(1/t) convergence rate for

Adaptive Three Operator Splitting

arbitrary convex functions in terms of the saddle point suboptimality. As we have discussed previously, it is not possible to obtain similar rates in terms of the function suboptimality without further assumptions. We will now show that it is sufﬁcient to assume Lipschitz continuity on h to derive from the previous theorem a convergence rate in terms of the primal function suboptimality.

The following Corollary can be obtained by optimizing with respect to u the bound in the previous theorem and using the Lipschitz continuity to bound u0 u 2. This gives an O(1/t) convergence rate for the primal function suboptimality, roughly matching that of Davis & Yin (2015, Corollary D.5.2) for the non adaptive variant:

Corollary 1. Let h be βh-Lipschitz. Then, we have the following rate for the weighted ergodic iterate on the objective

P(x) def = f(x) + g(x) + h(x):

P(xt+1) P(x ) z0 x 2+ 2γ2 0( u0 2+ β2 h) 2st .

3.4 Linear convergence

In this subsection we assume that f is µf-strongly convex and h is Lh-smooth (with 0 < µf, 0 < Lh < + ). We denote by x the minimizer of the primal loss (unique by strong convexity of P) and by u the minimizer of the dual loss (also unique by strong convexity of D, consequence of the Lh-smoothness of h).

The convergence rates will be given in terms of the following quantities

ρ def = µf min{γ0, τ/Lf} , σ def = 1/(1 + γ0Lh)

ξ def = µf/(µf + Lh) . (16)

All these belong to the interval (0, 1). Assuming γ0 τ/Lf, then ρ is the inverse of f s condition number, while σ depends the smoothness of h. ξ is only used by variant 2 and is a less tight bound that σ that depends on both the strong convexity of f and smoothness of h. Note that by strong convexity, γ0 < 1/µf as otherwise the sufﬁcient decrease condition would not succeed and so σ ξ.

Theorem 3 (linear convergence rate). Let xt+1, ut+1 be the iterates produced by Algorithm 1 after t iterations. Then we have the following linear convergence for Variant 1 (V1) and Variant 2 (V2):

V1 : xt+1 x 2 1 min ρ, σ t+1 D0 (17)

V2 : xt+1 x 2 1 min ρ, ξ, 1

2 t+1 E0 , (18)

with D0 def = 6 z0 x 2 + 6 1 σ γ0(u0 u ) 2 and

E0 def = 6 z0 x 2 + 6 1 ξ γ0(u0 u ) 2.

Discussion. For γt = 1/Lf, the sufﬁcient decrease condition is always veriﬁed and the algorithm can be run with τ = 1. In this case, Variant 1 of Algorithm 1 defaults to TOS, and we can compare the obtained rates with those in (Davis & Yin, 2015).

While the sublinear convergence rate obtained in Corollary 1 roughly matches the rate obtained in (Davis & Yin (2015, Corollary D.5.2, see our Appendix C.5)), the linear convergence rates are instead signiﬁcantly different. The linear convergence rate obtained in (Davis & Yin, 2015, Theorem D.6.6), after optimizing for all parameters, yields a rate of ρσ2, which is strictly worse than the min{ρ, σ} rate that we obtained. This difference can be quite large, e.g., for ρ = σ this becomes ρ versus ρ3.

Finally, we note that the number of evaluations of the sufﬁcient decrease condition can be bounded as in (Nesterov et al., 2013, Lemma 4).

4 Applications

4.1 Learning with Multiple Penalties

In this subsection we discuss how some penalties with costly to compute proximal operator can be decomposed as a sum of proximal terms and so fall within the current framework. The exact expression of the proximal operators is given in Appendix D.

Group lasso with overlap. Jacob et al. (2009) generalized group 1 norm by allowing each variable to belong to more than one group, thereby introducing overlaps among groups and allowing for more complex prior knowledge on the structure. For a set of subindices G which we will call groups, this penalty is deﬁned as x G =

G G [x]G 2. If each coefﬁcient is at most in s groups, then G can be decomposed as G = G1 . . . Gs, where the Gi are disjoint. This allows to express the group lasso with overlap as a sum of s non-overlapping group lasso penalties, for which the proximal operator has a closed form expression.

Multidimensional total variation. For the task of image restoration and denoising it is common to consider a regularization term in the form of a total variation regularizer. For an image x of size p q, the 2-dimensional total variation norm X TV is deﬁned as

j=1 |Xi,j+1 Xi,j|

ji=1 |Xi+1,j Xi,j|

From here we recognize that g and h are fused lasso (also known as 1D-total variation) penalties acting on the columns and rows of X respectively. Efﬁcient methods to

Adaptive Three Operator Splitting

evaluate the proximal operator of the fused lasso penalty have been developed by Condat (2013a); Johnson (2013).

Isotonic and nearly isotonic penalties. In some applications there exists a natural ordering between variables: x1 x2 xp. This can be enforced through constraints, and the projection onto these is known as isotonic regression (Best & Chakravarti, 1990). The indicator function over the set of constraints can also be split into a sum of two proximal terms (see Appendix D.2) as

ı{x1 x2 x3 x4 } (19)

= ı{x1 x2; x3 x4; } =g(x)

+ ı{x2 x3; x4 x5; } =h(x)

In cases in which the variables are only mostly nondecreasing, the constraint can be relaxed via a nearlyisotonic penalty (Tibshirani et al., 2011) of the form p 1 i=1 max{xi xi+1, 0}, in which only the nonincreasing coefﬁcients are penalized. This penalty can be split the same way as the isotonic constraints above.

1 trend ﬁltering. This penalty is deﬁned by the absolute value of the second order differences and promotes piecewise-linear coefﬁcients (Kim et al., 2009). It is deﬁned as x TF def = p 2 i=1 |xi 2xi+1 + xi+2|. We can split this sum into 3 proximal terms such that the resulting terms: the j-th term contains the factors for which i is congruent to 3 modulo j.

Constraints over doubly stochastic matrices. Optimization problems with constraints on the set of doubly stochastic matrices appear in many convex relaxations of combinatorial problems such as seriation (Fogel et al., 2013), quadratic assignment (Lawler, 1963) and graph matching (Conte et al., 2004; Aﬂalo et al., 2015). The set of double stochastic matrices is composed of square matrices with nonnegative entries, each of whose rows and columns sum to 1, i.e., {XT 1 = 1, X1 = 1, X 0}. The indicator function over this set can be split as

ı{XT 1 = 1, X1 = 1} =g(X)

+ ı{X 0} =h(X)

and the projection onto both sets is available in closed form (Lu et al., 2016, 4.3).

Dispersive sparsity. In some applications it is desirable to encourage dispersion of the sparse coefﬁcients. This happens for example in the modeling of neural spiking, as the spikes are assumed to be spaced across time (Hegde et al., 2009). El Halabi & Cevher (2015) showed that this behavior can be promoted by considering a penalty of the form x 1 + ı{B|x| c} for a matrix B and some predeﬁned constant c, where |x| denotes the component-wise

absolute value. This penalty can be split into three proximal terms by the introduction of a dummy variable z, resulting in x 1 + ı{Bz c} + ı{z = |x|}.

Combination by addition. A popular method to promote the joint behavior of different penalties is by adding them. This has been used to successfully learn models with sparse and nonnegative coefﬁcients (Yuan & Lin, 2007), sparse and low rank matrices (Richard et al., 2012), sparse and piecewise constant (Gramfort et al., 2013), to name a few.

4.2 Benchmarks

In this subsection we provide an empirical evaluation of the proposed method. Due to space constraints we only give here a high level overview, deferring details as well as an extended set of experiments to Appendix E. We consider the following methods:

The proposed Adaptive TOS method (Algorithm 1), in its both variants.

The TOS method of Davis & Yin (2015), with step-sizes 1/Lf and 1.99/Lf (the method is convergent for stepsizes < 2/Lf).

The PDHG or Condat-V u algorithm (Condat, 2013b;

V u, 2013), with step-sizes τ and β/τ, where β was chosen as the one giving the best overall performance over the grid β = 0.9, 0.5, 0.1 (giving it a slight advantage).

The adaptive PDHG of Malitsky & Pock (2018), with step-size hyperparameter β chosen by the same technique as for PDHG.

The averaged operator line search method of Giselsson et al. (2016) combined with TOS, named TOS-AOLS.

We compared these methods on 4 different problems and show the results in Figure 1. In the ﬁrst row we show the benchmarks on a logistic regression problem with overlapping group lasso penalty that we apply to two text datasets (RCV1 and real-sim). Subﬁgures A and C were run with the regularization parameter chosen to give 50% of sparsity, while B, E are run with higher levels of sparsity, chosen to give 5% of sparsity.

In the second and third row we considered a battery of inverse problems with different penalties on synthetic datasets. These consists of a least squares (G, H, I, J) or logistic regression (rest) smooth term and 4 different penalties speciﬁed in the title of each plot (overlapping group lasso, total variation, trace norm 1 and nearly isotonic, see Appendix E for a precise formulation). For each problem, we show 2 different benchmarks, corresponding to the low and high regularization regimes (denoted low reg and high reg). We comment on a few trends from Fig. 1:

Adaptive Three Operator Splitting

Figure 1: Comparison of different proximal splitting methods. The top row gives result for two real datasets with an overlapping group lasso penalty. The second and third row show results on synthetic datasets for 4 different penalties: overlapping group lasso (E, F), 2-dimensional total variation (G, H), trace norm + 1 (I, J) and nearly isotonic (K, L). The Adaptive TOS (Variant 2, i.e., with growing step-sizes) is the best performing method on 10 out of 12 experiments, and roughly equivalent to the best performing method in the other 2 cases (G, J).

Best performing method. On 10 out of 12 experiments, the adaptive TOS algorithm (Variant 2) is the best performing method, and in the other cases (E, H) its performance is roughly the same as that of the best performing method. In contrast, on 3 instances (A, I, K) it is an order of magnitude faster than the next method.

Low vs high regularization regime. The advantage of the adaptive method is highly correlated with the amount of regularization: in the low regularization regime, on 3 out of 6 the adaptive TOS is an order of magnitude faster then the ﬁxed step-size method, while in the high regularization regime the difference shrinks and in the same problems is never more than a factor 2.6.

Uniform curvature. The problems (G, H, I, J) in Fig. 1

use as smooth term a quadratic loss (i.e., constant Hessian), while the other methods use a logistic loss (nonconstant Hessian). This suggests that the use of the adaptive step-size strategy (and in particular Variant 2 with its growing step-size) is more beneﬁcial for smooth terms with non-uniform curvature.

5 Conclusion and Future Work

We have presented and analyzed an adaptive step-size method to solve optimization problems consisting in a sum of a smooth term accessed through its gradient and two or more potentially non-smooth terms accessed through their proximal operator. The method does not rely on any stepsize hyperparameter (except for an initial estimate) and extensive empirical evaluation has showed computational gains on a variety of problems. We mention two possible extensions of this work.

First, existing convergence results fail to fully explain their surprisingly good empirical convergence. To the best of our knowledge, no work so far has derived linear convergence rates in absence of strong convexity and smoothness of one of the proximal terms for these methods (as is however empirically observed, see e.g. Figure 1).

Second, it is an open question whether this or other adaptive step-size methods can be accelerated, as is the case of proximal gradient descent, which admits the adaptive FISTA variant (Beck & Teboulle, 2009).

Adaptive Three Operator Splitting

Acknowledgements

FP would like to thank Wotao Yin, Walaa Moursi and Elvis Dohmatob for fruitful discussions, Michael Eickenberg for feedback on the manuscript and Patrick R. Johnstone for reporting a typo in an earlier version.

FP is funded through the European Union s Horizon 2020 research and innovation programme under the Marie Sklodorowska-Curie grant agreement 748900. Computing time on was donated by Amazon through the program AWS Cloud Credits for Research .

GG is funded by the Canada Excellence Research Chair in Data Science for Realtime Decision-making and by the NSERC Discovery Grant RGPIN-2017-06936.

Aﬂalo, Y., Bronstein, A., and Kimmel, R. On convex relaxation of graph isomorphism. Proceedings of the National Academy of Sciences, 2015.

Agarwal, A., Negahban, S., and Wainwright, M. J. Fast global convergence rates of gradient methods for highdimensional statistical recovery. In Advances in Neural Information Processing Systems, 2010.

Barbero, A. and Sra, S. Modular proximal optimization for multidimensional total-variation regularization. preprint ar Xiv:1411.0589, 2014.

Bauschke, H. H. and Combettes, P. L. Convex analysis and monotone operator theory in Hilbert spaces. Springer Science & Business Media, 2017.

Bauschke, H. H., Bot , R. I., Hare, W. L., and Moursi, W. M.

Attouch Th era duality revisited: paramonotonicity and operator splitting. Journal of Approximation Theory, 2012.

Beck, A. and Teboulle, M. Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications, 2009.

Bertsekas, D. P. Convex optimization algorithms. Athena Scientiﬁc Belmont, 2015.

Best, M. J. and Chakravarti, N. Active set algorithms for isotonic regression; A unifying framework. Mathematical Programming, 1990.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J.

Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011.

Cai, J.-F., Cand es, E. J., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 2010.

Cand es, E. J., Romberg, J., and Tao, T. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory, 2006.

Chambolle, A. and Pock, T. On the ergodic convergence rates of a ﬁrst-order primal dual algorithm. Mathematical Programming, 2015.

Chambolle, A. and Pock, T. An introduction to continuous optimization for imaging. Acta Numerica, 2016.

Combettes, P. L. and Pesquet, J.-C. Proximal splitting methods in signal processing. In Fixed-point algorithms for inverse problems in science and engineering. Springer, 2011.

Condat, L. A direct algorithm for 1D total variation denoising. IEEE Signal Processing Letters, 2013a.

Condat, L. A primal dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 2013b.

Conte, D., Foggia, P., Sansone, C., and Vento, M. Thirty years of graph matching in pattern recognition. International journal of pattern recognition and artiﬁcial intelligence, 2004.

Davis, D. and Yin, W. A three-operator splitting scheme and its optimization applications. preprint ar Xiv:1504.01032v1, 2015.

Davis, D. and Yin, W. A three-operator splitting scheme and its optimization applications. Set-Valued and Variational Analysis, 2017.

El Halabi, M. and Cevher, V. A totally unimodular view of structured sparsity. In Proceedings of the 18th International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), 2015.

Fogel, F., Jenatton, R., Bach, F., and d Aspremont, A. Convex relaxations for permutation problems. In Advances in Neural Information Processing Systems, 2013.

Gabay, D. and Mercier, B. A dual algorithm for the solution of nonlinear variational problems via ﬁnite element approximation. Computers & Mathematics with Applications, 1976.

Gidel, G., Jebara, T., and Lacoste-Julien, S. Frank-Wolfe Algorithms for Saddle Point Problems. In Proceedings of the 20th International Conference on Artiﬁcial Intelligence and Statistics, 2017.

Adaptive Three Operator Splitting

Giselsson, P., F alt, M., and Boyd, S. Line search for averaged operator iteration. In IEEE 55th Conference on Decision and Control (CDC). IEEE, 2016.

Glowinski, R. and Marroco, A. Sur l approximation, par el ements ﬁnis d ordre un, et la r esolution, par p enalisation-dualit e d une classe de probl emes de Dirichlet non lin eaires. Revue franc aise d automatique, informatique, recherche op erationnelle, 1975.

Gramfort, A., Thirion, B., and Varoquaux, G. Identifying predictive regions from f MRI with TV-L1 prior. In International Workshop on Pattern Recognition in Neuroimaging. IEEE, 2013.

Hegde, C., Duarte, M. F., and Cevher, V. Compressive sensing recovery of spike trains using a structured sparsity model. In SPARS 09-Signal Processing with Adaptive Sparse Structured Representations, 2009.

Hiriart-Urruty, J.-B. and Lemar echal, C. Convex analysis and minimization algorithms I: Fundamentals. Springer science & business media, 1993.

Iusem, A. N. On Some Properties of Generalized Proximal Point Methods for Variational Inequalities. Journal of Optimization Theory and Applications, 1998.

Jacob, L., Obozinski, G., and Vert, J.-P. Group lasso with overlap and graph lasso. In Proceedings of the 26th annual international conference on machine learning. ACM, 2009.

Johnson, N. A dynamic programming algorithm for the fused lasso and L0-segmentation. Journal of Computational and Graphical Statistics, 2013.

Kim, S.-J., Koh, K., Boyd, S., and Gorinevsky, D. 1 trend ﬁltering. SIAM review, 2009.

Lawler, E. L. The quadratic assignment problem. Management science, 1963.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of machine learning research, 2004.

Lions, P.-L. and Mercier, B. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 1979.

Lu, Y., Huang, K., and Liu, C.-L. A fast projected ﬁxedpoint algorithm for large graph matching. Pattern Recognition, 2016.

Malitsky, Y. and Pock, T. A ﬁrst-order primal-dual algorithm with linesearch. SIAM Journal on Optimization, 2018.

Nesterov, Y. et al. Gradient methods for minimizing composite objective function. Mathematical Programming, 2013.

Nocedal, J. and Wright, S. Numerical optimization. Springer Science & Business Media, 2006.

Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in optimization, 2013.

Pedregosa, F. C-OPT: composite optimization in Python. 2018. doi: 10.5281/zenodo.1283339. URL http:// openopt.github.io/copt/.

Raguet, H., Fadili, J., and Peyr e, G. A generalized forwardbackward splitting. SIAM Journal on Imaging Sciences, 2013.

Richard, E., Savalle, P.-a., and Vayatis, N. Estimation of Simultaneously Sparse and Low Rank Matrices. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Richt arik, P. and Tak aˇc, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2014.

Rockafellar, R. T. Convex analysis, 1997.

Rockafellar, R. T. and Wets, R. J.-B. Variational analysis. Springer Science & Business Media, 1998.

Rudin, L. I., Osher, S., and Fatemi, E. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 1992.

Tibshirani, R. J., Hoeﬂing, H., and Tibshirani, R. Nearlyisotonic regression. Technometrics, 2011.

V u, B. C. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Computational Mathematics, 2013.

Yan, M. A New Primal Dual Algorithm for Minimizing the Sum of Three Functions with a Linear Operator. Journal of Scientiﬁc Computing, 2018.

Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2006.

Yuan, M. and Lin, Y. On the non-negative garrotte estimator. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2007.