# Learning With Subquadratic Regularization: A Primal-Dual Approach

Raman Sankaran^{1,3}, Francis Bach^2 and Chiranjib Bhattacharyya^3
^1 LinkedIn, Bengaluru  ^2 INRIA - École Normale Supérieure - PSL Research University, Paris  ^3 Indian Institute of Science, Bengaluru
rsankaran@linkedin.com, francis.bach@inria.fr, chiru@iisc.ac.in
(The first author is currently affiliated with LinkedIn; the work was carried out while the affiliation was the Indian Institute of Science.)

Abstract. Subquadratic norms have recently been studied in the context of structured sparsity, where they have been shown to be more beneficial than conventional regularizers in applications such as image denoising, compressed sensing and banded covariance estimation. While existing works have been successful in learning structured sparse models such as trees and graphs, the associated optimization procedures have been inefficient because the proximal operators of these norms are hard to evaluate. In this paper, we study the computational aspects of learning with subquadratic norms in a general setup. Our main contributions are two proximal-operator based algorithms, ADMM-η and CP-η, which apply generically to these learning problems with convex loss functions and achieve a proven convergence rate of O(1/T) after T iterations. These algorithms are derived in a primal-dual framework, which had not been examined for subquadratic norms. We illustrate the efficiency of the developed algorithms in the context of tree-structured sparsity, where they comprehensively outperform relevant baselines.

1 Introduction

Structured sparse regularizers [Kyrillidis, 2016; Bach et al., 2011; Jenatton et al., 2011b; Micchelli et al., 2013] have emerged as efficient and versatile tools to add prior knowledge to estimation problems arising in domains such as computer vision [Mairal et al., 2014], bioinformatics [Obozinski et al., 2011] and neuro-imaging [Jenatton et al., 2011a], among many others. They typically lead to convex optimization problems of the form

    min_{w ∈ R^d}  F(Xw) + λ Ω(w),    (1)

where X ∈ R^{n×d} is the data matrix, w ↦ F(Xw) is the convex data-fitting term, the norm Ω is the regularizer, and λ > 0 is a balancing hyper-parameter. We refer to F : R^n → R as the loss function. Popular loss functions include the square loss used in least-squares regression [Tibshirani, 1994], the hinge loss used in support vector machines (SVM) [Vapnik, 2000], and the logistic loss used in logistic regression [Lee et al., 2006].

Many algorithms exist if the proximal operator of Ω (denoted prox_Ω) is known and efficient to compute [Bach et al., 2011; Parikh and Boyd, 2014]. Depending on the loss function F, we may use forward-backward splitting methods such as FISTA [Beck and Teboulle, 2009; Nesterov, 2007] when F is smooth; ADMM [Parikh and Boyd, 2014] when one knows the proximal operator of F ∘ X : w ↦ F(Xw); or the Chambolle-Pock algorithm [Chambolle and Pock, 2011] when only prox_F is known.

In this paper, we focus on norms Ω which are subquadratic [Bach et al., 2011], expressed as follows:

    Ω(w) = (1/2) min_{η ∈ R^d_+} ( Σ_{j=1}^d w_j^2/η_j + Γ(η) ),    (2)

where Γ : R^d_+ → R is convex and positively homogeneous. Note that the norm (2) includes as special cases popular examples such as the ℓ1-norm [Tibshirani, 1994] and the grouped ℓ_{1,p}-norms for p ∈ [1, 2] [Jenatton et al., 2011b; Kloft et al., 2011]. For many such norms of the form (2), neither Ω nor prox_Ω is easily evaluated, whereas prox_Γ may be computable with an existing efficient algorithm.
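As a quick sanity check of the variational form (2), the following minimal sketch (our illustration in Python/NumPy, not code from the paper) evaluates the right-hand side of (2) numerically for the choice Γ(η) = 1_d^⊤ η and compares it with ‖w‖₁; in this case the minimizing η_j is |w_j|. The helper name and the grid-based minimization are assumptions made purely for illustration.

```python
import numpy as np

def omega_variational(w, eta_grid):
    """Numerically evaluate the variational form (2):
    Omega(w) = 0.5 * min_{eta >= 0} sum_j w_j^2 / eta_j + Gamma(eta),
    by brute force over a per-coordinate grid.  This is only valid here
    because Gamma(eta) = sum_j eta_j is separable across coordinates."""
    value = 0.0
    for wj in w:
        # separable objective in eta_j: wj^2 / eta_j + eta_j
        vals = wj ** 2 / eta_grid + eta_grid
        value += 0.5 * vals.min()
    return value

w = np.array([1.5, -0.3, 0.0, 2.0])
eta_grid = np.linspace(1e-6, 10.0, 200000)
print(omega_variational(w, eta_grid))  # approximately 3.8
print(np.abs(w).sum())                 # 3.8, the l1 norm of w
```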
Examples include various norms derived from tree- and graph-structured constraints [Micchelli et al., 2013; Baldassarre et al., 2012], box-structured constraints [McDonald et al., 2016], and norms obtained as convex relaxations of combinatorial penalties [Obozinski and Bach, 2016; Sankaran et al., 2017], among many others. Thus, to solve (1) with Ω of the form (2), we may not be able to easily extend the aforementioned techniques, which rely on efficient computation of prox_Ω. Hence, we use the relation (2) and reformulate (1) into the following optimization problem in the variables w and η, which is the main focus of this paper:

    min_{w ∈ R^d, η ∈ R^d_+}  Ψ(w, η) := F(Xw) + (λ/2) Σ_{j=1}^d w_j^2/η_j + (λ/2) Γ(η).    (3)

To solve (3) when prox_{F∘X} and prox_Γ are easy to compute, one can devise an alternating optimization routine (minimizing with respect to w and η alternately until convergence). But this has convergence issues in general, because the objective function is not smooth around vectors with zero elements, and it does not offer a convergence rate. Alternatively, under the same assumptions, we can pose (3) as a smooth problem in η alone, which can be solved using FISTA, but this suffers from an expensive per-step computation. The only generic (i.e., with no strong assumptions on F) first-order algorithm is subgradient descent, which is very slow, with a rate of O(1/√T) after T iterations. Thus, we are in need of efficient algorithms to solve problem (3).
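To make the reformulation (3) concrete, here is a minimal sketch (an illustration, not code from the paper) of Ψ(w, η) for the squared loss F(z) = (1/2n)‖z − y‖² and the simple choice Γ(η) = 1_d^⊤ η; minimizing over η in closed form (η_j = |w_j| for this Γ) recovers the ℓ1-regularized objective (1). The function names are hypothetical.

```python
import numpy as np

def psi(w, eta, X, y, lam):
    """Objective Psi(w, eta) of (3) for the squared loss and Gamma(eta) = sum(eta).
    Coordinates with eta_j = 0 contribute 0 here; strictly they are +inf
    unless w_j = 0 as well."""
    n = X.shape[0]
    loss = 0.5 / n * np.sum((X @ w - y) ** 2)
    ratio = np.divide(w ** 2, eta, out=np.zeros_like(w), where=eta > 0)
    return loss + 0.5 * lam * ratio.sum() + 0.5 * lam * eta.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w = rng.standard_normal(10)
y = X @ w + 0.1 * rng.standard_normal(50)
lam = 0.1

eta_star = np.abs(w)   # minimizer of (3) over eta for this Gamma
obj_l1 = 0.5 / 50 * np.sum((X @ w - y) ** 2) + lam * np.abs(w).sum()
print(np.isclose(psi(w, eta_star, X, y, lam), obj_l1))   # True
```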
We make the following contributions:
- We investigate primal-dual algorithms for the problem (3). Though it is complicated to derive saddle-point problems directly from (3), we propose a conic reformulation of problem (3) in Section 5.1, which opens the possibility of applying efficient primal-dual procedures. Moreover, the derived cone constraints are separable over the sets of variables involved and have closed-form projections, which are used by the primal-dual algorithms.
- When prox_{F∘X} and prox_Γ are easy to compute, we propose in Section 5.2 a primal-dual algorithm ADMM-η to solve problem (3), which converges at the rate of O(1/T) in T iterations (see Theorem 1). To achieve an ϵ-approximate solution with respect to the duality gap, ADMM-η takes O((d + c_{F∘X} + c_Γ)/ϵ) operations, where c_{F∘X} and c_Γ denote the complexity of evaluating the proximal operators of F∘X and Γ respectively. In general, c_{F∘X} may be high because of the dependence on X; however, for popular loss functions such as the squared loss and the hinge loss, this computation turns out to be simple (see Table 1).
- When only prox_F and prox_Γ are easy to compute, we propose a generic first-order primal-dual algorithm CP-η in Section 5.3, which also converges at a rate of O(1/T) (see Theorem 2). To guarantee an ϵ-approximate solution, CP-η takes O((nd + c_F + c_Γ)/ϵ) operations, where c_F denotes the complexity of evaluating the proximal operator of F. In situations where c_{F∘X} is high, CP-η serves as an effective alternative to ADMM-η, while also being the most generic and efficient solution for (3).
- In Section 6, we consider an example of a subquadratic norm (see Example 1) which encourages tree sparsity. As studied in [Obozinski and Bach, 2016], prox_Γ can be evaluated in O(d log d) time, whereas prox_Ω requires O(d²) computations. We illustrate that ADMM-η and CP-η outperform existing benchmarks on this chosen example.

Notations. 1_d (resp. 0_d) denotes the vector in R^d with all entries equal to 1 (resp. 0). Given z ∈ R^d, we denote by D(z) the diagonal matrix with z_i in the (i, i)-th entry, and define I_d = D(1_d). Given a set A, define I_A(a) = 0 if a ∈ A, and +∞ otherwise. Given f : R^n → R, the Fenchel dual of f is defined as f*(α) = sup_{x ∈ R^n} x^⊤α − f(x). The proximal operator of a function f at z is defined as prox_f(z) = argmin_x (1/2)‖x − z‖² + f(x), and we write prox^τ_f(z) = prox_{τf}(z). We define prox_{F∘X}(z) = argmin_w (1/2)‖w − z‖² + F(Xw). Following [Nesterov, 2013], we say that a function f is L-smooth if, for all x, y ∈ R^n, f(y) ≤ f(x) + ∇f(x)^⊤(y − x) + (L/2)‖y − x‖², and µ-strongly convex if, for all x, y ∈ R^n, f(y) ≥ f(x) + ∇f(x)^⊤(y − x) + (µ/2)‖y − x‖².

2 Subquadratic Norms

Many existing studies have illustrated the benefit of using subquadratic norms for structured sparsity in many applications: compressed sensing and image denoising [Baldassarre et al., 2012], multi-task learning [McDonald et al., 2016], and banded covariance matrix estimation [Yan and Bien, 2015]. Next, we discuss specific examples where prox_Γ is easy to compute.

Example 1. Tree-Structured H-norms [Baldassarre et al., 2012; Micchelli et al., 2013]. Consider a rooted tree with d nodes represented by the edge matrix A ∈ R^{d×d}, where A_{ei} = 1 and A_{ej} = −1 if the e-th edge links node i to its child j. Define H = {η ∈ R^d_+ | Aη ∈ R^d_+}, and let Γ(η) = η^⊤1_d + I_H(η). With this choice of Γ, let us denote the norm (2) as Ω_H. By enforcing the tree-structured prior on η, Ω_H thus encourages w to be tree-structured. The proximal operator of Ω_H may be computed approximately using Picard iterates [Baldassarre et al., 2012], which also lack convergence rates. In contrast, we can compute prox_Γ easily, since the projection onto H is obtained through a pool-adjacent-violators (PAV) algorithm [Pardalos and Xue, 1999; Best and Chakravarti, 1990] in O(d log d) time.

Example 2. Convex Relaxation of Combinatorial Penalties [Bach, 2010; Bach, 2011b; Obozinski and Bach, 2016]. Denote V = {1, ..., d}. Given a tree-structured ordering over V, define the groups G_j = {j} ∪ D_j, where D_j denotes the descendants of j, and let G = {G_j}_{j ∈ V}. We may arrive at a subquadratic regularizer which encourages tree structures through the steps discussed next. Consider the set function S(A) = Σ_{G_j ∈ G} 1_{{G_j ∩ A ≠ ∅}}, which counts the groups intersecting A. S is submodular [Bach, 2011a], and its corresponding Lovász extension, denoted Γ(η), equals Σ_j ‖η_{G_j}‖_∞, where η_{G_j} is the restriction of η to the coordinates given by G_j. With this choice of Γ, we denote the norm (2) as Ω_2^S(w). prox_Ω is more expensive in this case: it can be computed using a divide-and-conquer strategy involving a sequence of submodular function minimizations (SFM), whose complexity is d times that of a single SFM. In contrast, prox_Γ can be computed in time O(dh), where h is the depth of the tree [Jenatton et al., 2011b].

Comparing Examples 1 and 2, note that Ω_2^S(w) = Ω_H(w) [Obozinski and Bach, 2016], illustrating that Γ need not be unique for a given subquadratic norm Ω. While these existing works on subquadratic norms illustrate the benefits in terms of applicability, they do not focus on computational efficiency, which is the focus of this paper.

3 Existing Algorithms

FISTA-η. Defining a smooth function J(η) as

    J(η) = inf_{w ∈ R^d} F(Xw) + (λ/2) Σ_{j=1}^d w_j^2/η_j,    (4)

problem (3) is equivalent to min_{η ≥ 0} J(η) + (λ/2)Γ(η), on which we can use (accelerated) proximal gradient descent [Beck and Teboulle, 2009] with J as the smooth component and (λ/2)Γ(η) as the non-smooth component, whose proximal operator will be needed. The key is the computation of the gradient of J, for which we solve (4). With the proper change of variable, the minimizer of (4) is given by w*(η) = D(η)^{1/2} prox^{1/λ}_{F∘XD(η)^{1/2}}(0).
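For the squared loss F(z) = (1/2n)‖z − y‖², the inner problem (4) can be solved directly, and the gradient of J follows from the envelope (Danskin) theorem: ∇J(η)_j = −(λ/2) w*_j(η)²/η_j². The sketch below is our illustration (the solver and names are not from the paper); it also makes visible the expensive per-step computation, since a different d × d linear system must be solved for every new η.

```python
import numpy as np

def fista_eta_inner(X, y, eta, lam):
    """Solve (4) for the squared loss F(z) = ||z - y||^2 / (2n):
    returns w*(eta), J(eta) and grad J(eta).
    Requires eta > 0 coordinate-wise; a different d x d linear system
    has to be solved for every new eta."""
    n, d = X.shape
    A = X.T @ X / n + lam * np.diag(1.0 / eta)      # Hessian of the inner problem
    w_star = np.linalg.solve(A, X.T @ y / n)        # minimizer of (4)
    J = 0.5 / n * np.sum((X @ w_star - y) ** 2) + 0.5 * lam * np.sum(w_star ** 2 / eta)
    grad_J = -0.5 * lam * w_star ** 2 / eta ** 2    # envelope (Danskin) theorem
    return w_star, J, grad_J
```

FISTA-η then takes a proximal gradient step on η with the prox of (λ/2)Γ; the O(d³) solve per gradient evaluation is consistent with the d³ term charged to FISTA-η in Table 1 below.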
When the Lipschitz constant of the gradient of J is not known, backtracking is used to find the step size [Scheinberg et al., 2014], which increases the number of times the gradient of J is calculated. We denote this algorithm as FISTA-η. Note that the per-iteration cost may outweigh the benefit of the fast O(1/T²) convergence rate of FISTA-η.

Alt-η. A simple algorithm is to alternately solve for w and η at each iteration. Given η, solving for w is an ℓ2-regularized problem on w, whose solution is given by w = D(η)^{1/2} prox^{1/λ}_{F∘XD(η)^{1/2}}(0). Given w, solving for η is equivalent to evaluating the norm Ω(w) and returning the minimizing η. We refer to this algorithm as Alt-η. Though it lacks convergence guarantees and has issues with convergence when η has components close to 0, this algorithm is used in practice with good performance. It is recommended to initialize η far from 0, or to make sure that η stays away from 0 by adding a penalty ε Σ_{j=1}^d η_j^{-1} [Canu and Grandvalet, 1999; Daubechies et al., 2010; Argyriou et al., 2008].

4 Primal-Dual Algorithms

Before discussing our proposed algorithms for solving (3), we briefly describe the generic primal-dual setup we work with in this paper. Following [Chambolle and Pock, 2011], for finite-dimensional vector spaces U and V, we consider a generic primal problem of the form

    min_{u ∈ U}  ψ(u) := H(Ku) + G(u),    (5)

where K : U → V is a linear operator from U to V with operator norm ‖K‖ = max{‖Ku‖ : u ∈ U, ‖u‖ ≤ 1}, and H : V → R ∪ {+∞} and G : U → R ∪ {+∞} are proper convex functions whose proximal operators are easily computable. This leads to the following saddle-point problem:

    min_{u ∈ U} max_{v ∈ V}  Υ(u, v) := ⟨v, Ku⟩ + G(u) − H*(v).    (6)

Additionally, when H is (1/γ)-smooth, H* is γ-strongly convex. The primal-dual algorithm (CP) of [Chambolle and Pock, 2011] for (6) is given in Algorithm 1, which guarantees that there exist positive constants R1, R2 such that after T steps the following bounds hold:

    min_{i=1,...,T} ψ(u^(i)) − ψ(u*) ≤ R1/T,    (7)
    ψ(u^(T)) − ψ(u*) ≤ R2/(γ²T²),    (8)

the second inequality being valid when H* is γ-strongly convex.

Algorithm 1 CP [Chambolle and Pock, 2011]
Require: K, prox_G, prox_{H*}, u^(0), v^(0), T
1: Choose τ^(0), σ^(0) > 0 such that τ^(0) σ^(0) ‖K‖² ≤ 1.
2: Initialize ū^(0) = u^(0).
3: for k = 0, 1, 2, ..., T − 1 do
4:   v^(k+1) = prox_{σ^(k) H*}(v^(k) + σ^(k) K ū^(k))
5:   u^(k+1) = prox_{τ^(k) G}(u^(k) − τ^(k) K^⊤ v^(k+1))
6:   θ^(k) = 1/√(1 + 2γτ^(k)),  τ^(k+1) = θ^(k) τ^(k),  σ^(k+1) = σ^(k)/θ^(k)
7:   ū^(k+1) = u^(k+1) + θ^(k) (u^(k+1) − u^(k))
8: end for
9: Return (u^(1:T), v^(1:T)).

The key features of Algorithm 1 are as follows:
1. It requires as input prox_G and prox_{H*}, and initial primal and dual step sizes τ^(0), σ^(0) satisfying τ^(0) σ^(0) ‖K‖² ≤ 1.
2. It accesses K only through matrix-vector products, and is hence strictly first-order in nature.
3. The cost per iteration is c_u + c_v + cost(prox_G) + cost(prox_{H*}), where c_u and c_v are the costs of computing Ku and K^⊤v respectively, for vectors u ∈ U and v ∈ V.
4. It coincides with ADMM when K = I.
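As an illustration (not code from the paper), the following minimal NumPy sketch implements the unaccelerated case of Algorithm 1 (γ = 0, so θ^(k) = 1 and the step sizes stay fixed). Here `prox_g` and `prox_hstar` are assumed callables for prox_{τG} and prox_{σH*}, and `K` is a dense matrix standing in for the linear operator; the toy problem at the end is only there to exercise the iteration.

```python
import numpy as np

def chambolle_pock(K, prox_g, prox_hstar, u0, v0, T, tau, sigma):
    """Unaccelerated Chambolle-Pock iteration (Algorithm 1 with gamma = 0).
    prox_g(z, tau) ~ prox_{tau G}(z); prox_hstar(z, sigma) ~ prox_{sigma H*}(z).
    Requires tau * sigma * ||K||^2 <= 1."""
    u, v, u_bar = u0.copy(), v0.copy(), u0.copy()
    for _ in range(T):
        v = prox_hstar(v + sigma * (K @ u_bar), sigma)   # dual ascent step
        u_new = prox_g(u - tau * (K.T @ v), tau)         # primal descent step
        u_bar = 2 * u_new - u                            # extrapolation (theta = 1)
        u = u_new
    return u, v

# Toy example: min_u 0.5*||K u - b||^2 + mu*||u||_1,
# i.e. H(z) = 0.5*||z - b||^2 and G(u) = mu*||u||_1 in (5).
rng = np.random.default_rng(0)
K = rng.standard_normal((30, 20)); b = rng.standard_normal(30); mu = 0.1
prox_hstar = lambda z, s: (z - s * b) / (1.0 + s)                      # prox of s*H*
prox_g = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t * mu, 0)   # soft-threshold
L = np.linalg.norm(K, 2)
u, v = chambolle_pock(K, prox_g, prox_hstar, np.zeros(20), np.zeros(30), 500, 1/L, 1/L)
```

The accelerated variant would additionally update θ, τ and σ as in step 6 of Algorithm 1 when γ > 0.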
4.1 Step-Size Selection

Given (u, v) ∈ U × V and referring to Theorem 1 of [Chambolle and Pock, 2011], when γ = 0 the following bound holds for all σ^(0) = σ, τ^(0) = τ satisfying στ‖K‖² ≤ 1:

    Υ(ū^(T), v) − Υ(u, v̄^(T)) ≤ R̂_{σ,τ}(u, v)/T,  with    (9)
    R̂_{σ,τ}(u, v) = ‖u − u^(0)‖²₂/(2τ) + ‖v − v^(0)‖²₂/(2σ),    (10)
    ū^(T) = (1/T) Σ_{t=1}^T u^(t),  v̄^(T) = (1/T) Σ_{t=1}^T v^(t).    (11)

Let (u*, v*) be an optimal solution of (6). Using the constraint στ‖K‖² ≤ 1, we optimize the upper bound in (9) for the choice (u, v) = (u*, v*), resulting in the value σ* = (1/‖K‖) · ‖v* − v^(0)‖₂ / ‖u* − u^(0)‖₂. Since it is not possible to calculate σ* in practice, because we do not have access to u* or v*, we compute rough estimates using the information we have. In the experiments, we derive the step sizes for the squared loss F(z) = (1/2n)‖z − y‖²₂ and the ℓ1 norm Γ(η) = η^⊤1_d. The obtained step sizes work well enough in our experiments, but they could be improved by additional tuning.

5 Primal-Dual Formulation for Subquadratic Norms

We first derive a conic reformulation of (3), which gets rid of the ratio terms w_j²/(2η_j), enabling the application of Algorithm 1.

5.1 Conic-Constrained Primal Reformulation

Note that the term w_j²/(2η_j) may be written as w_j²/(2η_j) = min_{t_j} t_j such that t_j ≥ 0 and w_j² ≤ 2 t_j η_j. Now, Eq. (3) is equivalent to

    min_{w, η, t ∈ R^d}  F(Xw) + (λ/2) Γ(η) + λ 1_d^⊤ t + Σ_{j=1}^d I_C(w_j, η_j, t_j),    (12)

where C = {(a, b, c) ∈ R³ : a² ≤ 2bc, b ≥ 0, c ≥ 0} is the rotated second-order cone. Note that the cone constraint is separable across the triplets of variables (w_j, η_j, t_j) and thus can be handled separately. The cone C is self-dual, and the proximal operator of Σ_{j=1}^d I_C(w_j, η_j, t_j), which will be the key to designing the primal-dual algorithms, amounts to computing the orthogonal projection onto C, which has a simple closed-form expression. Thus, (12) makes it easy to derive saddle-point problems of the form (6), which can be efficiently solved using Algorithm 1. In the next subsections, we consider reformulations of (12) under two cases: (a) prox_{F∘X} is easy to compute, and (b) only prox_F is easy to compute.

5.2 ADMM-η: When prox_{F∘X} is Computable

When prox_{F∘X} is easy to compute, we propose a primal-dual algorithm ADMM-η to solve (12). Algorithm 2 lists the steps; it takes prox_Γ and prox_{F∘X} as inputs.

Algorithm 2 ADMM-η
Require: X, prox_Γ, prox_{F∘X}, k
1: Initialize u^(0), v^(0) ∈ R^{3d}, K = I_{3d} (cf. (15)).
2: Define functions G, H* as in (16).
3: (u^(k), v^(k)) = CP(K, prox_G, prox_{H*}, u^(0), v^(0), k)
4: w^(k) = u^(k)(1 : d), η^(k) = u^(k)(d + 1 : 2d).
5: Return w^(k), η^(k).

The following result proves that Algorithm 2 achieves a rate of O(1/T) for the problem (3).

Theorem 1. Let F : R^n → R and Γ : R^d_+ → R be convex, let Ψ(w, η) be as defined in (3) with (w*, η*) = argmin_{w,η} Ψ(w, η), and let u^(k) = (w^(k); η^(k); t^(k)), v^(k) = (δ^(k); β^(k); ν^(k)), k ≥ 0, be defined as in Algorithm 2. Then there exists a constant R > 0 such that after T iterations,

    min_{i=1,...,T} Ψ(w^(i), η^(i)) − Ψ(w*, η*) ≤ R/T.    (13)

Proof. (Sketch) First, we rewrite the cone constraint I_C in (12) in terms of its Fenchel dual, which again is a similar cone constraint on the dual variables, arriving at the following saddle-point problem equivalent to (12):

    min_{w,η,t ∈ R^d} max_{δ,β,ν ∈ R^d}  F(Xw) + (λ/2)Γ(η) + λ1_d^⊤ t − Σ_{j=1}^d I_C(δ_j, β_j, ν_j) + Σ_{j=1}^d ⟨(δ_j, β_j, ν_j), (w_j, η_j, t_j)⟩.    (14)

We can identify the above problem with (6) through the following mapping of the variables u, v and functions G, H*:

    u = (w; η; t),  v = (δ; β; ν),  K = I_{3d},    (15)
    G(u) = F(Xw) + (λ/2)Γ(η) + λ1_d^⊤ t,  H*(v) = Σ_{j=1}^d I_C(δ_j, β_j, ν_j).    (16)

Since the function G is separable over the blocks of primal variables w, η, t, prox_G simply reduces to computing the proximal operators independently on each block; prox_G may be computed in a straightforward manner using prox_{F∘X} and prox_Γ, while prox_{H*} reduces to projections onto C. The cost of evaluating prox_G is usually dominated by the complexity of computing prox_{F∘X}. Now result (7) applies, and since at optimality problems (12) and (3) have the same objective value, we obtain the result.
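The orthogonal projection onto the rotated cone C, which gives prox_{H*} in (16) (and reappears in CP-η below), can be written down explicitly by rotating C into the standard second-order cone. The following sketch is our illustration of that closed form, not code from the paper; the function name is hypothetical.

```python
import numpy as np

def project_rotated_soc(a, b, c):
    """Orthogonal projection of (a, b, c) onto
    C = {(a, b, c) : a^2 <= 2*b*c, b >= 0, c >= 0}."""
    # Rotate so that C becomes the standard second-order cone
    # {(x1, x2, s) : sqrt(x1^2 + x2^2) <= s}, with
    # s = (b + c)/sqrt(2) and t = (b - c)/sqrt(2).
    s = (b + c) / np.sqrt(2.0)
    t = (b - c) / np.sqrt(2.0)
    r = np.hypot(a, t)                 # norm of the (a, t) part
    if r <= s:                         # already inside the cone
        a_p, t_p, s_p = a, t, s
    elif r <= -s:                      # inside the polar cone: project to the origin
        a_p, t_p, s_p = 0.0, 0.0, 0.0
    else:                              # project onto the boundary
        coef = (r + s) / 2.0
        a_p, t_p, s_p = coef * a / r, coef * t / r, coef
    # Rotate back to (a, b, c) coordinates.
    b_p = (s_p + t_p) / np.sqrt(2.0)
    c_p = (s_p - t_p) / np.sqrt(2.0)
    return a_p, b_p, c_p
```

Applying this projection independently to each triplet (δ_j, β_j, ν_j) gives prox_{σH*} for (16) at O(d) cost per iteration.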
Discussion. We can make the following remarks:
1. The per-step cost of ADMM-η is dominated by prox_{F∘X}. For the square loss F(z) = (1/2n)‖z − y‖²₂, it requires solving a linear system whose matrix is the same at every iteration; hence, after a single O(d³) operation at the start of the algorithm, each evaluation can be done in O(d²) time. For the hinge loss F(z) = (1/n) Σ_{i=1}^n max(0, 1 − z_i y_i), [W. Kienzle, 2006] proposed an SMO-like algorithm with O(dn²) operations.
2. The assumption made by ADMM-η (regarding prox_{F∘X}) can be compared with those of Alt-η and FISTA-η, both of which require prox_{F∘X̂} with X̂ = X D(η)^{1/2}. Since the matrix X̂ changes at every iteration, computing prox_{F∘X̂} is more difficult than prox_{F∘X}. For instance, for the square loss, one needs to solve a different linear system at each iteration, which amounts to O(d³) operations.
3. The step-size selection scheme described in Section 4.1 leads to explicit choices of σ and τ for ADMM-η (the expression for σ involves ‖X‖²).

The next section discusses a generic first-order algorithm for (12) when prox_{F∘X} is difficult to compute.

5.3 CP-η: A Generic First-Order Algorithm for (3)

While prox_{F∘X} may be difficult to obtain, prox_F may be easier to compute (this is the case for all common loss functions). We derive a first-order procedure denoted CP-η (Algorithm 3) with a convergence guarantee of O(1/T) for all losses and norms for which prox_F and prox_Γ are easy to compute.

Algorithm 3 CP-η
Require: X, prox_Γ, prox_F, k
1: Initialize u^(0) ∈ R^{3d}, v^(0) ∈ R^{n+3d} (cf. (20)).
2: Initialize r = ‖X‖.
3: Define functions G, H* as in (21).
4: K = [X, 0_{n×2d}; r I_{3d}]
5: (u^(k), v^(k)) = CP(K, prox_G, prox_{H*}, u^(0), v^(0), k)
6: w^(k) = u^(k)(1 : d), η^(k) = u^(k)(d + 1 : 2d).
7: Return w^(k), η^(k).

Theorem 2. Let Ψ(w, η) be as defined in (3) with (w*, η*) = argmin_{w,η} Ψ(w, η), and let u^(k) = (w^(k); η^(k); t^(k)), v^(k) = (α^(k); δ^(k); β^(k); ν^(k)) and K be defined as in Algorithm 3. Then, for all λ > 0, there exists a constant R > 0 such that after T iterations,

    min_{i=1,...,T} Ψ(w^(i), η^(i)) − Ψ(w*, η*) ≤ R/T.    (17)

Proof. (Sketch) Similarly to how (6) is obtained from (5) through the Fenchel dual of the loss function, we apply the same trick to (12) to get the following equivalent problem:

    min_{w,η,t ∈ R^d} max_{α ∈ R^n}  α^⊤Xw − F*(α) + (λ/2)Γ(η) + λ1_d^⊤ t + Σ_{j=1}^d I_C(w_j, η_j, t_j).    (18)

For problem (18), we now use Fenchel duality on the constraint I_C and arrive at the following problem:

    min_{w,η,t ∈ R^d} max_{α ∈ R^n, δ,β,ν ∈ R^d}  α^⊤Xw − F*(α) + (λ/2)Γ(η) + λ1_d^⊤ t + Σ_{j=1}^d ( ⟨(δ_j, β_j, ν_j), (w_j, η_j, t_j)⟩ − I_C(δ_j, β_j, ν_j) ).    (19)

The mapping of the variables in this formulation to Eq. (6) is given by

    u = (w; η; t),  v = (α; δ; β; ν),  K = [X, 0_{n×2d}; r I_{3d}],  G(u) = (λ/2)Γ(η) + λ1_d^⊤ t,    (20)
    H*(v) = F*(α) + Σ_{j=1}^d I_C(δ_j, β_j, ν_j).    (21)

Here, we have introduced the constant r to balance the scale of α against the cone dual variables (δ, β, ν), with the choice r = ‖X‖ in the experiments, since ‖K‖² = ‖X^⊤X + r²I‖ ≤ ‖X‖² + r². Algorithm 1 applies, and we get Algorithm 3 to solve problem (12). We see that G is a separable function of its variables w, η and t, and H* is likewise separable in α and the triplets (δ_j, β_j, ν_j) for all j. This makes it easy to compute the proximal operators needed for G and H*. Now the result (7) applies, and we get the result.
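In CP-η, prox_{σH*} thus splits into prox_{σF*} on the α block and the per-triplet cone projections of the previous sketch on (δ_j, β_j, ν_j). The α block can be handled through the Moreau decomposition prox_{σF*}(z) = z − σ prox_{F/σ}(z/σ), so only prox_F is ever needed. Below is a hedged sketch for the hinge loss F(z) = (1/n) Σ_i max(0, 1 − y_i z_i) used later in the experiments; the closed-form prox is standard, but the function names and the usage line are our own assumptions.

```python
import numpy as np

def prox_hinge(z, tau, y):
    """prox of tau * F where F(z) = (1/n) * sum_i max(0, 1 - y_i z_i), y_i in {-1,+1}.
    Coordinate-wise closed form."""
    c = tau / len(z)
    m = y * z                                   # margins y_i * z_i
    return np.where(m >= 1.0, z,                # beyond the margin: untouched
           np.where(m <= 1.0 - c, z + c * y,    # far inside: move by c towards the margin
                    y))                         # otherwise land exactly on the margin

def prox_sigma_fstar(z, sigma, prox_f):
    """Moreau decomposition: prox_{sigma F*}(z) = z - sigma * prox_{F/sigma}(z/sigma)."""
    return z - sigma * prox_f(z / sigma, 1.0 / sigma)

# Hypothetical usage inside the dual step of Algorithm 1 (alpha block only):
# alpha = prox_sigma_fstar(alpha + sigma * (X @ w_bar), sigma,
#                          lambda v, t: prox_hinge(v, t, y))
```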
Discussion. We can make the following remarks:
1. Algorithm 3 is the first O(1/T) efficient first-order algorithm for (3), since it accesses X only through matrix-vector multiplications.
2. One may also identify problems (18) and (6) with the following mapping of u, v, G, H*, for which CP achieves an O(1/T²) convergence rate when F is smooth (see (8)): u = (w; η; t), v = α, K = [X, 0_{n×2d}], H*(v) = F*(α), and G(u) = (λ/2)Γ(η) + λ1_d^⊤ t + Σ_{j=1}^d I_C(w_j, η_j, t_j). But prox_G is then not easy to compute except for simple norms such as the ℓ1 or grouped-ℓ1 norms, making that variant impractical for general norms.
3. We derive the following step-size choices for CP-η: σ = 1/n and τ = n/(2‖X‖²).

Algorithm | F_sq(z)            | F_H(z)
FISTA-η   | (d³ + nd)/√ϵ       | (dn² + d log d)/√ϵ
ADMM-η    | d²/ϵ               | (dn² + d log d)/ϵ
CP-η      | (nd + d log d)/ϵ   | (nd + d log d)/ϵ

Table 1: Number of operations (in big-O notation) needed to guarantee a duality gap of ϵ for (3), with Ω_H as in Example 1. F_sq(z) = (1/2)‖z − y‖²₂ is the square loss and F_H(z) = Σ_{i=1}^n max(1 − y_i z_i, 0) is the hinge loss.

5.4 General Discussion

We can differentiate the various algorithms for problem (3) as follows:
- Between Alt-η and FISTA-η, the former does not have any known convergence rate, whereas the latter has a convergence rate of O(1/T²). But the per-step cost of FISTA-η is much higher because of backtracking. As seen in the experiments, the backtracking cost is quite high, which makes FISTA-η impractical for general losses and norms.
- As discussed before, ADMM-η is similar to FISTA-η and Alt-η, since all of these assume easy computability of prox_{F∘X̃}, with X̃ = X for ADMM-η and X̃ = X D(η)^{1/2} for the other two algorithms. This subtle difference does make an impact, especially for losses like the squared loss, for which the proximal operator needed by ADMM-η is easier to obtain than those needed by FISTA-η and Alt-η.
- When prox_{F∘X} is not efficient to compute, the only choice we have is CP-η, which works with all norms and losses for which prox_F and prox_Γ are easy to compute.
- When prox_{F∘X} is easy to compute, the choice of algorithm depends on the cost of prox_Γ: the cheaper the prox_Γ computation, the better CP-η performs relative to ADMM-η. This is because, as observed experimentally, ADMM-η takes far fewer iterations than CP-η, and a costly prox_Γ makes CP-η run longer.
- ADMM-η is observed to be less sensitive to the conditioning of X. This may be because the inner CP routine does not depend on X explicitly, X being handled by prox_{F∘X} as a black box; the dependence on X within the CP routine comes only through the step sizes τ and σ. This makes ADMM-η preferable in ill-conditioned cases.

6 Experiments

To illustrate the efficiency of CP-η and ADMM-η over existing algorithms, we choose the aforementioned tree-sparsity-inducing norm Ω_H (Example 1), which is popular in wavelet coefficient estimation. As discussed earlier, Ω_2^S(w) = Ω_H(w), but the primal-dual procedures are different because of the difference in the function Γ. Since the time complexity of computing prox_Γ is identical in both cases (O(d log d)), we choose only Ω_H for the simulations. We include the following solvers: (1) FISTA-η; (2) Alt-η; (3) FISTA on the primal (1) (the non-accelerated variant ISTA was very slow in experiments and hence excluded from the results), with prox_Ω computed using (a) the decomposition algorithm (O(d²)) given in [Obozinski and Bach, 2016], denoted FISTA-w, and (b) Picard iterations [Baldassarre et al., 2012], denoted FISTA-w(p).

Setup. We perform numerical simulations (conducted on an Ubuntu PC with a Core i7 processor and 8 GB RAM) on synthetic data. Following [Bach et al., 2011], we generate X ∈ R^{n×d} with X_ij ~ N(0, 1). For the ground-truth model w*, we assume a tree structure with a uniform branching factor of 4 and depth k. We generate w* uniformly at random from [0, 1]^d and set s = 0.75d randomly chosen indices to 0 while respecting the tree structure. The labels are generated as y = Xw* + ξ for the squared-loss examples and y = sign(Xw* + ξ) for the hinge-loss examples, where ξ is standard Gaussian noise. We fix n = 1000, d = 15000, λ = 0.01, and the convergence criterion is the relative duality gap with threshold ϵ = 10^-4. A sketch of this data-generation procedure is given below.
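The following sketch is our illustration of the setup just described, with reduced dimensions and hypothetical helper names. In particular, it assumes that "respecting the tree structure" means the support of w* is a rooted subtree, i.e., zeroing a node also zeroes all of its descendants.

```python
import numpy as np

def complete_tree(depth, branch=4):
    """Parent pointers of a complete rooted tree (root = node 0), in BFS order."""
    parent, nodes, next_id = [-1], [0], 1
    for _ in range(depth):
        new_nodes = []
        for p in nodes:
            for _ in range(branch):
                parent.append(p)
                new_nodes.append(next_id)
                next_id += 1
        nodes = new_nodes
    return np.array(parent)

def tree_sparse_ground_truth(parent, sparsity=0.75, rng=None):
    """w* uniform on [0, 1], with at least ~sparsity fraction of nodes zeroed so
    that the support stays a rooted subtree (zeros propagate to descendants)."""
    if rng is None:
        rng = np.random.default_rng(0)
    d = len(parent)
    w = rng.uniform(0.0, 1.0, size=d)
    zero = np.zeros(d, dtype=bool)
    for j in rng.permutation(np.arange(1, d)):   # never zero the root here
        if zero.sum() >= sparsity * d:
            break
        zero[j] = True
    for j in range(1, d):                        # parent ids precede child ids (BFS)
        if zero[parent[j]]:
            zero[j] = True
    w[zero] = 0.0
    return w

rng = np.random.default_rng(0)
parent = complete_tree(depth=4, branch=4)        # d = 1 + 4 + 16 + 64 + 256 = 341
d, n = len(parent), 200
X = rng.standard_normal((n, d))
w_star = tree_sparse_ground_truth(parent, rng=rng)
y = X @ w_star + rng.standard_normal(n)          # squared-loss labels
```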
[Figure 1: Duality gap convergence for the squared loss with the tree-structured norm (Example 1); log10 duality gap versus log10 iterations and time in seconds. Top: all algorithms (ADMM-η, CP-η, FISTA-η, Alt-η, FISTA-w, FISTA-w(p)). Bottom left: algorithms using prox_F. Bottom right: algorithms using prox_{F∘X}.]

6.1 Squared Loss

In this case, prox_{F∘X} can be computed efficiently using matrix-vector multiplications in each iteration, qualifying ADMM-η as a first-order algorithm. We make the following inferences from the simulation plots given in Figure 1. All the solvers for problem (3) do better than those for (1). Both FISTA-w and FISTA-w(p) do not converge within a limit of 30 minutes; the latter, though faster than the former, suffers from approximate solutions at each step. Both CP-η and ADMM-η beat all the compared algorithms in running time, justifying the claims made in the paper. In general, ADMM-η is faster than CP-η because of the efficient computation of prox_{F∘X} and the overall smaller number of iterations.

6.2 Hinge Loss

prox_{F∘X} is computed in this case by solving an SVM [W. Kienzle, 2006], which is based on an SMO-type algorithm (O(dn²)). Hence CP-η is the only first-order algorithm in this case. Similarly to the squared-loss case, both the proposed algorithms perform better than the rest, as shown in Figure 2.

[Figure 2: Duality gap convergence for the hinge loss with the tree-structured norm (Example 1); log10 duality gap versus log10 iterations and time in seconds. Top: all algorithms (ADMM-η, CP-η, FISTA-η, Alt-η). Bottom left: algorithms using prox_F. Bottom right: algorithms using prox_{F∘X}.]

           | ADMM-η                | CP-η
Sq. loss   | 458 / 341 s / 0.53 s  | 9260 / 734 s / 0.08 s
Hinge loss | 360 / 646 s / 1.97 s  | 12810 / 923 s / 0.07 s

Table 2: Number of iterations / total time / time per iteration.

6.3 Comparison of CP-η and ADMM-η

Although Table 2 may suggest that ADMM-η is better than CP-η for both loss functions, this is not always true, as it depends on d and n (see Table 1). When d increases, ADMM-η becomes expensive for the square loss due to its O(d²) per-step complexity. For the hinge loss, both ADMM-η and CP-η have a linear dependency on d, and hence, theoretically, CP-η has no advantage over ADMM-η in very high-dimensional settings for a fixed n. However, for ADMM-η, due to the quadratic dependency on n (see Table 1), the per-step cost grows in large-sample settings.
This dependency on n for ADMM-η is visible in Table 2, where the per-step cost of ADMM-η is much higher for the hinge loss than for the square loss.

7 Conclusions and Future Directions

We studied efficient primal-dual algorithms for learning with subquadratic norms. One may also investigate potential extensions of these algorithms towards general reweighted least-squares formulations for norms [Bach et al., 2011]. The study of inexact proximal operators for subquadratic norms provides another direction.

References

[Argyriou et al., 2008] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
[Bach et al., 2011] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2011.
[Bach, 2010] F. Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems (NIPS), 2010.
[Bach, 2011a] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145-373, 2011.
[Bach, 2011b] F. Bach. Shaping level sets with submodular functions. In Advances in Neural Information Processing Systems (NIPS), 2011.
[Baldassarre et al., 2012] Luca Baldassarre, Jean Morales, Andreas Argyriou, and Massimiliano Pontil. A general framework for structured sparsity via proximal optimization. In Artificial Intelligence and Statistics, pages 82-90, 2012.
[Beck and Teboulle, 2009] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, March 2009.
[Best and Chakravarti, 1990] M. J. Best and N. Chakravarti. Active set algorithms for isotonic regression: a unifying framework. Mathematical Programming, 47(1):425-439, 1990.
[Canu and Grandvalet, 1999] S. Canu and Y. Grandvalet. Outcomes of the equivalence of adaptive ridge with least absolute shrinkage. In Advances in Neural Information Processing Systems (NIPS), 1999.
[Chambolle and Pock, 2011] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, May 2011.
[Daubechies et al., 2010] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63(1):1-38, 2010.
[Jenatton et al., 2011a] R. Jenatton, A. Gramfort, V. Michel, G. Obozinski, F. Bach, and B. Thirion. Multi-scale mining of fMRI data with hierarchical structured sparsity. In Pattern Recognition in NeuroImaging (PRNI), 2011.
[Jenatton et al., 2011b] R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297-2334, July 2011.
[Kloft et al., 2011] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. lp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953-997, March 2011.
[Kyrillidis, 2016] Kyrillidis. Structured sparsity: discrete and convex approaches. Foundations and Trends in Machine Learning, 2016.
[Lee et al., 2006] Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y. Ng. Efficient l1 regularized logistic regression. In Neural Information Processing Systems, 2006.
[Mairal et al., 2014] J. Mairal, F. Bach, and J. Ponce. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3):85-283, 2014.
[McDonald et al., 2016] Andrew M. McDonald, Massimiliano Pontil, and Dimitris Stamos. New perspectives on k-support and cluster norms. Journal of Machine Learning Research, 17(155):1-38, 2016.
[Micchelli et al., 2013] C. Micchelli, J. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455-489, 2013.
[Nesterov, 2007] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Paper, 2007.
[Nesterov, 2013] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
[Obozinski and Bach, 2016] Guillaume Obozinski and Francis Bach. A unified perspective on convex structured sparsity: Hierarchical, symmetric, submodular norms and beyond. Technical report, HAL-01412385, December 2016.
[Obozinski et al., 2011] G. Obozinski, L. Jacob, and J.-P. Vert. Group lasso with overlaps: The latent group lasso approach. arXiv preprint arXiv:1110.0413v1, 2011.
[Pardalos and Xue, 1999] Panos M. Pardalos and Guoliang Xue. Algorithms for a class of isotonic regression problems. Algorithmica, 23(3):211-222, 1999.
[Parikh and Boyd, 2014] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123-231, 2014.
[Sankaran et al., 2017] Raman Sankaran, Francis Bach, and Chiranjib Bhattacharyya. Identifying groups of strongly correlated variables through smoothed ordered weighted l1-norms. In Artificial Intelligence and Statistics, pages 1123-1131, 2017.
[Scheinberg et al., 2014] Katya Scheinberg, Donald Goldfarb, and Xi Bai. Fast first-order methods for composite convex optimization with backtracking. Foundations of Computational Mathematics, 14(3):389-417, 2014.
[Tibshirani, 1994] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1994.
[Vapnik, 2000] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 2000.
[W. Kienzle, 2006] W. Kienzle and K. Chellapilla. Personalized handwriting recognition via biased regularization. In Proceedings of the International Conference on Machine Learning (ICML), 2006.
[Yan and Bien, 2015] Xiaohan Yan and Jacob Bien. Hierarchical sparse modeling: A choice of two group lasso formulations. arXiv preprint arXiv:1512.01631, 2015.