# orbit_regularization__8d8dc9b3.pdf

Orbit Regularization

Renato Negrinho Instituto de Telecomunicac oes Instituto Superior T ecnico 1049 001 Lisboa, Portugal renato.negrinho@gmail.com

Andr e F. T. Martins Instituto de Telecomunicac oes Instituto Superior T ecnico 1049 001 Lisboa, Portugal atm@priberam.pt

We propose a general framework for regularization based on group-induced majorization. In this framework, a group is deﬁned to act on the parameter space and an orbit is ﬁxed; to control complexity, the model parameters are conﬁned to the convex hull of this orbit (the orbitope). We recover several well-known regularizers as particular cases, and reveal a connection between the hyperoctahedral group and the recently proposed sorted ℓ1-norm. We derive the properties a group must satisfy for being amenable to optimization with conditional and projected gradient algorithms. Finally, we suggest a continuation strategy for orbit exploration, presenting simulation results for the symmetric and hyperoctahedral groups.

1 Introduction

The main motivation behind current sparse estimation methods and regularized empirical risk minimization is the principle of parsimony, which states that simple explanations should be preferred over complex ones. Traditionally, this has been done by deﬁning a function Ω: V R that evaluates the complexity of a model w V and trades off this quantity with a data-dependent term. The penalty function Ωis often designed to be a convex surrogate of an otherwise non-tractable quantity, a strategy which has led to important achievements in sparse regression [1], compressed sensing [2], and matrix completion [3], allowing to successfully recover parameters from highly incomplete information. Prior knowledge about the structure of the variables and the intended sparsity pattern, when available, can be taken into account when designing Ωvia sparsity-inducing norms [4]. Performance bounds under different regimes have been established theoretically [5, 6], contributing to a better understanding of the success and failure modes of these techniques.

In this paper, we introduce a new way to characterize the complexity of a model via the concept of group-induced majorization. Rather than regarding complexity in an absolute manner via Ω, we deﬁne it relative to a prototype model v V , by requiring that the estimated model w satisﬁes

where G is an ordering relation on V induced by a group G. This idea is rooted in majorization theory, a well-established ﬁeld [7, 8] which, to the best of our knowledge, has never been applied to machine learning. We therefore review these concepts in 2, where we show that this formulation subsumes several well-known regularizers and motivates new ones. Then, in 3, we introduce two important properties of groups that serve as building blocks for the rest of the paper: the notions of matching function and region cones. In 4, we apply these tools to the permutation and signed permutation groups, unveiling connections with the recent sorted ℓ1-norm [9] as a byproduct. In 5 we turn to algorithmic considerations, pinpointing the group-speciﬁc operations that make a group amenable to optimization with conditional and projected gradient algorithms.

Also at Priberam Labs, Alameda D. Afonso Henriques, 41 - 2 , 1000 123, Lisboa, Portugal.

Figure 1: Examples of orbitopes for the orthogonal group O(d) (left) and the hyperoctahedral group P (right). Shown are also the corresponding region cones, which in the case of O(d) degenerates into a ray.

A key aspect of our framework is a decoupling in which the group G captures the invariances of the regularizer, while the data-dependent term is optimized in the group orbitopes. In 6, we build on this intuition to propose a simple continuation algorithm for orbit exploration. Finally, 7 shows some simulation results, and we conclude in 8.

2 Orbitopes and Majorization

2.1 Vector Spaces and Groups

Let V be a vector space with an inner product , . We will be mostly concerned with the case where V = Rd, i.e., the d-dimensional real Euclidean space, but some of the concepts introduced here generalize to arbitrary Hilbert spaces. A group is a set G endowed with an operation : G G G satisfying closure (g h G, g, h G), associativity ((f g) h = f (g h), f, g, h G), existence of identity ( 1G G such that 1G g = g 1G = g, g G), and existence of inverses (each g G has an inverse g 1 G such that g g 1 = g 1 g = 1G). Throughout, we use boldface letters u, v, w, . . . for vectors, and g, h, . . . for group elements. We also omit the group operation symbol, writing gh instead of g h.

2.2 Group Actions, Orbits, and Orbitopes

A (left) group action of G on V [10] is a function ψ : G V V satisfying ψ(g, ψ(h, v)) = ψ(g h, v) and ψ(1G, v) = v for all g, h G and v V . When the action is clear from the context, we omit the letter ψ, writing simply gv for the action of the group element g on v, instead of ψ(g, v). In this paper, we always assume our actions are linear, i.e., g(c1v1 + c2v2) = c1gv1 + c2gv2 for scalars c1 and c2 and vectors v1 and v2. In some cases, we also assume they are norm-preserving, i.e., gv = v for any g G and v V . When V = Rd, we may regard the groups underlying these actions as subgroups of the general linear group GL(d) and of the orthogonal group O(d), respectively. GL(d) is the set of d-by-d invertible matrices, and O(d) the set of d-by-d orthogonal matrices {U Rd d | U U = UU = Id}, where Id denotes the d-dimensional identity matrix.

A group action deﬁnes an equivalence relation on V , namely w v iff there is g G such that w = gv. The orbit of a vector v V under the action of G is the set Gv := {gv | g G}, i.e., the vectors that result from acting on v with some element of G. Its convex hull is called the orbitope: OG(v) := conv(Gv). (2) Fig. 1 (left) illustrates this concept for the orthogonal group in R2. An important concept associated with group actions and orbitopes is that of G-majorization [7]:

Deﬁnition 1 Let v, w V . We say that w is G-majorized by v, denoted w G v, if w OG(v).

Proposition 2 If the group action is linear, then G is reﬂexive and transitive, i.e., it is a pre-order.

Proof: See supplemental material.

Group majorization plays an important role in the area of multivariate inequalities in statistics [11]. In this paper, we use this concept for representing model complexity, as described next.

2.3 Orbit Regularization

We formulate our learning problem as follows: minimize L(w) s.t. w G v, (3) where L : V R is a loss function, G is a given group, and v V is a seed vector. This formulation subsumes several well-known cases, outlined below.

ℓ2-regularization. If G := O(d) is the orthogonal group acting by multiplication, we recover ℓ2 regularization. Indeed, we have Gv = {Uv Rd | U O(d)} = {w Rd | w 2 = v 2}, for any seed v Rd. That is, the orbitope OG(v) = conv(Gv) becomes the ℓ2-ball with radius v 2. The only property of the seed that matters in this case is its ℓ2-norm.

Permutahedron. Let P be the symmetric group (also called the permutation group), which can be represented as the set of d-by-d permutation matrices. Given v Rd, the orbitope induced by v under P is the convex hull of all the permutations of v, which can be equivalently described as the vectors that are transformations of v through a doubly stochastic matrix:

OP(v) = conv{Pv | P P} = {Mv | M1 = 1, M 1 = 1, M 0}. (4)

This set is called the permutahedron [12]. We will revisit this case in 4.

Signed permutahedron. Let P be the hyperoctahedral group (also called the signed permutation group), i.e., the d-by-d matrices with entries in {0, 1} such that the sum of the absolute values in each row and column is 1. The action of P on Rd permutes the entries of vectors and arbitrarily switches signs. Given v Rd, the orbitope induced by v under P is:

OP (v) = conv{Diag(s)Pv | P P, s { 1}d}, (5)

where Diag(s) denotes a diagonal matrix formed by the entries of s. We call this set the signed permutahedron; it is depicted in Fig. 1 and will also be revisited in 4.

ℓ1 and ℓ -regularization. As a particular case of the signed permutahedron, we recover ℓ1 and ℓ balls by choosing seeds of the form v = γe1 (a scaled canonical basis vector) and v = γ1 (a constant vector), respectively, where γ is a scalar. In the ﬁrst case, we obtain the ℓ1-ball, OG(v) = γ conv({e1, . . . , ed}) and in the second case, we get the ℓ -ball OG(v) = γ conv({ 1}d).

Symmetric matrices with majorized eigenvalues. Let G := O(d) be again the orthogonal group, but now acting by conjugation on the vector space of d-by-d symmetric matrices, V = Sd. Given a seed v A Sd, its orbit is Gv = {UAU | U O(d)} = {U Diag(λ(A))U | U O(d)}, where λ(A) denotes a vector containing the eigenvalues of A in decreasing order (so we may assume without loss of generality that the seed is diagonal). The orbitope OG(v) becomes:

OG(v) := {B Sd | λ(B) P λ(A)}, (6)

which is the set of matrices whose eigenvalues are in the permutahedron OP(λ(A)) (see example above). This is called the Schur-Horn orbitope in the literature [8].

Squared matrices with majorized singular values. Let G := O(d) O(d) act on Rd d (the space of squared matrices, not necessarily symmetric) as g U,V A := UAV . Given a seed v A, its orbit is Gv = {UAV | U, V O(d)} = {U Diag(σ(A))V | U, V O(d)}, where σ(A) contains the singular values of A in decreasing order (so we may assume without loss of generality that the seed is diagonal and non-negative). The orbitope OG(v) becomes:

OG(v) := {B Rd d | σ(B) P σ(A)}, (7)

which is the set of matrices whose singular values are in the permutahedron OP(σ(A)).

Spectral and nuclear norm regularization. The previous case subsumes spectral and nuclear norm balls: indeed, for a seed A = γId, the orbitope becomes the convex hull of orthogonal matrices, which is the spectral norm ball {A Rd d | A 2 := σ1(A) γ}; while for a seed A = γ Diag(e1), the orbitope becomes the convex hull of rank-1 matrices with norm bounded by γ, which is the nuclear norm ball {A Rd d | A := P

i σi γ}. This norm has been widely used for low-rank matrix factorization and matrix completion [3].

Besides these examples, other regularization strategies, such as non-overlapping ℓ2,1 and ℓ ,1 norms [13, 4] can be obtained by considering products of the groups above. We omit details for space.

2.4 Relation with Atomic Norms

Atomic norms have been recently proposed as a toolbox for structured sparsity [6]. Let A V be a centrally symmetric set of atoms, i.e., v A iff v A. The atomic norm induced by A is deﬁned as w A := inf{t > 0 | w t conv(A)}. The corresponding atomic ball is the set {w | w A 1} = conv(A). Not surprisingly, orbitopes are often atomic norm balls.

Proposition 3 (Atomic norms) If G is a subgroup of the general linear group GL(d) and satisﬁes v Gv, then the set OG(v) is the ball of an atomic norm.

Proof: Under the given assumption, the set Gv is centrally symmetric, i.e., it satisﬁes w Gv iff w Gv (indeed, the left hand side implies that w = gv for some g G, and v Gv implies that v = hv for some h G, therefore, w = gh 1( v) = gh 1v Gv). As shown by Chandrasekaran et al. [6], this guarantees that . Gv satisﬁes the axioms of a norm.

Corollary 4 For any choice of seed, the signed permutahedron OP (v) and the orbitope formed by the squared matrices with majorized singular values are both atomic norm balls. If d is even and v is of the form v = (v+, v+), with v+ Rd/2 + , then the permutahedron OP(v) and the orbitope formed by the symmetric matrices with eigenvalues majorized by λ(v) are both atomic norm balls.

3 Matching Function and Region Cones

We now construct a unifying perspective that highlights the role of the group G. Two key concepts that play a crucial role in our analysis are that of matching function and region cone. In the sequel, these will work as building blocks for important algorithmic and geometric characterizations.

Deﬁnition 5 (Matching function) The matching function of G, m G : V V R, is deﬁned as: m G(u, v) := sup{ u, w | w Gv}. (8)

Intuitively, m G(u, v) aligns the orbits of u and v before taking the inner product. Note also that m G(u, v) = sup{ u, w | w OG(v)}, since we may equivalently maximize the linear objective over OG(v), which is the convex hull of Gv. We therefore have the following

Proposition 6 (Duality) Fix v V , and deﬁne the indicator function of the orbitope, IOG(v)(w) = 0 if w OG(v), and otherwise. The Fenchel dual of IOG(v) is m G(., v). As a consequence, letting L : V R is the Fenchel dual of the loss L, the dual problem of Eq. 3 is: maximize L ( u) m G(u, v) w.r.t. u V. (9)

Note that if . Gv is a norm (e.g., if the conditions of Prop. 3 are satisﬁed), then the statement above means that m G(., v) = . Gv is its dual norm. We will revisit this dual formulation in 4.

The following properties have been established in [14, 15].

Proposition 7 For any u, v V , we have: (i) m G(c1u, c2v) = c1c2m G(u, v) for c1, c2 0; (ii) m G(g1u, g2v) = m G(u, v) for g1, g2 G; (iii) m G(u, v) = m G(v, u). Furthermore, the following three statements are equivalent: (i) w G v, (ii) f(w) f(v) for all G-invariant convex functions f : V R, (iii) m G(u, w) m G(u, v) for all u V .

In the sequel, we always assume that G is a subgroup of the orthogonal group O(d). This implies that the orbitope OG(v) is compact for any v V (and therefore the sup in Eq. 8 can be replaced by a max), and that gv = v for any v V . Another important concept is that of the normal cone of a point w V with respect to the orbitope OG(v), denoted as NGv(w) and deﬁned as follows: NGv(w) := {u V | u, w w 0 w G v}. (10) Normal cones plays an important role in convex analysis [16]. The particular case of the normal cone at the seed v (illustrated in Fig. 1) is of great importance, as will be seen below.

Deﬁnition 8 (Region cone) Given v V , the region cone at v is KG(v) := NGv(v). It is the set of points that are maximally aligned with v in terms of the matching function: KG(v) = {u V | m G(u, v) = u, v }. (11)

4 Permutahedra and Sorted ℓ1-Norms

In this section, we focus on the permutahedra introduced in 2. Below, given a vector w Rd, we denote by w(k) its kth order statistic, i.e., we will sort w so that w(1) w(2) . . . w(d). We also consider the order statistics of the magnitudes |w|(k) by sorting the absolute values.

4.1 Signed Permutahedron

We start by deﬁning the sorted ℓ1-norm, proposed by Bogdan et al. [9] in their recent SLOPE method as a means to control the false discovery rate, and studied by Zeng and Figueiredo [17].

Deﬁnition 9 (Sorted ℓ1-norm) Let v, w Rd, with v1 v2 . . . vd 0 and v1 > 0. The sorted ℓ1-norm of w (weighted by v) is deﬁned as: w SLOPE,v := Pd j=1 vj|w|(j).

In [9] it is shown that . SLOPE,v satisﬁes the axioms of a norm. The rationale is that larger components of w are penalized more than smaller ones, in a way controlled by the prescribed v. For v = 1, we recover the standard ℓ1-norm, while the ℓ -norm corresponds to v = e1. Another special case is the OSCAR regularizer [18, 19], w OSCAR,τ1,τ2 := τ1 w 1 + τ2 P

i<j max{|wi|, |wj|}, corresponding to a linearly spaced v, vj = (τ1 + τ2(d j)) for j = 1, . . . , d. The next proposition reveals a connection between SLOPE and the atomic norm induced by the signed permutahedron.

Proposition 10 Let v Rd + be as in Def. 9. The sorted ℓ1-norm weighted by v and the atomic norm induced by the P -orbitope seeded at v are dual to each other: . P v = . SLOPE,v.

Proof: From Prop. 6, we have w P v = m P (w, v). Let P be a signed permutation matrix s.t. w := Pw has its components sorted by decreasing magnitude, | w|1 . . . | w|d. From Prop. 7, we have m P (w, v) = m( w, v) = | w|, v = w SLOPE,v.

The next proposition [7, 14] provides a characterization of the P -orbitope in terms of inequalities about the cumulative distribution of the order statistics.

Proposition 11 (Submajorization ordering) The orbitope OP (v) can be characterized as:

OP (v) = n w Rd P

j i |w|(j) P

j i |v|(j), i = 1, . . . , d o . (12)

Prop. 11 leads to a precise characterization of the atomic norm w P v, and therefore of the dual norm of SLOPE: w P v = maxi=1,...,d P

j i |w|(j)/P

j i |v|(j).

4.2 Permutahedron

The unsigned counterpart of Prop. 11 goes back to Hardy et al. [20].

Proposition 12 (Majorization ordering) The P-orbitope seeded at v can be characterized as:

OP(v) = n w Rd 1 w = 1 v P

j i v(j), i = 1, . . . , d 1 o . (13)

As seen in Corollary 4, if d is even and v = (v+, v+), with v 0, then w Pv qualiﬁes as a norm (we need to conﬁne to the linear subspace V := {w Rd | Pd j=1 wj = 0}). From Prop. 12, we have that this norm can be written as: w Pv = maxi=1,...,d 1 P

Proposition 13 Assume the conditions above hold and that v1 v2 . . . vd/2 0 and v1 > 0. The dual norm of . Pv is w Pv = Pd/2 j=1 vj(w(j) w(d j+1)).

Proof: Similar to the proof of Prop. 11.

5 Conditional and Projected Gradient Algorithms

Two important classes of algorithms in sparse modeling are the conditional gradient method [21, 22] and the proximal gradient method [23, 24]. Under Ivanov regularization as in Eq. 3, the latter reduces to the projected gradient method. In this section, we show that both algorithms are a good ﬁt for solving Eq. 3 for arbitrary groups, as long as the two building blocks mentioned in 3 are available: (i) a procedure for evaluating the matching function (necessary for conditional gradient methods) and (ii) a procedure for projecting onto the region cone (necessary for projected gradient).

1: Initialize w1 = 0 2: for t = 1, 2, . . . do 3: ut = arg maxu Gv L(wt), u 4: ηt = 2/(t + 2) 5: wt+1 = (1 ηt)wt + ηtut 6: end for

1: Initialize w1 = 0 2: for t = 1, 2, . . . do 3: Choose a stepsize ηt 4: a = wt ηt L(wt) 5: wt+1 = arg minw Gv w a 6: end for

Figure 2: Conditional gradient (left) and projected gradient (right) algorithms.

5.1 Conditional Gradient

The conditional gradient method is shown in Fig. 2 (left). We assume that a procedure is available for computing the gradient of the loss. The relevant part is the maximization in line 3, which corresponds precisely to an evaluation of the matching function m(s, v), with s = L(wt) (cf. Eq. 8). Fortunately, this step is efﬁcient for a variety of cases:

Permutations. If G = P, the matching function can be evaluated in time O(d log d) with a simple sort operation. Without losing generality, we assume the seed v is sorted in descending order (otherwise, pre-sort it before the main loop starts). Then, each time we need to evaluate m(s, v), we compute a permutation P such that Ps is also sorted. The minimizer in line 3 will equal P 1v.

Signed permutations. If G = P , a similar procedure with the same O(d log d) runtime also works, except that now we sort the absolute values, and set the signs of P 1v to match those of s.

Symmetric matrices with majorized eigenvalues. Let A = UAλ(A)U A Sd and B = UBλ(B)U B Sd, where the eigenvalues λ(A) and λ(B) are sorted in decreasing order. In this case, the matching function becomes m G(A, B) = max V O(d) trace(A V BV ) = λ(A), λ(B) due to von Neumann s trace inequality [25], the maximizer being V = UAU B . Therefore, we need only to make an eigen-decomposition and set B = UAλ(B)U A .

Squared matrices with majorized singular values. Let A = UAσ(A)V A Rd d and B = UBσ(B)V B Rd d, where the singular values are sorted. We have m G(A, B) = max U,V O(d) trace(A UBV ) = σ(A), σ(B) also from von Neumann s inequality [25]. To evaluate the matching function, we need only to make an SVD and set B = UAσ(B)V A .

5.2 Projected Gradient

The projected gradient algorithm is illustrated in Fig. 2 (right); the relevant part is line 5, which involves a projection onto the orbitope OG(v). This projection may be hard to compute directly, since the orbitope may lack a concise half-space representation. However, we next transform this problem into a projection onto the region cone KG(v) (the proof is in the supplemental material).

Proposition 14 Assume G is a subgroup of O(d). Let g G be such that a, gv = m G(a, v). Then, the solution of the problem in line 5 is w = a ΠKG(gv)(a gv).

Thus, all is necessary is computing the arg-max associated with the matching function, and a black box that projects onto the region cone KG(v). Again, this step is efﬁcient in several cases:

Permutations. If G = P, the region cone of a point v is the set of points w satisfying vi > vj wi wj, for all i, j 1, . . . , d. Projecting onto this cone is a well-studied problem in isotonic regression [26, 27], with existing O(d) algorithms.

Signed permutations. If G = P , this problem is precisely the evaluation of the proximity operator of the sorted ℓ1-norm, also solvable in O(d) runtime with a stack-based algorithm [9].

6 Continuation Algorithm

Finally, we present a general continuation procedure for exploring regularization paths when L is a convex loss function (not necessarily differentiable) and the seed v is not prescribed. The

Require: Factor ϵ > 0, interpolation parameter α [0, 1]

1: Initialize seed v0 randomly and set v0 = ϵ 2: Set t = 0 3: repeat 4: Solve wt = arg minw Gvt L(w) 5: Pick v t Gvt KG(wt) 6: Set next seed vt+1 = (1 + ϵ)(αv t + (1 α)wt) 7: t t + 1 8: until wt Gvt < 1. 9: Use cross-validation to choose the best bw {w1, w2, . . .}

Figure 3: Left: Continuation algorithm. Right: Reachable region WG for the hyperoctahedral group, with a reconstruction loss L(w) = w a 2. Only points v s.t. L(v) = a v KG(v) belong to this set. Different initializations of v0 lead to different paths along WG, all ending in a.

procedure outlined in Fig. 3 solves instances of Eq. 3 for a sequence of seeds v1, v2, . . ., using a simple heuristic for choosing the next seed given the previous one and the current solution.

The basic principle behind this procedure is the same as in other homotopy continuation methods [28, 29, 30, 31]: we start with very strong regularization (using a small norm ball), and then gradually weaken the regularization (increasing the ball) while tracking the solution. The process stops when the solution is found to be in the interior of the ball (the condition in line 8), which means the regularization constraint is no longer active. The main difference with respect to classical homotopy methods is that we do not just scale the ball (in our case, the G-orbitope); we also generate new seeds that shape the ball along the way. To do so, we adopt a simple heuristic (line 6) to make the seed move toward the current solution wt before scaling the orbitope. This procedure depends on the initialization (see Fig. 3 for an illustration), which drives the search into different regions. Reasoning in terms of groups, line 4 makes us move inside the orbits, while line 6 is an heuristic to jump to a nearby orbit. For any choice of ϵ > 0 and α [0, 1], the algorithm is convergent and produces a strictly decreasing sequence L(w1) > L(w2) > before it terminates (a proof is provided as supplementary material). We expect that, eventually, a seed v will be generated that is close to the true model bw. Although it may not be obvious at ﬁrst sight why would it be desirable that v bw, we provide a simple result below (Prop. 15) that sheds some light on this matter, by characterizing the set of points in V that are reachable by optimizing Eq. 3.

From the optimality conditions of convex programming [32, p. 257], we have that w is a solution of the optimization problem in Eq. 3 if and only if 0 L(w ) + NGv(w ), where L(w) denotes the subdifferential of L at w, and NGv(w) is the normal cone to OG(v) at w, deﬁned in 3. For certain seeds v V , it may happen that the optimal solution w of Eq. 3 is the seed itself. Let WG be the set of seeds with this property:

WG := {v V | L(v) L(w), w G v} = {v V | 0 L(v) + KG(v)}, (14)

where KG(v) is the region cone and the right hand side follows from the optimality conditions. We next show that this set is all we need to care about.

Proposition 15 Consider the set of points that are solutions of Eq. 3 for some seed v V , c WG := w V v V : w arg minw Gv L(w) . We have c WG = WG.

Proof: Obviously, v WG v c WG. For the reverse direction, suppose that w c WG, in which case there is some v V such that w G v and L(w ) L(w) for any w G v. Since G is a pre-order, it must hold in particular that L(w ) L(w) for any w G w G v. Therefore, we also have that w arg minw Gw L(w), i.e., w WG.

7 Simulation Results

We describe the results of numerical experiments when regularizing with the permutahedron (symmetric group) and the signed permutahedron (hyperoctahedral group). All problems were solved using the conditional gradient algorithm, as described in 5. We generated the true model bw Rd

Figure 4: Learning curves for the permutahedron and signed permutahedron regularizers with a perfect seed. Shown are averages and standard deviations over 10 trials. The baselines are ℓ1 (three leftmost plots, resp. with k = 150, 250, 400), and ℓ2 (last plot, with k = 500).

Figure 5: Mean squared errors in the training set (left) and the test set (right) along the regularization path. For the permutahedra regularizers, this path was traced with the continuation algorithm. The baseline is ℓ1 regularization. The horizontal lines in the right plot show the solutions found with validation in a held-out set.

by sampling the entries from a uniform distribution in [0, 1] and subtracted the mean, keeping k d nonzeros; after which bw was normalized to have unit ℓ2-norm. Then, we sampled a random nby-d matrix X with i.i.d. Gaussian entries and variance σ2 = 1/d, and simulated measurements y = X bw + n, where n N(0, σ2 n) is Gaussian noise. We set d = 500 and σn = 0.3σ.

For the ﬁrst set of experiments (Fig. 4), we set k {150, 250, 400, 500} and varied the number of measurements n. To assess the advantage of knowing the true parameters up to a group transformation, we used for the orbitope regularizers a seed in the orbit of the true bw, up to a constant factor (this constant, and the regularization constants for ℓ1 and ℓ2, were all chosen with validation in a held-out set). As expected, this information was beneﬁcial, and no signiﬁcant difference was observed between the permutahedron and the signed permutahedron. For the second set of experiments (Fig. 5), where the aim is to assess the performance of the continuation method, no information about the true model was given. Here, we ﬁxed n = 250 and k = 300 and ran the continuation algorithm with ϵ = 0.1 and α = 0.0, for 5 different initializations of v0. We observe that this procedure was effective at exploring the orbits, eventually ﬁnding a slightly better model than the one found with ℓ1 and ℓ2 regularizers.

8 Conclusions and Future Work

In this paper, we proposed a group-based regularization scheme using the notion of orbitopes. Simple choices of groups recover commonly used regularizers such as ℓ1, ℓ2, ℓ , spectral and nuclear matrix norms; as well as some new ones, such as the permutahedron and signed permutahedron. As a byproduct, we revealed a connection between the permutahedra and the recently proposed sorted ℓ1-norm. We derived procedures for learning with these orbit regularizers via conditional and projected gradient algorithms, and a continuation strategy for orbit exploration.

There are several avenues for future research. For example, certain classes of groups, such as reﬂection groups [33], have additional properties that may be exploited algorithmically. Our work should be regarded as a ﬁrst step toward group-based regularization we believe that the regularizers studied here are just the tip of the iceberg. Groups and their representations are well studied in other disciplines [10], and chances are high that this framework can lead to new regularizers that are a good ﬁt to speciﬁc machine learning problems.

Acknowledgments

We thank all reviewers for their valuable comments. This work was partially supported by FCT grants PTDC/EEI-SII/2312/2012 and PEst-OE/EEI/LA0008/2011, and by the EU/FEDER programme, QREN/POR Lisboa (Portugal), under the Intelligo project (contract 2012/24803).

[1] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B., pages 267 288, 1996. [2] D. Donoho. Compressed Sensing. IEEE Transactions on Information Theory, 52(4):1289 1306, 2006. [3] E. Cand es and B. Recht. Exact Matrix Completion via Convex Optimization. Foundations of Computational Mathematics, 9(6):717 772, 2009. [4] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In Optimization for Machine Learning. MIT Press, 2011. [5] S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu. A Uniﬁed Framework for High-Dimensional Analysis of M-estimators with Decomposable Regularizers. In Neural Information Processing Systems, pages 1348 1356, 2009. [6] V. Chandrasekaran, B. Recht, P. Parrilo, and A. Willsky. The Convex Geometry of Linear Inverse Problems. Foundations of Computational Mathematics, 12(6):805 849, 2012. [7] A. Marshall, I. Olkin, and B. Arnold. Inequalities: Theory of Majorization and Its Applications. Springer, 2010. [8] R. Sanyal, F. Sottile, and B. Sturmfels. Orbitopes. Technical report, ar Xiv:0911.5436, 2009. [9] M. Bogdan, E. Berg, W. Su, and E. Cand es. Statistical estimation and testing via the ordered ℓ1 norm. Technical report, ar Xiv:1310.1969, 2013. [10] J. Serre and L. Scott. Linear Representations of Finite Groups, volume 42. Springer, 1977. [11] Y. Tong. Probability Inequalities in Multivariate Distributions, volume 5. Academic Press New York, 1980. [12] G. Ziegler. Lectures on Polytopes, volume 152. Springer, 1995. [13] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B (Statistical Methodology), 68(1):49, 2006. [14] M. Eaton. On group induced orderings, monotone functions, and convolution theorems. Lecture Notes Monograph Series, pages 13 25, 1984. [15] A. Giovagnoli and H. Wynn. G-Majorization with Applications to Matrix Orderings. Linear algebra and its applications, 67:111 135, 1985. [16] R.T. Rockafellar. Convex Analysis. Princeton University Press, 1970. [17] X. Zeng and M. A. T. Figueiredo. Decreasing weighted sorted ℓ1 regularization. Technical report, ar Xiv:1404.3184, 2014. [18] H. Bondell and B. Reich. Simultaneous Regression Shrinkage, Variable Selection, and Supervised Clustering of Predictors with OSCAR. Biometrics, 64(1):115 123, 2008. [19] L. Zhong and J. Kwok. Efﬁcient Sparse Modeling with Automatic Feature Grouping. IEEE Transactions on Neural Networks and Learning Systems, 23(9):1436 1447, 2012. [20] G. Hardy, J. Littlewood, and G. P olya. Inequalities. Cambridge University Press, 1952. [21] M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval research logistics quarterly, 3 (1-2):95 110, 1956. [22] M. Jaggi. Revisiting Frank-Wolfe: Projection-free Sparse Convex Optimization. In Proc. of the International Conference on Machine Learning, pages 427 435, 2013. [23] S.J. Wright, R. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479 2493, 2009. [24] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183 202, 2009. [25] L. Mirsky. A trace inequality of john von neumann. Monatshefte f ur Mathematik, 79(4):303 306, 1975. [26] P. Pardalos and G. Xue. Algorithms for a Class of Isotonic Regression Problems. Algorithmica, 23(3): 211 222, 1999. [27] R. Luss, S. Rosset, and M. Shahar. Decomposing Isotonic Regression for Efﬁciently Solving Large Problems. In Neural Information Processing Systems, pages 1513 1521, 2010. [28] M.R. Osborne, B. Presnell, and B.A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20:389 403, 2000. [29] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of statistics, 32: 407 499, 2004. [30] M. A. T. Figueiredo, R. Nowak, and S. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586 597, 2007. [31] E. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for l1-minimization: Methodology and convergence. SIAM Journal on Optimization, 19:1107 1130, 2008. [32] D.P. Bertsekas, A. Nedic, and A.E. Ozdaglar. Convex analysis and optimization. Athena Scientiﬁc, 2003. [33] A. Steerneman. g-majorization, group-induced cone orderings, and reﬂection groups. Linear Algebra and its Applications, 127:107 119, 1990. [34] J.J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. CR de l Acad emie des Sciences de Paris S erie A Mathematics, 255:2897 2899, 1962.