# Blended Matching Pursuit

Cyrille W. Combettes (Georgia Institute of Technology, Atlanta, GA, USA, cyrille@gatech.edu) and Sebastian Pokutta (Zuse Institute Berlin and TU Berlin, Berlin, Germany, pokutta@zib.de)

Abstract

Matching pursuit algorithms are an important class of algorithms in signal processing and machine learning. We present a blended matching pursuit algorithm, combining coordinate descent-like steps with stronger gradient descent steps, for minimizing a smooth convex function over a linear space spanned by a set of atoms. We derive sublinear to linear convergence rates according to the smoothness and sharpness orders of the function and demonstrate computational superiority of our approach. In particular, we derive linear rates for a large class of non-strongly convex functions, and we demonstrate in experiments that our algorithm enjoys very fast rates of convergence and wall-clock speed while maintaining a sparsity of iterates very comparable to that of the (much slower) orthogonal matching pursuit.

1 Introduction

Let H be a separable real Hilbert space, D ⊆ H be a dictionary, and f : H → ℝ be a smooth convex function. In this paper, we aim at solving the problem:

Find a solution to min_{x∈H} f(x) which is sparse relative to D. (1)

Together with fast convergence, achieving high sparsity, i.e., keeping the iterates as linear combinations of a small number of atoms in the dictionary D, is a primary objective and leads to better generalization, interpretability, and decision-making in machine learning. In signal processing, Problem (1) encompasses a wide range of applications, including compressed sensing, signal denoising, and information retrieval, and is often solved with the Matching Pursuit algorithm [Mallat and Zhang, 1993].
Our approach is inspired by the Blended Conditional Gradients algorithm [Braun et al., 2019], which solves the constrained analogue of Problem (1), i.e., minimizing f over the convex hull conv(D) of the dictionary, and is ultimately based on the Frank-Wolfe algorithm [Frank and Wolfe, 1956], a.k.a. Conditional Gradient algorithm [Levitin and Polyak, 1966]. It enhances the vanilla Frank-Wolfe algorithm by replacing the linear minimization oracle with a weak-separation oracle [Braun et al., 2017] and by blending the traditional Frank-Wolfe steps with lazified Frank-Wolfe steps and projected gradient steps, while still avoiding projections. Frank-Wolfe algorithms are particularly well-suited for problems with a desired sparsity in the solution (see, e.g., Jaggi [2013] and the references therein); however, from an optimization perspective, although they approximate the optimal descent direction −∇f(x_t) via the linear minimization oracle v_t^FW ∈ argmin_{v∈D} ⟨∇f(x_t), v⟩, they move in the direction v_t^FW − x_t in order to ensure feasibility, which provides less progress. An analogy between Frank-Wolfe algorithms and the unconstrained Problem (1) was proposed by Locatello et al. [2017]. They unified the Frank-Wolfe and Matching Pursuit algorithms, and proposed a Generalized Matching Pursuit algorithm (GMP) and an Orthogonal Matching Pursuit algorithm (OMP) for solving Problem (1), which descend in the directions v_t^FW. Essentially, Locatello et al. [2017] established that GMP corresponds to the vanilla Frank-Wolfe algorithm and OMP corresponds to the Fully-Corrective Frank-Wolfe algorithm. GMP and OMP converge with similar rates in the various regimes, namely with a sublinear rate for smooth convex functions and with a linear rate for smooth strongly convex functions; however, they have different advantages: GMP converges (much) faster in wall-clock time while OMP offers (much) sparser iterates.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The interest in these algorithms stems from the fact that they work in the general setting of smooth convex functions in Hilbert spaces and that their convergence analyses do not require incoherence or restricted isometry properties (RIP, Candès and Tao [2005]) of the dictionary, which are quite strong assumptions from an optimization standpoint. For an in-depth discussion of the advantages of GMP and OMP over other methods, e.g., in Tropp [2004], Gribonval and Vandergheynst [2006], Davenport and Wakin [2010], Shalev-Shwartz et al. [2010], Temlyakov [2013, 2014, 2015], Tibshirani [2015], Yao and Kwok [2016], and Nguyen and Petrova [2017], we refer the interested reader to Locatello et al. [2017]. In a follow-up work, Locatello et al. [2018] presented an Accelerated Matching Pursuit algorithm. We aim at unifying the best of GMP (speed) and OMP (sparsity) into a single algorithm by blending them strategically. However, while the overall idea is reasonably natural, we face considerable challenges as many important features of Frank-Wolfe methods no longer apply in the Matching Pursuit setting and cannot be as easily overcome as in Locatello et al. [2017], requiring a different analysis. For example, Frank-Wolfe (duality) gaps are not readily available but they are crucial in monitoring the blending, and further key components, such as the weak-separation oracle, require modifications.

Contributions. We propose a Blended Matching Pursuit algorithm (BMP), a fast and sparse first-order method for solving Problem (1). Our method unifies the best of GMP (speed) and OMP (sparsity) into one algorithm, which is of fundamental interest for practitioners.
We establish a continuous range of convergence rates between O(1/ε^p) and O(ln(1/ε)), where ε > 0 is the desired accuracy and p > 0 depends on the properties of the function. In particular, we derive linear rates of convergence for a large class of smooth convex but non-strongly convex functions. Lastly, we demonstrate the computational superiority of BMP over state-of-the-art methods, with BMP converging the fastest in wall-clock time while maintaining its iterates at close-to-optimal sparsity, and this without requiring sparsity-inducing constraints.

Outline. We introduce notions and notation in Section 2. We present the Blended Matching Pursuit algorithm in Section 3, with the convergence analyses in Section 3.1. Computational experiments are provided in Section 4. Additional experiments and the proofs can be found in the Appendix.

2 Preliminaries

We work in a separable real Hilbert space (H, ⟨·,·⟩) with induced norm ∥·∥. A set D ⊆ H of normalized vectors is a dictionary if it is at most countable and cl(span(D)) = H, and in this case its elements are referred to as atoms. For any set S ⊆ H, let S^± := S ∪ (−S) denote the symmetrization of S and D_S := sup_{u,v∈S} ∥u − v∥ denote the diameter of S. If S is closed and convex, let proj_S denote the orthogonal projection onto S and dist(x, S) := ∥x − proj_S(x)∥ denote the distance of x to S. For Problem (1) to be feasible, we will assume f to be coercive, i.e., lim_{∥x∥→+∞} f(x) = +∞. Since f is convex, this is actually a mild assumption when argmin_H f ≠ ∅. Let f : H → ℝ be a Fréchet differentiable function. In the following, we use extended notions of smoothness and strong convexity by introducing orders, and we weaken and generalize the notion of strong convexity to that of sharpness (see, e.g., Roulet and d'Aspremont [2017] and Kerdreux et al. [2019] for recent work).
We say that f is:

(i) smooth of order ℓ > 1 if there exists L > 0 such that for all x, y ∈ H, f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ (L/ℓ) ∥y − x∥^ℓ;

(ii) strongly convex of order s > 1 if there exists S > 0 such that for all x, y ∈ H, f(y) − f(x) − ⟨∇f(x), y − x⟩ ≥ (S/s) ∥y − x∥^s;

(iii) sharp of order θ ∈ ]0, 1[ on K if K ⊆ H is a bounded set, ∅ ≠ argmin_H f ⊆ int(K), and there exists C > 0 such that for all x ∈ K, dist(x, argmin_H f) ≤ C (f(x) − min_H f)^θ.

If needed, we may specify the constants by introducing f as L-smooth, S-strongly convex, or C-sharp. The following fact, whose result was already used in Nemirovskii and Nesterov [1985], provides a bound on the sharpness order of a smooth function. A proof is available in Appendix D.

Fact 2.1. Let f : H → ℝ be smooth of order ℓ > 1, convex, and sharp of order θ ∈ ]0, 1[ on K. Then θ ∈ ]0, 1/ℓ].

2.1 On sharpness and strong convexity

Notice that if f : H → ℝ is Fréchet differentiable and strongly convex of order s > 1, then card(argmin_H f) = 1. Let {x*} := argmin_H f. It follows directly from ∇f(x*) = 0 that for any bounded set K ⊆ H such that x* ∈ int(K), f is sharp of order θ = 1/s on K. Thus, strong convexity implies sharpness. However, not every sharp function is strongly convex; moreover, the next example shows that not even every sharp and convex function is strongly convex.

Example 2.2 (Distance to a convex set). Let C ⊆ H be a nonempty, closed, and bounded convex set, and K ⊆ H be a bounded set such that C ⊆ int(K). The function f : x ∈ H ↦ dist(x, C)² = ∥x − proj_C(x)∥² is convex, and it is sharp of order θ = 1/2 on K. Indeed, since argmin_H f = C and min_H f = 0, we have for all x ∈ K, dist(x, argmin_H f) = ∥x − proj_C(x)∥ = (f(x) − min_H f)^{1/2}. Now, suppose C contains more than one element. Then f has more than one minimizer. However, a function that is strongly convex of order s > 1 has no more than one minimizer. Therefore, f cannot be strongly convex of order s, for any s > 1. Notice that f is also a smooth function, of order ℓ = 2. Hence, sharpness is a more general notion than strong convexity.
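To make Example 2.2 concrete, the following sketch (our own illustration, not from the paper; the choice of C as the Euclidean unit ball and all names are assumptions) numerically checks that f(x) = dist(x, C)² satisfies the sharpness identity dist(x, argmin f) = (f(x) − min f)^{1/2}, i.e., sharpness with constant C = 1 and order θ = 1/2:

```python
import numpy as np

def proj_unit_ball(x):
    """Orthogonal projection onto the closed Euclidean unit ball C."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def f(x):
    """f(x) = dist(x, C)^2; here argmin f = C and min f = 0."""
    return float(np.linalg.norm(x - proj_unit_ball(x)) ** 2)

rng = np.random.default_rng(0)
for _ in range(100):
    x = 3.0 * rng.normal(size=2)
    lhs = np.linalg.norm(x - proj_unit_ball(x))  # dist(x, argmin f)
    rhs = f(x) ** 0.5                            # (f(x) - min f)^(1/2)
    assert abs(lhs - rhs) < 1e-12                # sharp with constant 1, theta = 1/2
```

Since every point of the ball minimizes f, the minimizer is not unique, which is exactly why f cannot be strongly convex of any order.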
It is a local condition around the optimal solutions, while strong convexity is a global condition. In fact, building on the Łojasiewicz inequality of Łojasiewicz [1963], [Bolte et al., 2007, Equation (15)] showed that sharpness always holds in finite-dimensional spaces for reasonably well-behaved convex functions; see Lemma 2.3. Polynomial convex functions, the ℓ_p-norms, the Huber loss (see Appendix A.4), and the rectifier ReLU are simple examples of such functions.

Lemma 2.3. Let f : ℝⁿ → ]−∞, +∞] be a lower semicontinuous, convex, and subanalytic function with {x ∈ ℝⁿ | 0 ∈ ∂f(x)} ≠ ∅. Then for any bounded set K ⊆ ℝⁿ, there exist θ ∈ ]0, 1[ and C > 0 such that for all x ∈ K, dist(x, argmin_{ℝⁿ} f) ≤ C (f(x) − min_{ℝⁿ} f)^θ.

Strong convexity is a standard requirement to prove linear convergence rates on smooth convex objectives but, regrettably, this considerably restricts the set of candidate functions. For our Blended Matching Pursuit algorithm, we will only require sharpness to establish linear convergence rates, thus covering a larger class of functions.

2.2 Matching Pursuit algorithms

For y ∈ H and f : x ∈ H ↦ ∥y − x∥²/2, Problem (1) falls in the area of sparse recovery and is often solved with the Matching Pursuit algorithm [Mallat and Zhang, 1993]. The algorithm recovers a sparse representation of the signal y from the dictionary D by sequentially pursuing the best matching atom. At each iteration, it searches for the atom v_t ∈ D most correlated with the residual y − x_t, i.e., v_t := argmax_{v∈D} |⟨y − x_t, v⟩|, and adds it to the linear decomposition of the current iterate x_t to form the new iterate x_{t+1}, keeping track of the active set S_{t+1} = S_t ∪ {v_t}. However, this does not prevent the algorithm from selecting atoms that have already been added in earlier iterations or that are redundant, hence affecting sparsity.
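As a concrete reference point, a minimal sketch of this classical Matching Pursuit loop for f(x) = ∥y − x∥²/2 over a finite dictionary might look as follows (our own illustrative code under assumed names and data layout, not the paper's implementation):

```python
import numpy as np

def matching_pursuit(y, D, T):
    """Vanilla MP: D is an (n, m) array whose columns are unit-norm atoms.

    Each iteration picks the atom most correlated with the residual
    y - x_t and takes the exact line-search step along it.
    """
    x = np.zeros_like(y)
    active = set()
    for _ in range(T):
        corr = D.T @ (y - x)              # <y - x_t, v> for every atom v
        i = int(np.argmax(np.abs(corr)))  # best matching atom (sign carried by corr[i])
        x = x + corr[i] * D[:, i]         # exact step for this quadratic f
        active.add(i)                     # atoms may be re-selected, hurting sparsity
    return x, active
```

For unit-norm atoms, the step size ⟨y − x_t, v_i⟩ is the exact minimizer of f along the line x_t + ℝv_i, which is what makes this loop a special case of the generalized algorithms discussed next.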
The Orthogonal Matching Pursuit variant [Pati et al., 1993, Davis et al., 1994] overcomes this by computing the new iterate as the projection of the signal y onto span(S_t ∪ {v_t}); see Chen et al. [1989] and Tropp [2004] for analyses and Zhang [2009] for an extension to the stochastic case. Thus, y − x_{t+1} becomes orthogonal to the active set. In order to solve Problem (1) for any smooth convex objective, Locatello et al. [2017] proposed the Generalized Matching Pursuit (GMP) and Generalized Orthogonal Matching Pursuit (GOMP) algorithms (Algorithm 1); slightly abusing notation, we will refer to the latter simply as Orthogonal Matching Pursuit (OMP). The atom selection subroutine is implemented with a Frank-Wolfe linear minimization oracle argmin_{v∈D^±} ⟨∇f(x_t), v⟩ (Line 3). The solution v_t to this oracle is guaranteed to be a descent direction as it satisfies ⟨∇f(x_t), v_t⟩ ≤ 0 by symmetry of D^±, and ⟨∇f(x_t), v_t⟩ = 0 if and only if x_t ∈ argmin_H f. Notice that for y ∈ H and f : x ∈ H ↦ ∥y − x∥²/2, the GMP and OMP variants of Algorithm 1 recover the original Matching Pursuit and Orthogonal Matching Pursuit algorithms respectively. In particular, up to a sign which does not affect the sequence of iterates, argmax_{v∈D} |⟨y − x_t, v⟩| ≡ argmin_{v∈D^±} ⟨∇f(x_t), v⟩. In practice, the main difference in the case of general smooth convex functions is that the OMP variant (Line 6) is much more expensive, as a closed-form solution to this projection step is no longer available. Hence, Line 6 is typically a sequence of projected gradient steps and OMP is significantly slower than GMP to converge.

Algorithm 1 Generalized/Orthogonal Matching Pursuit (GMP/OMP)
Input: Start atom x_0 ∈ D, number of iterations T ∈ ℕ.
Output: Iterates x_1, ..., x_T ∈ span(D).
1: S_0 ← {x_0}
2: for t = 0 to T − 1 do
3:   v_t ← argmin_{v∈D^±} ⟨∇f(x_t), v⟩
4:   S_{t+1} ← S_t ∪ {v_t}
5:   GMP variant: x_{t+1} ← argmin_{x ∈ x_t + ℝv_t} f(x)
6:   OMP variant: x_{t+1} ← argmin_{x ∈ span(S_{t+1})} f(x)

2.3 Weak-separation oracle

We present in Oracle 2 the weak-separation oracle, a modified version of the one first introduced in Braun et al. [2017] and used in, e.g., Lan et al. [2017] and Braun et al. [2019]. Note that the modification asks for an unconstrained improvement, whereas the original weak-separation oracle required an improvement relative to a reference point. As such, our variant here is even simpler than the original weak-separation oracle. The oracle is called in Line 11 by the Blended Matching Pursuit algorithm.

Oracle 2 Weak-separation LPsep_D(c, φ, κ)
Input: Linear objective c ∈ H, objective value φ ≤ 0, accuracy κ ≥ 1.
Output: Either an atom v ∈ D^± such that ⟨c, v⟩ ≤ φ/κ (positive call), or false, ensuring ⟨c, z⟩ > φ for all z ∈ conv(D^±) (negative call).

The weak-separation oracle determines whether there exists an atom v ∈ D^± such that ⟨c, v⟩ ≤ φ/κ, and thereby relaxes the Frank-Wolfe linear minimization oracle. If not, then this implies that conv(D^±) can be separated from the ambient space by c and φ with the linear inequality ⟨c, z⟩ > φ for all z ∈ conv(D^±). In practice, the oracle can be efficiently implemented using caching, i.e., first testing atoms that were already returned during previous calls, as they may satisfy the condition here again. In this case, caching also preserves sparsity. If no active atom satisfies the condition, the oracle can be solved, e.g., by means of a call to a linear optimization oracle; see Braun et al. [2017] for an in-depth discussion. Lastly, we would like to briefly note that the parameter κ can be used to further promote positive calls over negative calls, by weakening the improvement requirement and therefore speeding up the oracle. Indeed, only negative calls need a full scan of the dictionary.
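The caching strategy just described can be sketched as follows over a finite symmetrized dictionary: cached atoms are tested first, and a full scan of the dictionary is needed only for a negative call. This is our own illustration under assumed names and data layout, not the authors' code:

```python
import numpy as np

def lp_sep(grad, phi, kappa, D, cache):
    """Weak-separation oracle LPsep_D(grad, phi, kappa), with phi <= 0 and kappa >= 1.

    Returns (i, s) encoding the atom s * D[:, i] with <grad, s*D[:, i]> <= phi/kappa
    (positive call), or None, certifying <grad, z> > phi for all z in conv(D^+-)
    (negative call). `cache` holds atoms returned by previous calls.
    """
    target = phi / kappa
    for (i, s) in cache:                      # cheap pass: previously returned atoms first
        if s * float(D[:, i] @ grad) <= target:
            return (i, s)
    scores = D.T @ grad                       # expensive pass: full dictionary scan
    i = int(np.argmax(np.abs(scores)))
    s = -1.0 if scores[i] > 0 else 1.0        # sign making <grad, s*D[:, i]> = -|scores[i]|
    if s * scores[i] <= target:
        cache.append((i, s))
        return (i, s)
    return None                               # negative call: dual gap estimate improves
```

Because κ ≥ 1 only weakens the requirement (φ/κ ≥ φ for φ ≤ 0), a cached atom often suffices, and the expensive scan is paid only when the oracle must certify separation.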
3 The Blended Matching Pursuit algorithm

We now present our Blended Matching Pursuit algorithm (BMP) in Algorithm 3. Note that although we blend steps, we maintain the explicit decomposition of the iterates x_t = ∑_{j=1}^{n_t} λ_{t,i_j} a_{i_j} as linear combinations of the atoms.

Algorithm 3 Blended Matching Pursuit (BMP)
Input: Start atom x_0 ∈ D, parameters η > 0, κ ≥ 1, and τ > 1, number of iterations T ∈ ℕ.
Output: Iterates x_1, ..., x_T ∈ span(D).
1: S_0 ← {x_0}
2: φ_0 ← min_{v∈D^±} ⟨∇f(x_0), v⟩ / τ
3: for t = 0 to T − 1 do
4:   v_t^{FW-S} ← argmin_{v∈S_t^±} ⟨∇f(x_t), v⟩
5:   if ⟨∇f(x_t), v_t^{FW-S}⟩ ≤ φ_t/η then
6:     ∇̃f(x_t) ← proj_{span(S_t)}(∇f(x_t))
7:     x_{t+1} ← argmin_{x ∈ x_t + ℝ∇̃f(x_t)} f(x) {constrained step}
8:     S_{t+1} ← S_t
9:     φ_{t+1} ← φ_t
10:  else
11:    v_t ← LPsep_D(∇f(x_t), φ_t, κ)
12:    if v_t = false then
13:      x_{t+1} ← x_t {dual step}
14:      S_{t+1} ← S_t
15:      φ_{t+1} ← φ_t/τ
16:    else
17:      x_{t+1} ← argmin_{x ∈ x_t + ℝv_t} f(x) {full step}
18:      S_{t+1} ← S_t ∪ {v_t}
19:      φ_{t+1} ← φ_t
20:    end if
21:  end if
22:  Optional: Correct S_{t+1}
23: end for

Remark 3.1 (Algorithm design). BMP actually does not require the atoms to have exactly the same norm and only needs the dictionary to be bounded, whether it be for ensuring the convergence rates or for computations; one could further take advantage of this to add weights to certain atoms. Line 6 is simply taking the component of ∇f(x_t) parallel to span(S_t), which can be achieved by basic linear algebra and costs O(n · card(S_t)²) when H = ℝⁿ. The line searches in Lines 7 and 17 can be replaced with explicit step sizes using the smoothness of f (see Fact B.2 in the Appendix). The purpose of (the optional) Line 22 is to reoptimize the active set S_{t+1}, e.g., by reducing it to a subset that forms a basis for its linear span. One could also obtain further sparsity by removing atoms whose coefficient in the decomposition of the iterate is smaller than some threshold δ > 0.

Blending. BMP aims at unifying the best of GMP and OMP. As seen in Section 2.2, an OMP iteration is a sequence of projected gradient (PG) steps.
The idea is that the sequence of PG steps constituting an OMP iteration is actually overkill: there is a sweet spot where further optimizing over span(S_t) is less effective than adding a new atom and taking a GMP step into a (possibly) new space. However, PG steps have the benefit of preserving sparsity, since no new atom is added. Furthermore, GMP steps require an expensive scan of the dictionary to output the descent direction v_t^FW ∈ argmin_{v∈D^±} ⟨∇f(x_t), v⟩. To remedy this, BMP blends constrained steps (PG steps, Line 7) with full steps (lazified GMP steps, Line 17): it favors constrained steps as long as their progress in function value is comparable to that of a GMP step, and otherwise takes a full step in an approximate direction v_t (computed cheaply via Oracle 2) whose progress is comparable to that of a GMP step. Therefore, to monitor this blending of steps, we wish to compare ⟨∇f(x_t), v_t^{FW-S}⟩ and ⟨∇f(x_t), v_t⟩ to ⟨∇f(x_t), v_t^FW⟩, which quantities measure the progress in function value offered by a constrained step, a full step, and a GMP step respectively (see proofs in the Appendix).

Dual gap estimates. The aforementioned comparisons however cannot be made directly, as the quantity ⟨∇f(x_t), v_t^FW⟩ is (deliberately) not computed; computing it requires an expensive complete scan of the dictionary. Instead, we use an estimate of this quantity, the dual gap estimate |φ_t|. This designation comes from the fact that −⟨∇f(x_t), v_t^FW⟩ is our equivalent of the duality gap from the constrained setting (see, e.g., Jaggi [2013]), and this will guide how we build our estimate. Indeed, since D^± is symmetric and assuming 0 ∈ int(conv(D^±)), there exists an (unknown) ρ > 0 such that {x_0, ..., x_T} ∪ argmin_H f ⊆ ρ conv(D^±). Then for all x* ∈ argmin_H f,

ε_t := f(x_t) − f(x*) ≤ ⟨∇f(x_t), x_t − x*⟩ ≤ max_{u,v ∈ ρ conv(D^±)} ⟨∇f(x_t), u − v⟩ = −2ρ ⟨∇f(x_t), v_t^FW⟩,  (2)

which is our desired inequality. We set φ_0 ← ⟨∇f(x_0), v_0^FW⟩/τ (Line 2), so ε_0 ≤ 2τρ|φ_0| by (2).
The criterion in Line 5 compares ⟨∇f(x_t), v_t^{FW-S}⟩ to φ_t/η. If this quantity is above the threshold φ_t/η, i.e., its magnitude is below |φ_t|/η, then a constrained step is not taken and the weak-separation oracle (Line 11, Oracle 2) is called to search for an atom v_t satisfying ⟨∇f(x_t), v_t⟩ ≤ φ_t/κ. If the oracle cannot find such an atom, then a full step is not taken and it returns a negative call with the certificate ⟨∇f(x_t), v_t^FW⟩ > φ_t. In this case, BMP has detected an improved dual gap estimate and takes a dual step (Line 13): by (2), this implies that ε_t ≤ 2ρ|φ_t|, so with φ_{t+1} ← φ_t/τ and x_{t+1} ← x_t, we recover ε_{t+1} ≤ 2τρ|φ_{t+1}|. Furthermore, observe that this update is a geometric rescaling which ensures that BMP requires only N_dual = O(ln(1/ε)) dual steps (see proofs). Thus, the total number of negative calls, i.e., the number of iterations requiring a complete scan of the dictionary, is only O(ln(1/ε)). Therefore, for this and for the blending of steps, the dual gap estimates |φ_t| are the key to the speed-up realized by BMP.

Parameters. BMP involves three (hyper-)parameters η > 0, κ ≥ 1, and τ > 1 to be set before running the algorithm. The parameter η needs to be tuned carefully, as its value affects the criterion in Line 5 to promote either speed of convergence (e.g., η ≈ 0.1, promoting full steps) or sparsity of the iterates (e.g., η ≈ 1000, promoting constrained steps). In our experiments (see Section 4 and the Appendix), we found that setting η ≈ 5 leads to close to both maximal speed of convergence and sparsity of the iterates, with the default choices κ = τ = 2. In this setting, BMP converges (much) faster than GMP and has iterates with sparsity very comparable to that of OMP, and therefore it is possible to enjoy both properties of speed and sparsity simultaneously. Note that the value of κ also impacts the range of values of η to which BMP is sensitive, since the criterion (Line 5) tests min_{v∈S_t^±} ⟨∇f(x_t), v⟩ ≤ φ_t/η while the weak-separation oracle asks for v ∈ D^± such that ⟨∇f(x_t), v⟩ ≤ φ_t/κ.
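Putting the pieces together, the step logic of Algorithm 3 can be sketched compactly for the quadratic f(x) = ∥y − x∥²/2 over the columns of a finite dictionary, where exact line search has a closed form. This is our own simplified rendering for illustration (the inlined full-scan oracle, the names, and the defaults η = 5, κ = τ = 2 are assumptions in line with the text), not the authors' implementation:

```python
import numpy as np

def bmp(y, D, T, eta=5.0, kappa=2.0, tau=2.0):
    """Sketch of BMP for f(x) = ||y - x||^2 / 2; D has unit-norm columns."""
    x = D[:, 0].copy()                          # start atom x_0
    S = [0]                                     # active set (column indices)
    phi = -np.max(np.abs(D.T @ (x - y))) / tau  # phi_0 = min_{v in D^+-} <grad, v> / tau
    for _ in range(T):
        g = x - y                               # gradient at x_t
        if -np.max(np.abs(D[:, S].T @ g)) <= phi / eta:
            # constrained step: move along the component of g parallel to span(S_t);
            # for this quadratic, the exact line search removes that component.
            DS = D[:, S]
            coeffs, *_ = np.linalg.lstsq(DS, g, rcond=None)
            x = x - DS @ coeffs
        else:
            scores = D.T @ g                    # weak separation via a full scan here
            i = int(np.argmax(np.abs(scores)))
            if -np.abs(scores[i]) <= phi / kappa:
                x = x - scores[i] * D[:, i]     # full step with exact step size
                if i not in S:
                    S.append(i)
            else:
                phi = phi / tau                 # dual step: tighten |phi_t|
    return x, S
```

In a real implementation, the full scan in the else-branch would be replaced by a call to LPsep_D with caching, and the two closed-form steps by line searches or smoothness-based step sizes for general smooth convex f (Remark 3.1).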
In specific experiments, parameter tuning might further improve performance.

3.1 Convergence analyses

We start with the simpler case of smooth convex functions of order ℓ > 1 (Theorem 3.2). Our main result is Theorem 3.3, which subsumes the case of strongly convex functions. To establish the convergence rates of GMP and OMP, Locatello et al. [2017] assume knowledge of an upper bound on sup{∥x*∥_D, ∥x_0∥_D, ..., ∥x_T∥_D}, where ∥·∥_D : x ∈ H ↦ inf{ρ > 0 | x ∈ ρ conv(D^±)} is the atomic norm. In Locatello et al. [2018], this is resolved by working with the atomic norm ∥·∥_D instead of the Hilbert space induced norm to, e.g., define smoothness and strong convexity of f and derive the proofs, but ∥·∥_D itself can be difficult to derive in many applications. In contrast, we need neither the finiteness assumption nor a change of norm; however, we assume f to be coercive to ensure feasibility of Problem (1), a reasonably mild assumption.

Theorem 3.2 (Smooth convex case). Let D ⊆ H be a dictionary such that 0 ∈ int(conv(D^±)) and let f : H → ℝ be smooth of order ℓ > 1, convex, and coercive. Then the Blended Matching Pursuit algorithm (Algorithm 3) ensures that f(x_t) − min_H f ≤ ε for all t ≥ T, where T = O((L/ε)^{1/(ℓ−1)}).

We now present our main result in its full generality. We provide the general convergence rates of BMP (Algorithm 3) in Theorem 3.3. Recall that sharpness is implied by strong convexity and that it is a very mild assumption in finite-dimensional spaces, as it is satisfied by all well-behaved convex functions (Lemma 2.3).

Theorem 3.3 (Smooth convex sharp case). Let D ⊆ H be a dictionary such that 0 ∈ int(conv(D^±)) and let f : H → ℝ be L-smooth of order ℓ > 1, convex, coercive, and C-sharp of order θ ∈ ]0, 1/ℓ] on K. Then the Blended Matching Pursuit algorithm (Algorithm 3) ensures that f(x_t) − min_H f ≤ ε for all t ≥ T, where T = O(C^{1/(1−θ)} L^{1/(ℓ−1)} ln(C|φ_0|/ε)) if ℓθ = 1, and T = O(C^{1/(1−θ)} L^{1/(ℓ−1)} (1/ε)^{(1−ℓθ)/(ℓ−1)}) if ℓθ < 1 (cf. Table 1). Moreover, dist(x_t, argmin_H f) → 0 as t → +∞ at the same rate.

If f is not strongly convex, then Locatello et al.
[2017] only guarantee a sublinear convergence rate O(1/ε) for GMP and OMP, while Theorem 3.3 can still guarantee faster convergence rates, up to linear convergence O(ln(1/ε)) if ℓθ = 1, using sharpness. Note that in the popular case of smooth strongly convex functions of orders ℓ = 2 and s = 2, Theorem 3.3 guarantees a linear convergence rate, as these functions are sharp of order θ = 1/2 (with constant C = √(2/S)) and thus satisfy ℓθ = 1. For completeness, we also study this special case in Appendix C, with a simpler proof. In conclusion, Theorem 3.3 extends linear convergence rates to a large class of non-strongly convex functions solving Problem (1).

Remark 3.4 (Optimality of the convergence rates). Let n ≤ +∞ be the dimension of H. Nemirovskii and Nesterov [1985] provided unimprovable rates when solving Problem (1) in different cases. These optimal rates are reported in Table 1, where we compare them to those of BMP proved in this paper (Theorems 3.2 and 3.3). The third column gives the lower bounds on complexity stated in Nemirovskii and Nesterov [1985, Equations (1.20) and (1.21)]. Note that our rates are dimension-independent and hold globally across iterations. It remains an open question to determine whether the gap in the exponent can be closed by accelerating BMP.

Table 1: Comparison of the rates of BMP vs. the lower bounds on complexity.
| Properties of f | BMP rate | Lower bound on complexity |
| --- | --- | --- |
| Smooth convex | T(ε) = O(1/ε^{1/(ℓ−1)}) | T(ε) = Ω(min{n, 1/ε^{1/(1.5ℓ−1)}}) |
| Smooth convex sharp with ℓ = 2, θ = 1/2 | T(ε) = O(ln(1/ε)) | T(ε) = Ω(min{n, ln(1/ε)}) |
| Smooth convex sharp with ℓθ < 1 | T(ε) = O(1/ε^{(1−ℓθ)/(ℓ−1)}) | T(ε) = Ω(min{n, 1/ε^{(1−ℓθ)/(1.5ℓ−1)}}) |

4 Computational experiments

We implemented BMP in Python 3 along with GMP and OMP [Locatello et al., 2017], the Accelerated Matching Pursuit algorithm (accMP) [Locatello et al., 2018], and the Blended Conditional Gradients (BCG) [Braun et al., 2019] and Conditional Gradient with Enhancement and Truncation (CoGEnT) [Rao et al., 2015] algorithms for completeness. All algorithms share the same code framework to ensure a fair comparison. No enhancement beyond basic coding was performed. We ran the experiments on a laptop under Linux Ubuntu 18.04 with an Intel Core i7 3.5GHz CPU and 8GB RAM. The random data are drawn from Gaussian distributions. For GMP, OMP, BCG, and CoGEnT, we represented the dual gaps by −min_{v∈D^±} ⟨∇f(x_t), v⟩, yielding a zig-zag plot dissimilar to the stair-like plot of the dual gap estimates |φ_t| of BMP. The Appendix contains additional experiments.

4.1 Comparison of BMP vs. GMP, OMP, BCG, and CoGEnT

Let H be the Euclidean space (ℝⁿ, ⟨·,·⟩) and D be the set of signed canonical vectors {±e_1, ..., ±e_n}. Suppose we want to learn the (sparse) source x* from observed data y := Ax* + w, where A ∈ ℝ^{m×n} and where w ∼ N(0, σ²I_m) is the noise in the observed y. The general and most intuitive formulation of the problem is min_{x∈ℝⁿ} ∥y − Ax∥²₂ s.t. ∥x∥₀ ≤ ∥x*∥₀ =: s, but the ℓ₀-pseudo-norm constraint is nonconvex and makes the problem NP-hard and therefore intractable in many situations [Natarajan, 1995].
To remedy this, one can handle the sparsity constraint in various ways: either by completely removing it and relying on an algorithm inherently promoting sparsity (as done in BMP, GMP, and OMP), or through a convex relaxation of the constraint, often via the ℓ₁-norm, and then solving the new constrained convex problem min_{x∈ℝⁿ} ∥y − Ax∥²₂ s.t. ∥x∥₁ ≤ ∥x*∥₁ (as done in BCG and CoGEnT). We ran a comparison of these methods, where we favorably provided the constraint ∥x∥₁ ≤ ∥x*∥₁ for BCG and CoGEnT although x* is unknown. We set m = 500, n = 2000, s = 100, and σ = 0.05. In BMP, we set κ = τ = 2 and we chose η = 5; see Appendix A.1 for an in-depth sensitivity analysis of BMP with respect to η. We did not perform any additional correction of the active sets (Line 22). Note that [Rao et al., 2015, Table III] demonstrated the superiority of CoGEnT over CoSaMP [Needell and Tropp, 2009], Subspace Pursuit [Dai and Milenkovic, 2009], and Gradient Descent with Sparsification [Garg and Khandekar, 2009] on an equivalent experiment, and we therefore do not compare to those methods.

Figure 1: Comparison of BMP vs. GMP, OMP, BCG, and CoGEnT, with η = 5.

Figure 1 shows that BMP is the fastest algorithm in wall-clock time and has close-to-optimal sparsity. It is important to stress that, unlike BCG and CoGEnT, BMP achieves this while having no explicit sparsity-promoting constraint, regularization, or information on x*. Thus, when ∥x*∥₁ is not provided, which is the case in most applications, BCG and CoGEnT would require hyper-parameter tuning of the sparsity-inducing constraint (or, equivalently, of the Lagrangian penalty parameters), such as the radius of the ℓ₁-ball [Tibshirani, 1996], as used here, or of the trace-norm ball [Fazel et al., 2001]. OMP and CoGEnT converge faster per iteration, as expected, given that they solve a reoptimization problem at each iteration; however, this is very costly and the disadvantage becomes evident in wall-clock performance.
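The synthetic instance described above can be generated along the following lines (a sketch under our own assumptions about the data generation; the text specifies only the dimensions, sparsity, noise level, and Gaussian distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, s, sigma = 500, 2000, 100, 0.05   # dimensions, sparsity, noise level from the text

A = rng.normal(size=(m, n))             # Gaussian sensing matrix
x_star = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
x_star[support] = rng.normal(size=s)    # s-sparse Gaussian source x*
w = rng.normal(scale=sigma, size=m)     # observation noise w ~ N(0, sigma^2 I_m)
y = A @ x_star + w                      # observed data y = A x* + w

# BMP/GMP/OMP then minimize f(x) = ||y - Ax||_2^2 over the span of the signed
# canonical vectors, with no explicit sparsity constraint.
assert np.count_nonzero(x_star) <= s
```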
Note that another obvious choice of algorithm would be projected gradient descent; however, the sparsity it provides is far from sufficient (see Appendix A.2).

Figure 2: Comparison in NMSE of BMP vs. GMP, OMP, BCG, and CoGEnT, with η = 5.

In Figure 2, we compare the Normalized Mean Squared Error (NMSE) of the different methods. The NMSE at iterate x_t is defined as ∥x_t − x*∥²₂ / ∥x*∥²₂. The plots show a rebound occurring once the NMSE reaches 10⁻⁴, which is due to the algorithms overfitting to the noisy measurements y. A post-processing step can mitigate the rebound, via early stopping or by removing atoms whose coefficient in the decomposition of the iterate is smaller than some threshold δ > 0. We used early stopping on a validation set and present the test error ∥y_test − A_test x_T∥²₂ / m_test on a test set in Table 2, where x_T is the solution iterate of each algorithm. For completeness, we also report the results for the Gradient Hard Thresholding Pursuit (GraHTP) and Fast Gradient Hard Thresholding Pursuit (FastGraHTP) algorithms [Yuan et al., 2018], for which we favorably set k = ∥x*∥₀. As expected, GMP performs the worst on the test set because its NMSE does not achieve sufficient convergence (see Figure 2), highlighting the importance of a clean, i.e., sparse, decomposition into the dictionary D.

Table 2: Test error achieved using early stopping on a validation set.

| Algorithm | GMP | OMP | BMP | BCG | CoGEnT | GraHTP | FastGraHTP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Test error | 0.1917 | 0.0036 | 0.0037 | 0.0068 | 0.0043 | 0.0036 | 0.0037 |

The Appendix contains additional experiments on different objective functions: an arbitrarily chosen norm (Appendix A.3), the Huber loss (Appendix A.4), the distance to a convex set (Appendix A.5), and a logistic regression loss (Appendix A.6). The conclusions are identical.

4.2 Comparison of BMP vs. accMP

Locatello et al. [2018] recently provided an Accelerated Matching Pursuit algorithm (accMP) for solving Problem (1).
We implemented the same code as theirs, using the exact same parametrization. The code framework matches the one we used for BMP. We ran BMP on their toy data example and compared the results against accMP (which they labeled "accelerated steepest" in their plot); notice that we recovered their (per-iteration) plot exactly. The experiment is to minimize f : x ∈ ℝ¹⁰⁰ ↦ ∥x − b∥²₂/2 over the linear span of D, where D is a dictionary of 200 randomly chosen atoms in ℝ¹⁰⁰ and b ∈ ℝ¹⁰⁰ is also randomly chosen. The parameters for accMP, kindly provided by the authors of Locatello et al. [2018], were L = 1000 and ν = 1. We report the results in Figure 3.

Figure 3: Comparison of BMP vs. accMP, with η = 3.

We see that BMP outperforms accMP in both speed of convergence and sparsity of the iterates. In fact, in terms of sparsity, accMP needs to use all available atoms to converge while BMP needs only half as many. Furthermore, accMP needs 75% of all available atoms to start converging significantly, while BMP starts to converge immediately. We suspect that this is due to the following: accMP accelerates coordinate descent-like directions, which might be relatively poor approximations of the actual descent direction −∇f(x_t), whereas BMP works directly with (the projection of) ∇f(x_t), achieving much more progress and offsetting the effect of acceleration.

5 Final remarks

We presented a Blended Matching Pursuit algorithm (BMP) which enjoys both fast convergence and sparsity of the iterates. More specifically, we derived linear convergence rates for a large class of non-strongly convex functions solving Problem (1), and we showed that our blending approach outperforms state-of-the-art methods in speed of convergence while achieving close-to-optimal sparsity, and this without requiring sparsity-inducing constraints or regularization.
Although BMP already outperforms the Accelerated Matching Pursuit algorithm [Locatello et al., 2018] in our experiments, we believe it is also amenable to acceleration.

Acknowledgments

Research reported in this paper was partially supported by NSF CAREER award CMMI-1452463.

References

J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.

G. Braun, S. Pokutta, and D. Zink. Lazifying conditional gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 566–575, 2017.

G. Braun, S. Pokutta, D. Tu, and S. Wright. Blended conditional gradients: the unconditioning of conditional gradients. In Proceedings of the 36th International Conference on Machine Learning, pages 735–743, 2019.

E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

S. Chen, S. A. Billings, and W. Luo. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50(5):1873–1896, 1989.

L. Condat. Fast projection onto the simplex and the ℓ₁ ball. Mathematical Programming, 158(1):575–585, 2016.

W. Dai and O. Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230–2249, 2009.

M. A. Davenport and M. B. Wakin. Analysis of orthogonal matching pursuit using the restricted isometry property. IEEE Transactions on Information Theory, 56(9):4395–4401, 2010.

G. Davis, S. Mallat, and Z. Zhang. Adaptive time-frequency decompositions with matching pursuits. Optical Engineering, 33(7):2183–2191, 1994.

M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, pages 4734–4739, 2001.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

R. Garg and R. Khandekar. Gradient descent with sparsification: An iterative algorithm for sparse recovery with restricted isometry property. In Proceedings of the 26th International Conference on Machine Learning, pages 337–344, 2009.

R. Gribonval and P. Vandergheynst. On the exponential convergence of matching pursuits in quasi-incoherent dictionaries. IEEE Transactions on Information Theory, 52(1):255–261, 2006.

I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems 17, pages 545–552, 2005.

P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1):73–101, 1964.

M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427–435, 2013.

T. Kerdreux, A. d'Aspremont, and S. Pokutta. Restarting Frank-Wolfe. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1275–1283, 2019.

G. Lan, S. Pokutta, Y. Zhou, and D. Zink. Conditional accelerated lazy stochastic gradient descent. In Proceedings of the 34th International Conference on Machine Learning, pages 1965–1974, 2017.

E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.

F. Locatello, R. Khanna, M. Tschannen, and M. Jaggi. A unified optimization view on generalized matching pursuit and Frank-Wolfe. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 860–868, 2017.

F. Locatello, A. Raj, S. P. Karimireddy, G. Rätsch, B. Schölkopf, S. U. Stich, and M. Jaggi. On matching pursuit and coordinate descent. In Proceedings of the 35th International Conference on Machine Learning, pages 3198–3207, 2018.

S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles, 117, pages 87–89. Colloques Internationaux du CNRS, 1963.

S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

A. S. Nemirovskii and Y. E. Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics, 25(2):21–30, 1985.

H. Nguyen and G. Petrova. Greedy strategies for convex optimization. Calcolo, 54(1):207–224, 2017.

Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Asilomar Conference on Signals, Systems, and Computers, pages 40–44, 1993.

N. Rao, S. Shah, and S. Wright. Forward-backward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing, 63(21):5798–5811, 2015.

V. Roulet and A. d'Aspremont. Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems 30, pages 1119–1129, 2017.

S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807–2832, 2010.

V. Temlyakov. Chebushev greedy algorithm in convex optimization. arXiv preprint arXiv:1312.1244, 2013.

V. Temlyakov. Greedy algorithms in convex optimization on Banach spaces. In Proceedings of the 48th Asilomar Conference on Signals, Systems, and Computers, pages 1331–1335, 2014.

V. Temlyakov. Greedy approximation in convex optimization. Constructive Approximation, 41(2):269–296, 2015.

R.
Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267 288, 1996. R. J. Tibshirani. A general framework for fast stagewise algorithms. Journal of Machine Learning Research, 16 (1):2543 2588, 2015. J. A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231 2242, 2004. Q. Yao and J. T. Kwok. Greedy learning of generalized low-rank model. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pages 2294 2300, 2016. X.-T. Yuan, P. Li, and T. Zhang. Gradient hard thresholding pursuit. Journal of Machine Learning Research, 18 (166):1 43, 2018. T. Zhang. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10:555 568, 2009.