# Local Identification of Overcomplete Dictionaries

Journal of Machine Learning Research 16 (2015) 1211-1242. Submitted 1/14; Revised 7/14; Published 6/15.

Karin Schnass, karin.schnass@uibk.ac.at, Department of Mathematics, University of Innsbruck, Technikerstraße 13, 6020 Innsbruck, Austria.

Editor: Shie Mannor

This paper presents the first theoretical results showing that stable identification of overcomplete $\mu$-coherent dictionaries $\Phi \in \mathbb{R}^{d \times K}$ is locally possible from training signals with sparsity levels $S$ up to the order $O(\mu^{-2})$ and signal to noise ratios up to $O(\sqrt{d})$. In particular the dictionary is recoverable as the local maximum of a new maximization criterion that generalizes the K-means criterion. For this maximization criterion results for asymptotic exact recovery for sparsity levels up to $O(\mu^{-1})$ and stable recovery for sparsity levels up to $O(\mu^{-2})$ as well as signal to noise ratios up to $O(\sqrt{d})$ are provided. These asymptotic results translate to finite sample size recovery results with high probability as long as the sample size $N$ scales as $O(K^3 d S\, \varepsilon^{-2})$, where the recovery precision $\varepsilon$ can go down to the asymptotically achievable precision. Further, to actually find the local maxima of the new criterion, a very simple Iterative Thresholding and K (signed) Means algorithm (ITKM), which has complexity $O(dKN)$ in each iteration, is presented and its local efficiency is demonstrated in several experiments.

Keywords: dictionary learning, dictionary identification, sparse coding, sparse component analysis, vector quantization, K-means, finite sample size, sample complexity, maximization criterion, sparse representation

1. Introduction

Be it the 300 million photos uploaded to Facebook per day, the 800 GB the Large Hadron Collider records per second or the 320,000 GB per second it cannot record, it is clear that we have reached the age of big data. Indeed, in 2012 the amount of data existing worldwide is estimated to have reached 2.8 ZB, that is 2,800 billion GB, and while 23% of these data are expected to be useful if analyzed, only 1% actually are analyzed. So how do we deal with this big data challenge? The key concept that has driven data processing and data analysis in the past decade is that even high-dimensional data has intrinsically low complexity, meaning that every data point $y$ can be represented as a linear combination of a sparse (small) number of elements or atoms $\phi_i \in \mathbb{R}^d$, $\|\phi_i\|_2 = 1$, of an overcomplete dictionary $\Phi = (\phi_1, \ldots, \phi_K)$, that is,
$$y \approx \Phi_I x_I = \sum_{i \in I} x(i)\,\phi_i,$$
for an index set $I$ of size $|I| = S$, which is small compared to the ambient dimension, $S \ll d \leq K$. These sparse components do not only describe the data, but the representations can also be used for a myriad of efficient sparsity based data processing schemes, ranging from denoising (Donoho et al., 2006) to compressed sensing (Donoho, 2006; Candès et al., 2006). Therefore a promising tool both for data analysis and data processing, which has emerged in the last years, is dictionary learning, also known as sparse coding or sparse component analysis. Dictionary learning addresses the fundamental question of how to automatically learn a dictionary providing sparse representations for a given data class. That is, given $N$ signals $y_n \in \mathbb{R}^d$, stored as columns in a matrix $Y = (y_1, \ldots, y_N)$, find a decomposition $Y \approx \Phi X$ into a $d \times K$ dictionary matrix $\Phi$ with unit norm columns and a $K \times N$ coefficient matrix $X$ with sparse columns.
Until recently the main research focus in dictionary learning has been on the development of algorithms. Thus by now there is an ample choice of learning algorithms that perform well in experiments and are popular in applications (Field and Olshausen, 1996; Kreutz-Delgado and Rao, 2000; Kreutz-Delgado et al., 2003; Aharon et al., 2006; Yaghoobi et al., 2009; Mairal et al., 2010; Skretting and Engan, 2010; Rubinstein et al., 2010). However, slowly the interest is shifting and researchers are starting to investigate also the theoretical aspects of dictionary learning. Following the first theoretical insights, originating in the blind source separation community (Zibulevsky and Pearlmutter, 2001; Georgiev et al., 2005), there is now a set of generalization bounds predicting how well a learned dictionary can be expected to sparsely approximate future data (Maurer and Pontil, 2010; Vainsencher et al., 2011; Mehta and Gray, 2012; Gribonval et al., 2013). These results give a theoretical foundation for dictionary learning as a data processing tool, for example for compression, but unfortunately do not guarantee that an efficient algorithm will find/recover a good dictionary provided that it exists. However, in order to justify the use of dictionary learning as a data analysis tool, for instance in blind source separation, it is important to provide conditions under which an algorithm or scheme can identify the dictionary from a finite number of training signals, that is, the sources from the mixtures. Following the first dictionary identification results for the $\ell_1$-minimization principle (Gribonval and Schnass, 2010; Geng et al., 2011; Jenatton et al., 2014), which was suggested by Zibulevsky and Pearlmutter (2001) and Plumbley (2007), and for the ER-SpUD algorithm for learning a basis (Spielman et al., 2012), 2013 has seen a number of interesting developments. First it was shown that the K-SVD minimization principle suggested by Aharon et al. (2006) can locally identify overcomplete tight dictionaries (Schnass, 2014). Later, algorithms with global identification guarantees for coherent dictionaries were presented (Arora et al., 2014; Agarwal et al., 2014b). Finally it was shown that an alternating minimization method is locally convergent to the correct generating dictionary (Agarwal et al., 2014a). One aspect that all these results have in common is that the sparsity level of the training signals required for successful identification is of order $O(\mu^{-1})$, or $O(\sqrt{d})$ for incoherent dictionaries. Considering that on average sparse recovery in a given dictionary is successful for sparsity levels $O(\mu^{-2})$ (Tropp, 2008; Schnass and Vandergheynst, 2007), and that for dictionary learning we usually have a lot of training signals at our disposal, the same sparsity level should be sufficient for dictionary learning. Indeed, in this paper we provide the first indication that global dictionary identification could be possible for sparsity levels $O(\mu^{-2})$ by proving that it is locally possible. Further we show in experiments that a very simple iterative algorithm, based on thresholding and K signed means, is locally successful. The paper is organized as follows. After introducing all necessary notation in Section 2, we present a new optimization criterion, motivated by the analysis of the K-SVD principle (Schnass, 2014), in Section 3.
In Section 4 we give asymptotic identification results both for exact and stable recovery, which in Section 5 are extended to results for finite sample sizes. Section 6 provides an algorithm for actually finding a local optimum and some experiments confirming the theoretical results. Finally, in the last section, we compare the results for the new criterion to existing identification results, discuss the implications of these local results for global dictionary identification algorithms and point out directions for future research.

2. Notations and Conventions

Before we jump into the fray, we collect some definitions and lose a few words on notations; usually subscripted letters will denote vectors, with the exception of $c$ and $\varepsilon$ where they are numbers, e.g., $(x_1, \ldots, x_K) = X \in \mathbb{R}^{d \times K}$ vs. $c = (c_1, \ldots, c_K) \in \mathbb{R}^K$; however, it should always be clear from the context what we are dealing with.

We consider a dictionary $\Phi$ a collection of $K$ unit norm vectors $\phi_i \in \mathbb{R}^d$, $\|\phi_i\|_2 = 1$. By abuse of notation we will also refer to the $d \times K$ matrix collecting the atoms as its columns as the dictionary, that is $\Phi = (\phi_1, \ldots, \phi_K)$. The maximal absolute inner product between two different atoms is called the coherence $\mu$ of the dictionary, $\mu = \max_{i \neq j} |\langle \phi_i, \phi_j \rangle|$. By $\Phi_I$ we denote the restriction of the dictionary to the atoms indexed by $I$, that is $\Phi_I = (\phi_{i_1}, \ldots, \phi_{i_S})$, $i_j \in I$. We indicate the conjugate transpose of a matrix with a $\star$, for example $\Phi^\star$ would be the transpose of $\Phi$. The set of all dictionaries of a given size ($d \times K$) is denoted by $\mathcal{D}$. For two dictionaries $\Phi, \Psi \in \mathcal{D}$ we define the distance between each other as the maximal distance between two corresponding atoms,
$$d(\Phi, \Psi) := \max_i \|\phi_i - \psi_i\|_2.$$
We consider a frame $F$ a collection of $K \geq d$ vectors $f_i \in \mathbb{R}^d$ for which there exist two positive frame constants $A, B$ such that for all $v \in \mathbb{R}^d$ we have
$$A\,\|v\|_2^2 \leq \sum_{i=1}^K |\langle f_i, v \rangle|^2 \leq B\,\|v\|_2^2. \qquad (1)$$
From (1) it follows that $F$, interpreted as a $d \times K$ matrix, has rank $d$ and that its non-zero singular values are contained in the interval $[\sqrt{A}, \sqrt{B}]$. If $B$ can be chosen equal to $A$, that is $B = A$, the frame is called tight. If all frame elements $f_i$ have unit norm, we call $F$ a unit norm frame. For more details on frames see for instance the introduction by Christensen (2003). Finally we introduce the Landau symbols $O, o$ to characterise the growth of a function. We write $f(t) = O(g(t))$ if $\lim_{t \to 0} f(t)/g(t) = C < \infty$ and $f(t) = o(g(t))$ if $\lim_{t \to 0} f(t)/g(t) = 0$.

3. A Response Maximization Criterion

One of the origins of dictionary learning can be found in the field of vector quantization, where the aim is to find a codebook (dictionary) such that the codewords (atoms) closely represent the data, that is
$$\min_{\Phi, X} \|Y - \Phi X\|_F^2 \quad \text{s.t.} \quad x_n \in \{e_i\}_i.$$
Indeed the vector quantization problem can be seen as an extreme case of dictionary learning, where we do not only want all our signals to be approximately 1-sparse but also want the single non-zero coefficient to be equal to one. On the other hand we allow the atoms (codewords) to have any length. The problem above is usually solved by a K-means algorithm, which alternately separates the training data into $K$ clusters, each assigned to one codeword, and then updates each codeword to be the mean of the associated training signals. For more detailed information about vector quantization or the K-means algorithm see for instance the book by Gersho and Gray (1992) or the introduction by Aharon et al. (2006).
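To make the notions of Section 2 concrete, the following small sketch (Python/NumPy, with purely illustrative dimensions) computes the coherence $\mu$, the frame constants $A, B$ and the distance $d(\Phi, \Psi)$ exactly as defined above.

```python
import numpy as np

def coherence(Phi):
    # maximal absolute inner product between two different atoms
    G = np.abs(Phi.T @ Phi)
    np.fill_diagonal(G, 0)
    return G.max()

def frame_bounds(Phi):
    # A and B are the smallest and largest squared (non-zero) singular values
    s = np.linalg.svd(Phi, compute_uv=False)
    return s[-1] ** 2, s[0] ** 2

def dictionary_distance(Phi, Psi):
    # maximal distance between corresponding atoms
    return np.linalg.norm(Phi - Psi, axis=0).max()

# illustrative example: d = 16, K = 32, random unit norm atoms
rng = np.random.default_rng(0)
Phi = rng.standard_normal((16, 32))
Phi /= np.linalg.norm(Phi, axis=0)
A, B = frame_bounds(Phi)
print(coherence(Phi), A, B)
```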
If we relax the condition that each coefficient has to be positive, but in turn ask for the atoms to have unit norm, we are already getting closer to the concept of 1-sparse dictionary learning,
$$\min_{\Phi \in \mathcal{D}, X} \|Y - \Phi X\|_F^2 \quad \text{s.t.} \quad x_n \in \{\pm e_i\}_i.$$
This minimization problem can be rewritten as
$$\min_{\Phi \in \mathcal{D}} \sum_n \min_{i, \sigma_i = \pm 1} \|y_n - \sigma_i \phi_i\|_2^2 = \min_{\Phi \in \mathcal{D}} \sum_n \min_{i, \sigma_i = \pm 1} \big( \|y_n\|_2^2 - 2\sigma_i \langle y_n, \phi_i \rangle + \|\phi_i\|_2^2 \big) = \|Y\|_F^2 + N - 2 \max_{\Phi \in \mathcal{D}} \sum_n \max_i |\langle y_n, \phi_i \rangle|,$$
and is therefore equivalent to the maximization problem
$$\max_{\Phi \in \mathcal{D}} \sum_n \max_i |\langle y_n, \phi_i \rangle|. \qquad (2)$$
A local maximum of (2) can be found with a signed K-means algorithm, which assigns each training signal to the atom of the current dictionary giving the largest response in absolute value and then updates each atom as the normalized signed mean of the associated training signals, see Section 6 for more details.

The question now is how to go from these 1-sparse dictionary learning formulations to $S$-sparse formulations with $S > 1$. The most common generalization, which provides the starting point for the MOD and the K-SVD algorithms, is to give up all constraints on the coefficients except for $S$-sparsity and to minimize,
$$(\mathrm{P_{P2}}) \qquad \min_{\Phi \in \mathcal{D}, X} \|Y - \Phi X\|_F^2 \quad \text{s.t.} \quad \|x_n\|_0 \leq S. \qquad (3)$$
However, rewriting the problem we see that this formulation does not reduce to the same maximization problem in case $S = 1$. Then the best one term approximation in the given dictionary is simply the largest projection onto one atom and we have
$$\min_{\Phi \in \mathcal{D}} \sum_n \min_{i, x_i} \|y_n - x_i \phi_i\|_2^2 = \min_{\Phi \in \mathcal{D}} \sum_n \min_i \|y_n - \langle \phi_i, y_n \rangle \phi_i\|_2^2 = \|Y\|_F^2 - \max_{\Phi \in \mathcal{D}} \sum_n \max_i |\langle y_n, \phi_i \rangle|^2,$$
leading instead to the maximization problem
$$\max_{\Phi \in \mathcal{D}} \sum_n \max_i |\langle y_n, \phi_i \rangle|^2 \quad \text{vs.} \quad \max_{\Phi \in \mathcal{D}} \sum_n \max_i |\langle y_n, \phi_i \rangle|.$$
A local maximum can now be found using the same partitioning strategy as before, but updating each atom as the largest singular vector rather than the signed mean of the associated training signals, requiring K SVDs as opposed to K means.

While the minimization problem (3) is certainly the most effective generalization for dictionary learning when the goal is compression, it brings with it some complications when used as an analysis tool. Indeed it has been shown (Schnass, 2014) that for $S = 1$ the K-SVD criterion (3) can only identify the underlying dictionary from sparse random mixtures to arbitrary precision (given enough training samples) if this dictionary is tight, and it is conjectured that the same holds for $S \geq 1$. Roughly simplified, the reason for this is that for random sparse signals $\Phi_I x_I$ and an $\varepsilon$-perturbation $\Psi$ of the generating dictionary the average of the largest squared response decreases like $1 - \varepsilon^2$, up to higher order terms and a dictionary dependent contribution $c(\Psi)$. If $\Phi$ is tight the term $c(\Psi)$ is constant over all dictionaries and therefore there is a local maximum at $\Phi$. From the analogous expansion we also see that the average of the largest absolute response again decreases like $1 - \varepsilon^2$, but now without the contribution $c(\Psi)$ being able to spoil the argument, meaning that we should have a maximum at $\Phi$ also if it is non-tight. This suggests as an alternative way to generalize the K-means optimization principle for dictionary identification to simply maximize the sum of the absolute values of the $S$ largest responses,
$$(\mathrm{P_{R1}}) \qquad \max_{\Psi \in \mathcal{D}} \sum_n \max_{|I|=S} \|\Psi_I^\star y_n\|_1. \qquad (4)$$
Unlike for the K-SVD criterion it is not obvious that there should be a local optimum of (4) at $\Phi$ even if all signals $y_n$ are perfectly $S$-sparse in $\Phi$. Therefore it is quite intriguing that we will not only be able to prove local identifiability of any generating dictionary via (4) from randomly sparse signals, but also that these identifiability properties are stable under coherence and noise.
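As an illustration of the new criterion, the following sketch (Python/NumPy, with purely illustrative sizes) evaluates the objective of (4) for a given dictionary: for each signal it keeps the $S$ largest responses in absolute value, which is exactly a thresholding operation, and sums their absolute values.

```python
import numpy as np

def s_response(Psi, Y, S):
    # absolute responses of all atoms to all signals, K x N
    R = np.abs(Psi.T @ Y)
    # for each signal keep the S largest absolute responses and sum them
    topS = np.partition(R, -S, axis=0)[-S:, :]
    return topS.sum()

# toy example with a random unit norm dictionary and unit norm signals
rng = np.random.default_rng(1)
d, K, N, S = 16, 32, 500, 4
Psi = rng.standard_normal((d, K)); Psi /= np.linalg.norm(Psi, axis=0)
Y = rng.standard_normal((d, N));   Y /= np.linalg.norm(Y, axis=0)
print(s_response(Psi, Y, S))
```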
However, before we get to the main result in Theorem 5, we first have to lay the foundation, by providing suitable random signal models and by studying the asymptotic identifiability properties of the new principle.

4. Asymptotic Results

We can get to the asymptotic version of the $S$-response maximization principle in (4) simply by replacing the sum over the training signals with the expectation, leading to
$$\max_{\Psi \in \mathcal{D}} \ \mathbb{E}_y \max_{|I|=S} \|\Psi_I^\star y\|_1. \qquad (5)$$
Next we need a random sparse coefficient model to generate our signals $y$. We make the following definition, see also Schnass (2014).

Definition 1 A probability distribution (measure) $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$ is called symmetric if for all measurable sets $\mathcal{X} \subseteq S^{K-1}$, for all sign sequences $\sigma \in \{-1, 1\}^K$ and all permutations $p$ we have
$$\nu(\sigma \mathcal{X}) = \nu(\mathcal{X}), \quad \text{where } \sigma \mathcal{X} := \{(\sigma_1 x_1, \ldots, \sigma_K x_K) : x \in \mathcal{X}\}, \quad \text{and}$$
$$\nu(p(\mathcal{X})) = \nu(\mathcal{X}), \quad \text{where } p(\mathcal{X}) := \{(x_{p(1)}, \ldots, x_{p(K)}) : x \in \mathcal{X}\}.$$
Setting $y = \Phi x$, where $x$ is drawn from a symmetric probability measure $\nu$ on the unit sphere, has the advantage that for dictionaries which are orthonormal bases the resulting signals have unit norm, and for general dictionaries the signals have unit square norm in expectation, that is $\mathbb{E}(\|y\|_2^2) = 1$. This reflects the situation in practical applications, where we would normalize the signals in order to equally weight their importance. One example of such a probability measure can be constructed from a non-negative, non-increasing sequence $c \in \mathbb{R}^K$ with $\|c\|_2 = 1$, which we permute uniformly at random and provide with random signs. To be precise, for a permutation $p : \{1, \ldots, K\} \to \{1, \ldots, K\}$ and a sign sequence $\sigma$, $\sigma_i = \pm 1$, we define the sequence $c_{p,\sigma}$ component-wise as $c_{p,\sigma}(i) := \sigma_i c_{p(i)}$, and set $\nu(x) = (2^K K!)^{-1}$ if there exist $p, \sigma$ such that $x = c_{p,\sigma}$ and $\nu(x) = 0$ otherwise. While being very simple, this measure exhibits all the necessary structure, and indeed in our proofs we will reduce the general case of a symmetric measure to this simple case.

So far we have not incorporated any sparse structure in our coefficient distribution. To motivate the sparsity requirements on our coefficients we will recycle the simple negative example of a sparse coefficient distribution for which the original generating dictionary is not at a local maximum of (5) with $S = 1$ (Schnass, 2014).

Example 1 Let $U$ be an orthonormal basis and let the signals be constructed as $y = Ux$. If $x$ is randomly 2-sparse with flat coefficients, that is, drawn from the simple symmetric probability measure with base sequence $c$, where $c_1 = c_2 = 1/\sqrt{2}$ and $c_i = 0$ for $i \geq 3$, then $U$ is not a local maximum of (5) with $S = 1$. Indeed, since the signals are all 2-sparse, the maximal inner product with all atoms in $U$ is the same as the maximal inner product with only $d - 1$ atoms. This degree of freedom we can use to construct an ascent direction. Choose $U_\varepsilon = (u_1, \ldots, u_{d-1}, (u_d + \varepsilon u_1)/\|u_d + \varepsilon u_1\|_2)$; a direct computation of the expected maximal responses, going through the $d(d-1)$ possible locations of the two non-zero coefficients, then shows that $\mathbb{E}_y(\|U_\varepsilon^\star y\|_\infty) > \mathbb{E}_y(\|U^\star y\|_\infty)$ for all sufficiently small $\varepsilon > 0$.

From the above example we see that, in order to have a local maximum of (5) with $S = 1$ at the original dictionary, we need our signals to be truly 1-sparse, that is, we need to have a decay between the first and the second largest coefficient. In the following sections we will study how large this decay should be to have a local maximum exactly at or near to the generating dictionary for more general dictionaries and sparsity levels.
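A minimal sketch of this simple symmetric coefficient model: a fixed non-negative, non-increasing base sequence $c$ with $\|c\|_2 = 1$ is permuted uniformly at random and provided with random signs, and the signal is $y = \Phi c_{p,\sigma}$. The concrete decay profile and dimensions below are purely illustrative.

```python
import numpy as np

def sample_coefficients(c, rng):
    # x = c_{p,sigma}: random permutation p and random signs sigma of c
    return c[rng.permutation(len(c))] * rng.choice([-1.0, 1.0], size=len(c))

rng = np.random.default_rng(2)
d, K, S = 16, 32, 4
# base sequence: S equally large coefficients, fast decay afterwards (illustrative)
c = np.concatenate([np.ones(S), 0.1 * 0.5 ** np.arange(K - S)])
c /= np.linalg.norm(c)                      # unit ell_2 norm
Phi = rng.standard_normal((d, K))
Phi /= np.linalg.norm(Phi, axis=0)          # unit norm atoms
x = sample_coefficients(c, rng)
y = Phi @ x                                 # noiseless training signal y = Phi c_{p,sigma}
```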
4.1 Exact Recovery

To warm up we first provide an asymptotic exact dictionary identification result for (5) for incoherent dictionaries in the noiseless setting.

Theorem 2 Let $\Phi$ be a unit norm frame with frame constants $A \leq B$ and coherence $\mu$. Let the coefficients $x$ be drawn from a symmetric probability distribution $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$ and assume that the signals are generated as $y = \Phi x$. If there exists $\beta > 0$ such that for $c_1(x) \geq c_2(x) \geq \ldots \geq c_K(x) \geq 0$, the non-increasing rearrangement of the absolute values of the components of $x$, we have $c_S(x) - c_{S+1}(x) - 2\mu \|x\|_1 \geq \beta$ almost surely, that is
$$\nu\big(c_S(x) - c_{S+1}(x) - 2\mu \|x\|_1 \geq \beta\big) = 1, \qquad (6)$$
then there is a local maximum of (5) at $\Phi$. Moreover for $\Psi \neq \Phi$ we have $\mathbb{E}_y \max_{|I|=S} \|\Psi_I^\star y\|_1 < \mathbb{E}_y \max_{|I|=S} \|\Phi_I^\star y\|_1$ as soon as
$$0 < d(\Phi, \Psi) \lesssim \frac{\beta}{\sqrt{B}} \left( \log \frac{S K^2 \sqrt{B}}{\beta\,(\bar{c}_1 + \ldots + \bar{c}_S)} \right)^{-1}, \qquad (7)$$
where $\bar{c}_i := \mathbb{E}_x(c_i(x))$.

Proof idea We briefly sketch the main ideas of the proof, which are the same as for the corresponding theorem for the K-SVD principle (Schnass, 2014). For self-containedness of the paper the full proof is included in Appendix A.1. Assume that we have the case of a simple probability measure based on one sequence $c$, that is $x = c_{p,\sigma}$. For any fixed permutation $p$ the condition in (6) ensures that for all sign sequences $\sigma$, and consequently all signals, the maximal $S$ responses of the original dictionary $\Phi$ are attained at $I_p = p^{-1}(\{1, \ldots, S\})$ and that there is a gap of size $\beta$ to the remaining responses. For an $\varepsilon$-perturbation of the generating dictionary we have $\psi_i \approx (1 - \varepsilon_i^2/2)\phi_i + \varepsilon_i z_i$ for some unit vectors $z_i$ with $\langle z_i, \phi_i \rangle = 0$ and $\varepsilon_i \leq \varepsilon$. Now for most sign sequences the contribution of $\varepsilon_i z_i$ to the response $\langle \psi_i, \Phi c_{p,\sigma} \rangle$ will be smaller than $\beta/2$, so the maximal $S$ responses will still be attained at $I_p$. Comparing the loss of the perturbed dictionary over the typical sign sequences of all permutations, which scales as $\frac{c_1 + \ldots + c_S}{2K} \sum_i \varepsilon_i^2$, to the maximal gain $S\varepsilon\sqrt{B}$ over the approximately $2\sum_i \exp(-\beta^2/\varepsilon_i^2)$ atypical sign sequences shows that there is a maximum at the original dictionary. The general result follows from an integration argument.

As already mentioned, while for the K-SVD criterion (3) there is always an optimum at the generating dictionary if all training signals are $S$-sparse, this is not obvious for the response principle. Indeed, in the special case where all the training signals are exactly $S$-sparse, $c_{S+1}(x) = 0$ almost surely, we get an additional condition to ensure asymptotic recoverability,
$$c_S(x) - 2\mu \sum_{s=1}^S c_s(x) \geq \beta > 0, \quad \text{almost surely}.$$
To get a better feeling for this constraint we bound the sum over the $S$ largest coefficients by $S$ times the largest one, $\sum_{s=1}^S c_s(x) \leq S c_1(x)$, and arrive at the condition
$$c_S(x) \geq 2S\mu\, c_1(x), \qquad (8)$$
which is the classical condition under which simple thresholding will find the support of an exactly $S$-sparse signal (Schnass and Vandergheynst, 2008).

4.2 Stability under Coherence and Noise

While giving a first insight into the identification properties of the response principle, Theorem 2 suffers from two main limitations. First, the required condition on the coherence of the dictionary with respect to the decay of the coefficients, $c_S(x) - c_{S+1}(x) - 2\mu \|c(x)\|_1 > 0$, is unfortunately quite strict. In the most favourable case of exactly $S$-sparse signals with equally sized coefficients, $c_1(x) = \ldots = c_S(x) = 1/\sqrt{S}$, we see from (8) that we can only identify dictionaries from very sparse signals, where $S \lesssim \mu^{-1}$. In the case of very incoherent dictionaries with $\mu = O(1/\sqrt{d})$ this means that $S \lesssim \sqrt{d}$.
However, for most sign sequences $\sigma$ we have
$$|\langle \phi_i, \Phi c_{p,\sigma} \rangle| = \Big| \sigma_i c_{p(i)} + \sum_{j \neq i} \sigma_j c_{p(j)} \langle \phi_i, \phi_j \rangle \Big| \approx c_{p(i)} \pm O\bigg( \Big( \sum_{j \neq i} c_{p(j)}^2 |\langle \phi_i, \phi_j \rangle|^2 \Big)^{1/2} \bigg),$$
which indicates that a condition of the form $\mu \lesssim c_S - c_{S+1}$ may be strong enough to guarantee (approximate) recoverability of the dictionary. Assuming again the most favourable case of equally sized coefficients, we could therefore identify dictionaries from signals with sparsity levels of the order $S \lesssim \mu^{-2}$, which, in the case of incoherent dictionaries, means of the order of the ambient dimension, $S \lesssim d$.

The second limitation of Theorem 2 is that, even if it allows for not exactly $S$-sparse signals, it does not take into account noise. Our next goal is therefore to extend the exact identification result in Theorem 2 to a stable identification result for less sparse (larger $S$) and noisy signals. For this task we first need to amend our signal model to incorporate noise. We would like to consider unbounded white noise, but also keep the property that in expectation the signals have unit square norm. Further, for the next section, where we want to transform our asymptotic identification results into results for finite sample sizes, it will be convenient if our signals are bounded. These considerations lead to the following model:
$$y = \frac{\Phi x + r}{\sqrt{1 + \|r\|_2^2}}, \qquad (9)$$
where $r = (r(1), \ldots, r(d))$ is a centred random subgaussian vector with parameter $\rho$. That is, the entries $r(i)$ are independent and satisfy $\mathbb{E}(e^{t r(i)}) \leq e^{t^2 \rho^2 / 2}$. Employing this noisy signal model and formalizing the ideas about the typical gap size between responses of the generating dictionary inside and outside the true support leads to the following theorem.

Theorem 3 Let $\Phi$ be a unit norm frame with frame constants $A \leq B$ and coherence $\mu$. Let the coefficients $x$ be drawn from a symmetric probability distribution $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$. Further let $r = (r(1), \ldots, r(d))$ be a centred random subgaussian noise vector with parameter $\rho$ and assume that the signals are generated according to the noisy signal model in (9). If there exists $\beta > 0$ such that for $c_1(x) \geq c_2(x) \geq \ldots \geq c_K(x) \geq 0$, the non-increasing rearrangement of the absolute values of the components of $x$, we have $c_S(x) - c_{S+1}(x) \geq \beta$ almost surely and
$$\max\{\mu, \rho\} \leq \frac{\beta}{\sqrt{72(\log a + \log\log a)}} \quad \text{for} \quad a = \frac{112 K^2 S (\sqrt{B} + 1)}{C_r \beta\,(\bar{c}_1 + \ldots + \bar{c}_S)}, \qquad (10)$$
where $C_r = \mathbb{E}_r\big((1 + \|r\|_2^2)^{-1/2}\big)$ and $\bar{c}_i := \mathbb{E}_x(c_i(x))$, then there is a local maximum of (5) at some $\tilde{\Psi}$ satisfying
$$d(\tilde{\Psi}, \Phi) \leq \frac{12 S K^2 \sqrt{B}}{C_r (\bar{c}_1 + \ldots + \bar{c}_S)} \exp\left( -\frac{\beta^2}{72 \max\{\mu^2, \rho^2\}} \right).$$

Proof idea As outlined at the beginning of the section, the main ingredient we have to add to the proof idea of Theorem 2 is a probabilistic argument to substitute the condition guaranteeing that the $S$ largest responses of the generating dictionary are attained at $I_p$. Due to concentration of measure we get that for most sign sequences, and therefore most signals, the maximum is still attained at $I_p$. Moreover the gap to the remaining responses is actually large enough to accommodate relatively high levels of noise and/or perturbation. The detailed proof can be found in Appendix A.2.

Let us make some observations about the last result. First, we want to point out that for subgaussian noise with parameter $\rho$ the quantity $C_r = \mathbb{E}_r\big((1 + \|r\|_2^2)^{-1/2}\big)$ in the statement above is well behaved. If for example the $r(i)$ are i.i.d. Bernoulli variables, that is $\mathbb{P}(r(i) = \pm\rho) = \frac{1}{2}$, we have $C_r = (1 + d\rho^2)^{-1/2}$. In general we have the following estimate, due for instance to Theorem 1 by Hsu et al. (2012).
Since we have $\mathbb{P}\big(\|r\|_2^2 \geq \rho^2 (d + 2\sqrt{dt} + 2t)\big) \leq e^{-t}$, setting $t = d$ we get $\mathbb{P}\big(\|r\|_2^2 \geq 5d\rho^2\big) \leq e^{-d}$, which leads to $C_r \geq (1 - e^{-d})(1 + 5d\rho^2)^{-1/2}$.

Also, to illustrate the result, we again specialize it to the most favourable case of exactly $S$-sparse signals with balanced coefficients, that is $c_S(x) = S^{-1/2}$. Assuming white Gaussian noise with variance $\rho_G^2$ we see that identification is possible even for expected signal to noise ratios of the order $O(S/d)$, that is
$$\frac{\mathbb{E}(\|\Phi x\|_2^2)}{\mathbb{E}(\|r\|_2^2)} \approx \frac{S}{d}.$$
Similarly, by specializing Theorem 3 to the case of exactly $S$-sparse and noiseless signals we get - to the best of our knowledge - the first result establishing that locally it is possible to stably identify dictionaries from signals with sparsity levels beyond the spark of the generating dictionary. Indeed, even if some of the $S$-sparse signals could have representations in $\Phi$ that require fewer than $S$ atoms, there will still be a local maximum of the asymptotic criterion close to the original dictionary as long as the smallest coefficient of each signal is of the order $O(\mu)$, which in the most favourable case means that we can have $S \sim \mu^{-2}$ or $S \sim d$. The quality of this result is on a par with the best results for finding sparse approximations in a given dictionary, which say that on average Basis Pursuit or thresholding can find the correct sparse support even for signals with sparsity levels of the order of the ambient dimension (Tropp, 2008; Schnass and Vandergheynst, 2007). Next note that with the available tools it would be possible to also consider a signal model where a small fraction of the coefficients violates the decay condition $c_S(x) - c_{S+1}(x) \geq \beta$ and still have stability. However, we leave explorations in that direction to the interested reader and instead turn to the study of the criterion for a finite number of training samples.

5. Finite Sample Size Results

In this section we will transform the two asymptotic results from the last section into results for finite sample sizes, that is, we will study when $\Phi$ is close to a local maximum of
$$\max_{\Psi \in \mathcal{D}} \frac{1}{N} \sum_{n=1}^N \max_{|I|=S} \|\Psi_I^\star y_n\|_1, \qquad (12)$$
assuming that the $y_n$ follow either the noise-free or the noisy signal model. For convenience we will do the analysis for the normalized version (12) of the $S$-response criterion (4).

Theorem 4 Let $\Phi$ be a unit norm frame with frame constants $A \leq B$ and coherence $\mu$. Let the coefficients $x_n$ be drawn from a symmetric probability distribution $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$ and assume that the signals are generated as $y_n = \Phi x_n$. If there exists $\beta > 0$ such that for $c_1(x_n) \geq c_2(x_n) \geq \ldots \geq c_K(x_n) \geq 0$, the non-increasing rearrangement of the absolute values of the components of $x_n$, we have $c_S(x_n) - c_{S+1}(x_n) - 2\mu \|x_n\|_1 \geq \beta$ almost surely, and the target precision $\varepsilon$ satisfies $\varepsilon \lesssim \frac{\beta}{\sqrt{B}} \big( \log \frac{S K^2 \sqrt{B}}{\beta(\bar{c}_1 + \ldots + \bar{c}_S)} \big)^{-1}$, where $\bar{c}_i := \mathbb{E}_{x_n}(c_i(x_n))$, then except with probability
$$\exp\left( -\frac{N \varepsilon^2 (\bar{c}_1 + \ldots + \bar{c}_S)^2}{129\, S^2 K^2 B} + Kd \log \frac{\sqrt{B}}{\varepsilon\,(\bar{c}_1 + \ldots + \bar{c}_S)} \right)$$
there is a local maximum of (12), respectively (4), at some $\tilde{\Psi}$ satisfying $d(\tilde{\Psi}, \Phi) \leq \varepsilon + \frac{\varepsilon^2}{8K}$.

Theorem 5 Let $\Phi$ be a unit norm frame with frame constants $A \leq B$ and coherence $\mu$. Let the coefficients $x_n$ be drawn from a symmetric probability distribution $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$. Further let $r_n = (r_n(1), \ldots, r_n(d))$ be i.i.d. centred random subgaussian noise vectors with parameter $\rho$ and assume that the signals are generated according to the noisy signal model in (9). If there exists $\beta > 0$ such that for $c_1(x_n) \geq c_2(x_n) \geq \ldots \geq c_K(x_n) \geq 0$, the non-increasing rearrangement of the absolute values of the components of $x_n$, we have $c_S(x_n) - c_{S+1}(x_n) \geq \beta$ almost surely, and if the target precision $\varepsilon$, the noise parameter $\rho$ and the coherence $\mu$ satisfy
$$\varepsilon \leq \frac{\beta}{9\sqrt{4 + 9\log a}} \quad \text{and} \quad \max\{\mu, \rho\} \leq \frac{\beta}{\sqrt{72(\log a + \log\log a)}} \quad \text{for} \quad a = \frac{150 K^2 S (\sqrt{B} + 1)}{C_r \beta\,(\bar{c}_1 + \ldots + \bar{c}_S)}, \qquad (13)$$
where $C_r = \mathbb{E}_{r_n}\big((1 + \|r_n\|_2^2)^{-1/2}\big)$ and $\bar{c}_i := \mathbb{E}_{x_n}(c_i(x_n))$, then except with probability
$$\exp\left( -\frac{N \varepsilon_{\mu,\rho}^2 (\bar{c}_1 + \ldots + \bar{c}_S)^2}{513\, C_r^2 S^2 K^2 (\sqrt{B} + 1)^2} + Kd \log \frac{\sqrt{B} + 1}{\varepsilon_{\mu,\rho}\,(\bar{c}_1 + \ldots + \bar{c}_S)} \right),$$
where
$$\varepsilon_{\mu,\rho} = \max\left\{ \varepsilon,\ \frac{12 S K^2 \sqrt{B}}{C_r (\bar{c}_1 + \ldots + \bar{c}_S)} \exp\left( -\frac{\beta^2}{72 \max\{\mu^2, \rho^2\}} \right) \right\},$$
there is a local maximum of (12), respectively (4), at some $\tilde{\Psi}$ satisfying $d(\tilde{\Psi}, \Phi) \leq \varepsilon_{\mu,\rho} + \frac{\varepsilon_{\mu,\rho}^2}{16K}$.

Proof idea The proofs, which can be found in Appendix A.3, are based on three ingredients: a Lipschitz property of the mapping $\Psi \mapsto \frac{1}{N} \sum_{n=1}^N \max_{|I|=S} \|\Psi_I^\star y_n\|_1$ for the respective signal model, the concentration of the sum around its expectation for a $\delta$-net covering the space of all admissible dictionaries close to $\Phi$, and a triangle inequality argument to show that the finite sample response differences are close to the expected response differences and therefore larger than zero for all $\varepsilon \geq \varepsilon_{\mu,\rho}$.

To see better how the sample complexity behaves, we simplify the two theorems to the special case of noiseless, exactly $S$-sparse signals with balanced coefficients for various orders of magnitude of $S$. If we have $S = O(1)$, Theorem 4 implies that in order to have a maximum within radius $\varepsilon$ of the original dictionary $\Phi$ with probability at least $1 - e^{-Kd}$ we need $N = O(K^3 d\, \varepsilon^{-2})$ samples. Conversely, given $N$ training signals we can expect the distance between the generating dictionary and the closest local maximum to be of the order $O(K^2 N^{-1/2})$. If we assume a very incoherent dictionary where $\mu = O(d^{-1/2})$ and thus let the sparsity level be of the order $O(\sqrt{d})$, the sample complexity rises to $N = O(K^3 d^{3/2} \varepsilon^{-2})$. Taking into account that by (13) the target precision $\varepsilon$ needs to be of order $O(S^{-1/2}) = O(d^{-1/4})$, this means that we need at least $N = O(K^3 d^2)$ training signals, and once this initial level is reached, the error goes to zero at rate $N^{-1/2}$. For an even lower sparsity level, $S = O(d)$, again assuming a very incoherent dictionary, the sample complexity for target precision $\varepsilon$, implied now by Theorem 5, rises to $N = O(K^3 d^2 \varepsilon^{-2})$. In this regime, however, we cannot reach arbitrarily small errors by choosing $N$ large enough but only approach the asymptotic precision $\varepsilon_\mu = 16 K^2 \sqrt{SB} \exp(-d/(72 S))$. Following these promising theoretical results, in the next section we will finally see how the theory translates into practice.

6. Experiments

After showing that the optimization criterion in (4) is locally suitable for dictionary identification, in this section we present an iterative thresholding and K means type algorithm (ITKM) to actually find the local maxima of (4) and conduct some experiments to illustrate the theoretical results. We recall that given the input signals $Y = (y_1, \ldots, y_N)$ and a fixed sparsity parameter $S$ we want to solve
$$\max_{\Psi \in \mathcal{D}} \sum_n \max_{|I|=S} \|\Psi_I^\star y_n\|_1.$$
Using Lagrange multipliers for the unit norm constraints and noting that
$$\frac{\partial}{\partial \psi_k} \sum_n \max_{|I|=S} \|\Psi_I^\star y_n\|_1 = \sum_{n : k \in I(\Psi, y_n)} \operatorname{sign}(\langle \psi_k, y_n \rangle)\, y_n \quad \text{and} \quad \frac{\partial}{\partial \psi_k} \|\psi_k\|_2^2 = 2\psi_k,$$
where $I(\Psi, y_n) := \arg\max_{|I|=S} \|\Psi_I^\star y_n\|_1$, we arrive at the following update rule,
$$\psi_k^{new} = \lambda_k \sum_{n : k \in I(\Psi^{old}, y_n)} \operatorname{sign}(\langle \psi_k^{old}, y_n \rangle)\, y_n, \qquad (14)$$
where $\lambda_k$ is a scaling parameter ensuring that $\|\psi_k^{new}\|_2 = 1$. In practice, when we do not have an oracle giving us the generating dictionary as initialization, we also need to safeguard against bad initializations resulting in a zero-update $\psi_k^{new} = 0$.
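The update rule (14) translates directly into a few lines of code. The following sketch is one possible batch implementation of ITKM in Python/NumPy (the author's reference implementation is a Matlab toolbox, see the footnote below); it guards against zero updates by re-initializing the affected atoms at random, which is one of the options discussed next.

```python
import numpy as np

def itkm(Y, K, S, iters=100, Psi0=None, rng=None):
    """Iterative Thresholding and K signed Means for criterion (4)."""
    rng = rng or np.random.default_rng()
    d, N = Y.shape
    Psi = Psi0.copy() if Psi0 is not None else rng.standard_normal((d, K))
    Psi /= np.linalg.norm(Psi, axis=0)
    for _ in range(iters):
        R = Psi.T @ Y                                    # responses, K x N
        # thresholding: for each signal the S atoms with largest |response|
        idx = np.argpartition(np.abs(R), -S, axis=0)[-S:, :]
        mask = np.zeros((K, N), dtype=bool)
        mask[idx, np.arange(N)] = True
        # K signed means: sum the selected signals with the sign of their response
        Psi_new = Y @ (np.sign(R) * mask).T              # d x K, update rule (14)
        norms = np.linalg.norm(Psi_new, axis=0)
        dead = norms < 1e-12                             # atoms with zero update
        if dead.any():                                   # safeguard: random re-init
            Psi_new[:, dead] = rng.standard_normal((d, dead.sum()))
            norms[dead] = np.linalg.norm(Psi_new[:, dead], axis=0)
        Psi = Psi_new / norms                            # lambda_k normalization
    return Psi
```

Each iteration is dominated by the two matrix products with $Y$, matching the $O(dKN)$ per-iteration complexity discussed next.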
For example we can choose the zero-updated atom uniformly at random from the unit sphere or from the input signals. Note that finding the sets I(Ψold, yn) corresponds to N thresholding operations while updating according to (14) corresponds to K signed means. Altogether this means that each iteration of ITKM has computational complexity determined by the matrix multiplication Ψ Y , meaning O(d KN). This is light in comparison to K-SVD, which even when using thresholding instead of OMP as sparse approximation procedure still requires the calculation of the maximal singular vector of K on average d N K matrices. It is also more computationally efficient than the algorithm for local dictionary refinement, proposed by Arora et al. (2014), which is also based on averaging. Furthermore it is straightforward to derive online or parallelized versions of ITKM. In an online version for each newly arriving signal yn we calculate I(Ψold, yn) using thresholding and update ψnew k = ψnew k + sign( ψold k , yn )yn for k I(Ψold, yn). After N signals have been processed we renormalize the atoms ψnew k to have unit norm and set Ψold = Ψnew. Similarly, to parallelize we divide the training samples into m sets of size N m. Then on each node m we learn a dictionary Ψnew m according to (14) with λk = 1. We then calculate the sum of these dictionaries Ψnew 0 = P m Ψnew m and renormalize the atoms in Ψnew 0 to have unit norm. Armed with this very simple algorithm we will now conduct four experiments to illustrate our theoretical findings1. 6.1 ITKM vs. K-SVD In our first experiment we compare the local recovery error of ITKM and K-SVD for 3dimensional bases with increasing condition numbers. The bases are perturbations of the canonical basis Φ = (e1, e2, e3) with the vector v = (1, 1, 1). That is, Φt = (et 1, et 2, et 3), where et i = (ei + tv)/ (ei + tv) 2 and t varying from 0 to 0.5 in steps of 0.1, which corresponds to condition numbers κ(Φt) varying from 1 to 2.5. We generate N = 4096 approximately 1-sparse noiseless signals from the signal model described in Table 1 with S = 1, T = 2, ρ = 0 and b = 0.1/0.2 and run both ITKM and K-SVD with 1000 iterations, sparsity parameter S = 1 and the true dictionary (basis) as initialization. Figure 1(a) shows the recovery error d(Φt, Ψ) between the original dictionary and the output of the respective algorithm averaged over 10 runs. As predicted by the theoretical results on the corresponding underlying minimization principles, the recovery errors of ITKM and K-SVD are roughly the same for Φ0, which is an orthogonal basis and therefore tight. However, while for ITKM the recovery error stays 1. A Matlab penknife (mini-toolbox) for playing around with ITKM and reproducing the experiments can be found at http://homepage.uibk.ac.at/~c7021041/ITKM.zip. Signal Model Given the generating dictionary Φ our signal model further depends on four coefficient parameters, S - the effective sparsity or number of comparatively large coefficients, b - deciding the decay factor of these sparse coefficients, T - the total number of non-zero coefficients (T S) and ρ - the noise level. Given these parameters we choose a decay factor cb uniformly at random in the interval [1 b, 1]. We set ci = ci b/ S for 1 i S and ci = 0 for T < i K. If T = S we renormalize the sequence to have unit norm, while if T > S we choose the vector (c S+1, . . . , c T ) uniformly at random on the sphere of radius R, where R is chosen such that the resulting sequence c has unit norm. 
We then choose a permutation $p$ and a sign sequence $\sigma$ uniformly at random and set $y = \Phi c_{p,\sigma}$, respectively $y = (\Phi c_{p,\sigma} + r)/\sqrt{1 + \|r\|_2^2}$, where $r$ is a Gaussian noise vector with variance $\rho^2$ if $\rho > 0$.

Table 1: Signal Model

constantly low over all condition numbers, for K-SVD it increases with increasing condition number or non-tightness.

6.2 Recovery Error and Sample Size

The next experiment is designed to show how fast the maximizer $\tilde{\Psi}$ near the original dictionary $\Phi$ converges to $\Phi$ with increasing sample size $N$. The generating dictionaries consist of the canonical basis in $\mathbb{R}^d$ for $d = 4, 8, 16$ and the first $d/2$ elements of the Hadamard basis, and as such are not tight. For every set of parameters $d$, $S(T)$, $b$ we generate $N$ noiseless signals with $N$ varying from $2^7 = 128$ to $2^{14} = 16384$ and run ITKM with 1000 iterations, sparsity parameter $S$ equal to the coefficient parameter $S$ and the true dictionary as initialization. Figure 1(b) shows the recovery error $d(\Phi, \tilde{\Psi})$ between the original dictionary $\Phi$ and the output of ITKM $\tilde{\Psi}$, averaged over 10 runs. As predicted by Theorem 4 the recovery error decays as $N^{-1/2}$. However, the separation of the curves for $d = 4, 8, 16$ and almost exactly sparse signals ($b = 0.01$) by a factor around 2 instead of 4, as suggested by the estimate $\varepsilon \approx K^2 N^{-1/2}$, indicates that the cubic dependence of the sampling complexity on the number of atoms $K$ may be too pessimistic and could be lowered.

6.3 Stability of Recovery Error under Coherence and Noise

With the last two experiments we illustrate the stability of the maximization criterion under coherence and noise. As generating dictionaries we use again the canonical basis plus half Hadamard dictionaries described in the last experiment, which have coherence $\mu = d^{-1/2}$. To test the stability under coherence we use a large enough number of noiseless training signals, $N = 16384$, such that the distance between the local maximum of the criterion near the generating dictionary, that is the output of ITKM with oracle initialization, and the generating dictionary is mainly determined by the ratio between the gap size $\beta$ and the coherence.

Figure 1: (a) Local recovery error of K-SVD and ITKM for two different types of decaying coefficients and bases with varying condition numbers in $\mathbb{R}^3$. (b) Decay of the recovery error of ITKM with increasing number of training signals.

For each set of parameters $d$, $S(T)$ we create $N$ training signals with decreasing gap sizes $\beta$ by increasing $b$ from 0 to 0.1 in steps of 0.01 and run ITKM with oracle initialization, parameter $S$ and 1000 iterations. Figure 2(a) shows the recovery error $d(\Phi, \tilde{\Psi})$ between the original dictionary $\Phi$ and the output of ITKM $\tilde{\Psi}$, again averaged over 10 trials. Again the experiments reflect our theoretical results. For $d = 8, 16$ with $S = 1$ or $d = 16$ with $S = 2$ the gap size is large enough that over the whole range of parameters the recovery error stays constantly low at the level defined by the number of samples. Note that this is quite good, since for $b = 0.1$ we are already far beyond the gap size to coherence ratios for which the stable theoretical results hold.
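All experiments in this section draw their training signals from the model summarized in Table 1. For concreteness, the following sketch shows one possible implementation of that generator (the parameter names mirror Table 1; the concrete handling of the tail coefficients is our reading of the description and not the author's code).

```python
import numpy as np

def table1_signal(Phi, S, T, b, rho, rng):
    """Draw one training signal according to the model of Table 1."""
    d, K = Phi.shape
    cb = rng.uniform(1 - b, 1)                 # decay factor c_b in [1-b, 1]
    c = np.zeros(K)
    c[:S] = cb ** np.arange(1, S + 1) / np.sqrt(S)
    if T > S:
        # remaining T-S coefficients on a sphere of radius R such that ||c||_2 = 1
        tail = rng.standard_normal(T - S)
        R = np.sqrt(max(1.0 - np.sum(c[:S] ** 2), 0.0))
        c[S:T] = R * np.abs(tail) / np.linalg.norm(tail)
    c /= np.linalg.norm(c)                     # renormalize (relevant if T == S)
    # random permutation and random signs: x = c_{p,sigma}
    x = c[rng.permutation(K)] * rng.choice([-1.0, 1.0], size=K)
    y = Phi @ x
    if rho > 0:                                # Gaussian noise with variance rho^2
        r = rng.normal(0.0, rho, size=d)
        y = (y + r) / np.sqrt(1.0 + r @ r)
    return y
```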
On the other hand, for $d = 8$ with $S = 2$ or $d = 16$ with $S = 3$ the gap early on decreases enough to become the error determining factor, and so we see an increase in the recovery error as $b$ grows. Conversely, to test the stability under noise we use a large enough number of exactly sparse training signals, such that the recovery error will be mainly determined by the noise level. For each set of parameters $d$, $S(S)$, $b$ we create $N = 16384$ training signals with Gaussian noise of variance (noise level) $\rho^2$ going from 0 to 0.1 in steps of 0.01 and run ITKM with oracle initialization, parameter $S$ and 1000 iterations. Figure 2(b) shows the recovery error $d(\Phi, \tilde{\Psi})$ between the original dictionary $\Phi$ and the output of ITKM $\tilde{\Psi}$, this time averaged over 20 trials. The curves again correspond to the prediction of the theoretical results, that is, the recovery error stays at roughly the same level defined by the number of samples until the noise becomes large enough and then increases.

Figure 2: Increase of the recovery error with (a) decreasing ratio between coefficient gap and coherence and (b) increasing noise level.

What is maybe interesting to observe in both experiments is the dithering effect for $d = 16$ with $S = 3$, which is due to the special structure of the dictionary. Indeed, using almost equally sized, almost exactly sparse coefficients, it is possible to build signals using only the canonical basis that have almost the same response in only half the Hadamard basis, and the other way round. This indicates that slight perturbations of one with the other lead to even better responses and therefore a larger recovery error. After showing that the theoretical results translate into algorithmic practice, we finally turn to a discussion of our results in the context of existing work and point out directions of future research.

7. Discussion

We have introduced a new response maximization principle for dictionary learning and shown that it is locally suitable to identify a generating $\mu$-coherent dictionary from approximately $S$-sparse training signals to arbitrary precision as long as the sparsity level is of the order $O(\mu^{-1})$. We have also presented - to the best of our knowledge - the first results showing that stable dictionary identification is locally possible not only for signal to noise ratios of the order $O(\sqrt{d})$ but also for sparsity levels of the order $O(\mu^{-2})$. The derived sample complexity (omitting log factors) of $O(K^3 d\, \varepsilon^{-2})$ for signals with sparsity levels $S = O(1)$ is roughly the same as for the K-SVD criterion (Schnass, 2014) or the $\ell_1$-minimization criterion (Jenatton et al., 2014), but somewhat large compared to recently developed dictionary learning algorithms that have a sample complexity of $O(K^2)$ (Arora et al., 2014; Agarwal et al., 2014a) or $O(K \varepsilon^{-2})$ (Agarwal et al., 2014b). However, as the sparsity level approaches and goes beyond $\mu^{-1} \approx \sqrt{d}$, the derived sample complexity of $O(K^3 d^2 \varepsilon^{-2})$ compares quite favourably to the sample complexity of $O(K^{1/(4\eta)})$ for a sparsity level $d^{1/2 - \eta}$ as projected by Arora et al. (2014). Given that also our experimental results suggest that $O(K^3 d\, \varepsilon^{-2})$ is quite pessimistic, one future direction of research aims to lower the sample complexity.
In particular ongoing work suggests that for the ITKM algorithm a sample size of order K log K is enough to guarantee local recovery with high probability. Local Identification of Overcomplete Dictionaries Another strong point of the results is that the corresponding maximization algorithm ITKM (Iterative Thresholding and K signed Means) is locally successful, as demonstrated in several experiments, and computationally very efficient. The most complex step in each iteration is the matrix multiplication Φ Y of order O(d KN), which is even lighter than the iterative averaging algorithm described by Arora et al. (2014). However, the serious drawback is that ITKM is only a local algorithm and that all our results are only local. Also while for the K-SVD criterion and the ℓ1-minimization criterion there is reason to believe that all local minima might be equivalent, the response maximization principle has a lot of smaller local maxima, which is confirmed by preliminary experiments with random initializations. There ITKM fails but with grace, that is, it outputs local maximizers that have not all, but only most atoms in common with what seems to be the global maximizer near the generating dictionary. This behaviour is in strong contrast to the algorithms presented by Arora et al. (2014); Agarwal et al. (2014b), that have global success guarantees at a computational cost of the order O(d N2), and leads to several very important research directions. First we want to confirm that ITKM has a convergence radius of the order O(1/ S). This is suggested by the derived radius of the area on which the generating dictionary is the optimal maximizer as well as preliminary experiments. Alternatively, we could investigate how the results for the local iterative algorithms (Arora et al., 2014; Agarwal et al., 2014a) could be extended to larger sparsity levels and convergence radii using our techniques. The associated important question is how to extend the results for the algorithms presented by Arora et al. (2014); Agarwal et al. (2014b) to sparsity levels O(µ 2), if possible at lower cost than O(d N2). Given the conjectured size of the convergence radius for ITKM it would even be sufficient for the output of the algorithm to arrive at a dictionary within distance O(1/ S) to the generating dictionary, since the output could then be used as initialization for ITKM. A parallel approach for getting global identification results for sparsity levels O(µ 2), that we are currently pursuing, is to analyze a version of ITKM using residual instead of pure signal means, which in preliminary experiments exhibits global convergence properties. The last research directions we want to point out are concerned with the realism of the signal model. The fact that for an input sparsity S a gap of order O(µ 2) between the S and S + 1 largest coefficient is sufficient can be interpreted as a relaxed dependence of the algorithm on the sparsity parameter, since a gap of order µ 2 can occur quite frequently. To further decrease this sensitivity to the sparsity parameter in the criterion and the algorithm we would therefore like to extend our results to the case where we can only guarantee a gap of order O(µ 2) between the S largest and the S + T largest coefficient for some T > 1. Last but not least we would like to exactly reflect the practical situation, where we would normalize our training signals to equally weight their importance and analyze the unit norm signal model where y = Φx + r/ Φx + r 2. 
Acknowledgments This work was supported by the Austrian Science Fund (FWF) under Grant no. J3335 and improved thanks to the reviewers comments and suggestions. Thanks also go to the Computer Vision Laboratory of the University of Sassari, Italy, which provided the surroundings, where almost all of the presented work was done, and to Michael Dymond, who looked over the final manuscript with a native speaker s eye. Appendix A. Proofs We now provide the full proofs for all our results. A.1 Proof of Theorem 2 We first reformulate and prove the theorem for the simple case of a symmetric coefficient distribution based on one sequence and then use an integration argument to extend it to the general case. Proposition 6 Let Φ be a unit norm frame with frame constants A B and coherence µ. Let x RK be a random permutation of a sequence c, where c1 c2 c3 . . . c K 0 and c 2 = 1, provided with random signs, that is x = cp,σ with probability P(p, σ) = (2KK!) 1. Assume that the signals are generated as y = Φx. If c satisfies c S > c S+1 + 2µ c 1 then there is a local maximum of (5) at Φ. Moreover for Ψ = Φ we have Ey max|I|=S Ψ Iy 1 < Ey max|I|=S Φ Iy 1 as soon as d(Φ, Ψ) c S c S+1 2µ c 1 B (c S c S+1 2µ c 1)(c1+...+c S) . (15) Proof We start by evaluating the objective function at the original dictionary Φ. max |I|=S Φ Iy 1 max |I|=S Φ IΦcp,σ 1 i I | φi, Φcp,σ | To estimate the sum of the (in absolute value) largest S inner products, we first assume that p is fixed. Setting Ip = p 1 ({1, . . . S}) we have, | φi, Φcp,σ | = σicp(i) + X j =i σjcp(j) φi, φj c S µ c 1 i Ip c S+1 + µ c 1 i / Ip . Together with the condition that c S > c S+1 + 2µ c 1 these estimates ensure that the S maximal inner products in absolute value are attained at Ip and so we get for the expectation, max |I|=S Φ IΦcp,σ 1 = Ep Eσ Φ IpΦcp,σ 1 cp(i) + σi X j =i σjcp(j) φi, φj = c1 + . . . + c S. To compute the expectation for a perturbation of the original dictionary we use the following parameterization of all ε-perturbations Ψ of the original dictionary Φ. If d(Ψ, Φ) = ε then Local Identification of Overcomplete Dictionaries ψi φi 2 = εi with maxi εi = ε and we have zi with φi, zi = 0, zi 2 = 1 and αi := 1 ε2 i /2 and ωi := (ε2 i ε4 i /4) 1 2 , such that ψi = αiφi + ωizi. Expanding the expectation as before we get, max |I|=S Ψ Iy 1 max |I|=S Ψ IΦcp,σ 1 i I | ψi, Φcp,σ | The tried and tested strategy applied now is showing that for small perturbations and most sign patterns σ the maximal inner products are still attained by i Ip. We have i Ip : | ψi, Φcp,σ | αi(c S µ c 1) ωi| zi, Φcp,σ | i / Ip : | ψi, Φcp,σ | αi(c S+1 + µ c 1) + ωi| zi, Φcp,σ |. Using Hoeffding s inequality we can estimate the typical sizes of the terms | zi, Φcp,σ |, P(| zi, Φcp,σ | t) = P(| X j =i σjcp(j) zi, φj | > t) 2 P j =i c2 p(j) zi, φj 2 In case ωi = 0 or equivalently εi = 0, we set t = s/ωi to arrive at P(ωi| zi, Φcp,σ | s) 2 exp s2 where we have used that ω2 i = ε2 i ε4 i /4 ε2 i , while in case εi = 0 we trivially have that P(ωi| zi, Φcp,σ | s) = 0. Summarizing these findings we see that except with probability i:εi =0 exp s2 i Ip : | ψi, Φcp,σ | αi(c S µ c 1) s i / Ip : | ψi, Φcp,σ | αi(c S+1 + µ c 1) + s. This means that as long as mini Ip αi(c S µ c 1) s maxi/ Ip αi(c S+1 +µ c 1)+s, which is for instance implied by setting s := 1 2(c S c S+1 2µ c 1 ε2 2 ), we have max |I|=S Ψ IΦcp,σ 1 = Ψ IpΦcp,σ 1. We now use this result for the calculation of the expectation over σ in (16). For any permutation p we define the set i {σ s.t. ωi| zi, Φcp,σ | s}. 
We then have max |I|=S Ψ IΦcp,σ 1 σ Σp P(σ) max |I|=S Ψ IΦcp,σ 1 + X σ / Σp P(σ) Ψ IpΦcp,σ 1 σ Σp P(σ) max |I|=S Ψ IΦcp,σ 1 Ψ IpΦcp,σ 1 + Eσ Ψ IpΦcp,σ 1 . (17) To estimate the sum over Σp, note that we have the following bounds: | ψi, Φcp,σ | = |αi φi, Φcp,σ + ωi zi, Φcp,σ | 2 )| φi, Φcp,σ | + ε 2 )| φi, Φcp,σ | ε max |I|=S Ψ IΦcp,σ 1 (1 ε2 2 ) max |I|=S Φ IΦcp,σ 1 + S ε 2 ) Φ IpΦcp,σ 1 + S ε Ψ IpΦcp,σ 1 (1 ε2 2 ) Φ IpΦcp,σ 1 S ε Substituting these estimates into (17) we get max |I|=S Ψ IΦcp,σ 1 σ Σp P(σ) 2εS B + Eσ Ψ IpΦcp,σ 1 B + Eσ Ψ IpΦcp,σ 1 . Next we calculate Eσ Ψ IpΦcp,σ 1 : Eσ Ψ IpΦcp,σ 1 = Eσ i Ip | ψi, Φcp,σ | αicp(i) + σi αiφi + ωizi, X j =i σjcp(j)φj i Ip αicp(i), (18) where have used that ε ((1 ε2 2 )c S µ c 1)/ B guarantees the expression within absolute values in (18) to always be positive. Collecting all these results we arrive at the following estimate for the value of the objective function at Ψ: max |I|=S Ψ Iy 1 max |I|=S Ψ IΦcp,σ 1 i:εi =0 exp (c S c S+1 2µ c 1 ε2 i Ip αicp(i) i:εi =0 exp (c S c S+1 2µ c 1 ε2 + c1 + . . . + c S Local Identification of Overcomplete Dictionaries Finally we are able to compare the expectation at the original dictionary to that at an ε-perturbation. Remembering that αi = 1 ε2 i 2 , we get max |I|=S Φ Iy 1 max |I|=S Ψ Iy 1 c1 + . . . + c S i:εi =0 exp (c S c S+1 2µ c 1 ε2 ε2 c1 + . . . + c S (c S c S+1 2µ c 1 ε2 Thus to have a local maximum at the original dictionary we need that B c1 + . . . + c S exp (c S c S+1 2µ c 1 ε2 and all that remains to be shown is that this is implied by (15). Since K 2, (15) implies that ε2 2 < c S c S+1 2µ c 1 2(1+3 log 96)2 c S c S+1 2µ c 1 100 and it suffices to show that (15) further implies B c1 + . . . + c S exp (c S c S+1 2µ c 1)2 992 Applying Lemma A.3 by Schnass (2014), which says that for a, b, ε > 0, 1 + 16 log(a b) implies that a exp b2 to the situation at hand, where a = 8SK2 B c1+...+c S and b = (c S c S+1 2µ c 1) 99 8 100 , we get that (19) is ensured by ε < c S c S+1 2µ c 1 99 e1/16SK2 B (c S c S+1 2µ c 1)(c1+...+c S) which simplifies to (15). Proof [of Theorem 2] Using the symmetry of ν, our strategy is to reduce the general to the simple coefficient model. Let c denote the mapping that assigns to each x SK 1 the non increasing rearrangement of the absolute values of its components, that is ci(x) = |xp(i)| for a permutation p such that c1(x) c2(x) . . . c K(x) 0. Then the mapping c together with the probability measure ν on SK 1 induces a pull-back probability measure νc on c(SK 1), by νc(Ω) := ν(c 1(Ω)) for any measurable set Ω c(SK 1). With the help of this new measure we can rewrite the expectations we need to calculate as max |I|=S Φ Iy 1 max |I|=S Φ IΦx 1 x max |I|=S Φ IΦx 1dν = Z max |I|=S Φ Φcp,σ(x) 1 The expectation inside the integral should seem familiar. Indeed we have calculated it already in the proof of Proposition 6 for c(x) a fixed decaying sequence satisfying c S(x) > c S+1(x) + 2µ x 1. Since this property is satisfied almost surely we have max |I|=S Φ Iy 1 max |I|=S Φ Φcp,σ(x) 1 c(x) c1(x) + . . . + c S(x)dνc := c1 + . . . + c S. For the expectation of a perturbed dictionary Ψ we get in analogy max |I|=S Ψ Iy 1 max |I|=S Ψ Φcp,σ(x) 1 c(x) η(x) + (c1(x) + . . . + c S(x)) 1 η(x) := 4εS i:εi =0 exp (c S(x) c S+1(x) 2µ x 1 ε2 Since c S(x) c S+1(x) 2µ x 1 β almost surely we have i:εi =0 exp almost surely and therefore max |I|=S Ψ Iy 1 ηβ + ( c1 + . . . 
+ c S) 1 Following the same argument as in the proof of Proposition 6 we see that Ey max|I|=S Φ Iy 1 > Ey max|I|=S Ψ Iy 1 as soon as B β( c1+...+ c S) . Local Identification of Overcomplete Dictionaries A.2 Proof of Theorem 3 Again we first reformulate and prove the theorem for the case of a symmetric coefficient distribution based on one sequence and then extend it with an integration argument. Proposition 7 Let Φ be a unit norm frame with frame constants A B and coherence µ. Let x RK be a random permutation of a sequence c, where c1 c2 c3 . . . c K 0 and c 2 = 1, provided with random signs, that is x = cp,σ with probability P(p, σ) = (2KK!) 1. Further let r = (r(1) . . . r(d)) be a centred random subgaussian noise-vector with parameter ρ and assume that the signals are generated according to the noisy signal model in (9). If we have max{µ, ρ} c S c S+1 p 72(log a + log log a) for a = 112K2S( B + 1) Cr(c S c S+1)(c1 + . . . + c S), (20) where Cr = Er (1 + r 2 2) 1/2 , then there is a local maximum of (5) at Ψ satisfying d( Ψ, Φ) 12SK2 B Cr(c1 + . . . + c S) exp (c S c S+1)2 72 max{µ2, ρ2} Proof To prove the proposition we digress from the conventional scheme of first calculating the expectation of our objective function for both the original and a perturbed dictionary and then comparing and instead bound the difference of the expectations directly. max |I|=S Φ Iy 1 max |I|=S Ψ Iy 1 Φ I(Φcp,σ + r) p 1 max |I|=S Ψ I(Φcp,σ + r) p max|I|=S Φ I(Φcp,σ + r) 1 max|I|=S Ψ I(Φcp,σ + r) 1 p := Ep,σ,r( p,σ,r) Again our strategy is to show that for a fixed p for most σ and r the maximal response of both the original dictionary and the perturbation is attained at Ip. The expressions we therefore need to lower (upper) bound for i Ip (i / Ip) are | φi, Φcp,σ + r | = σicp(i) + X j =i σjcp(j) φi, φj + φi, r , | ψi, Φcp,σ + r | = αiσicp(i) + αi X j =i σjcp(j) φi, φj + ωi zi, Φcp,σ + ψi, r . However, instead of using a worst case estimate for the gap between the responses of the original dictionary inside and outside Ip, we now make use of the fact that for most sign sequences we have a gap size of order c S c S+1. This means that as soon as | P j =i σjcp(j) φi, φj |, ωi| zi, Φcp,σ | and the noise related terms | φi, r | and | ψi, r | are of order (c S c S+1) the maximal response of both the original dictionary and the perturbation is attained at Ip. In particular, defining the sets j =i σjcp(j) φi, φj c S c S+1 6 or ωi| zi, Φcp,σ | c S c S+1 3ε2 for a fixed permutation p and r s.t. | φi, r | c S c S+1 3 or | ψi, r | c S c S+1 we see that both maxima are attained at Ip as long as σ / Σp and r / R. Using Hoeffding s inequality we get that j =i σjcp(j) φi, φj > t 2 P j =i c2 p(j)| φi, φj |2 while from the proof of Proposition 6 we know that for εi = 0 we have P(ωi| zi, Φcp,σ | s) 2 exp s2 . Setting t = (c S c S+1)/6, s = (c S c S+1 3ε2 2 )/6 and using a union bound then leads to P(Σp) 2K exp c S c S+1 3ε2 + 2K exp (c S c S+1)2 Since the r(i) are subgaussian with parameter ρ we have for any v = (v1 . . . vd) and t 0, P(| v, r | t) exp t2 2ρ2 v 2 2 (Vershynin, 2012). Taking a union bound over all φi, ψi with the corresponding choice for t then leads to the estimate P(R) 2K exp c S c S+1 2 c S c S+1 2 We now split the expectations over the sign and noise patterns for a fixed p to get EσEr( p,σ,r) = Eσ r/ R p,σ,rdνr r R p,σ,rdνr σ / Σp P(σ) Z r/ R p,σ,rdνr + X σ Σp P(σ) Z r/ R p,σ,rdνr r R p,σ,rdνr Next note that Σp is symmetric in the sense that we either have (σ1, . . . , σi, . . . , σK) Σp or (σ1, . . . , σi, . 
. . , σK) / Σp. Thus we get for the first term in (23), σ / Σp P(σ) Z r/ R p,σ,rdνr = Z σ / Σp P(σ) Φ Ip(Φcp,σ + r) 1 Ψ Ip(Φcp,σ + r) 1 p σ / Σp P(σ) P i Ip cp(i) ε2 i 2 p Local Identification of Overcomplete Dictionaries To bound the last two terms in (23) we first find an upper bound formax|I|=S Ψ I(Φcp,σ + r) 1: max |I|=S Ψ I(Φcp,σ + r) 1 = max |I|=S i I | αiφi + ωizi, Φcp,σ + r) | 1 ε2 i 2 | φi, Φcp,σ + r) | + εi Φcp,σ + r 2 2 | φi, Φcp,σ + r) | + ε 2 max |I|=S Φ I(Φcp,σ + r) 1 + εS This then leads to the following lower bound for p,σ,r: p,σ,r (1 + r 2 2) 1/2 max |I|=S Φ I(Φcp,σ + r) 1 ε2 (1 + r 2 2) 1/2 Φ Ip(Φcp,σ + r) 1 ε2 Using again the symmetry of Σp we have σ Σp P(σ) Z r/ R p,σ,rdνr Z σ Σp P(σ) Φ Ip(Φcp,σ + r) 1 ε2 1 + r 2 2 dνr P i Ip cp(i) ε2 1 + r 2 2 εS P i Ip cp(i) ε2 i 2 p 1 + r 2 2 εS and similarly r R p,σ,rdνr Φ Ip(Φcp,σ + r) 1 ε2 P i Ip cp(i) ε2 i 2 p 1 + r 2 2 εS Resubstituting into (23) we get EσEr( p,σ,r) Z P i Ip cp(i) ε2 i 2 p 1 + r 2 2 εS P i Ip cp(i) ε2 i 2 p 1 + r 2 2 εS σ / Σp P(σ) P i Ip cp(i) ε2 i 2 p P i Ip cp(i) ε2 i 2 p 1 + r 2 2 dνr εS B + 1 P(R) + P(Σp) . (24) Taking the expectation over the permutations then yields Ep,σ,r( p,σ,r) Er Ep P i Ip cp(i) ε2 i 2 p B + 1 P(R) + Ep P(Σp) ! c1 + . . . + c S B + 1 P(R) + Ep P(Σp) . Using the probability estimates from (21)/(22) we see that Ep,σ,r( p,σ,r) > 0 is implied by where we have used the abbreviations γ = c1 + . . . + c S, β = c S c S+1 and Cr = Er (1 + r 2 2) 1/2 . We now proceed by splitting the above condition. We define εmin by asking that 72 max{µ2, ρ2} and εmax implicitly by asking that Following the line of argument in the proof of Proposition 6 we see that the above condition is guaranteed as soon as log 112K2S( B+1) Crβγ := εmax. Local Identification of Overcomplete Dictionaries The statement follows from making sure that εmin < εmax. Proof [of Theorem 3] Using the pull-back probability measure νc we can write max |I|=S Φ Iy 1 max |I|=S Ψ Iy 1 c(x) Ep,σ,r p,σ,r,c(x) dνc, where p,σ,r,c(x) is defined analogue to p,σ,r in the last proof, that is replacing c by c(x). The statement follows from employing the lower estimate for Ep,σ,r p,σ,r,c(x) from (24) and replacing c1 +. . .+c S by c1 +. . .+ c S resp. c S c S+1 by its lower bound β in the proof of Proposition 7. A.3 Proof of Theorems 4 and 5 Since the proofs of Theorems 4 and 5 are conceptually equivalent we will combine them into one and just split the argument for the inevitable juggling of constants. Proof As outlined in the proof idea we need a Lipschitz property for the mapping Ψ 1 N PN n=1 max|I|=S Ψ Iyn 1 for both signal models, the concentration of the sum around its expectation for a δ net covering the space of all admissible dictionaries close to Φ and a triangle inequality argument to get to the final statement. To show the Lipschitz property we use a reverse triangle inequality: max |I|=S Ψ Iyn 1 max |I|=S Ψ Iyn 1 = max |I|=S Ψ Iyn ( Ψ I Ψ I)yn 1 max |I|=S Ψ Iyn 1 max |I|=S ( Ψ I Ψ I)yn 1 S max k ψk ψk 2 yn 2 Note that for the noise-free signal model we can replace B in the last expression. By averaging over n we get that the mapping in question is Lipschitz with constant S B + 1 in the noisy and S B in the noise-free case, that is 1 N n=1 max |I|=S Ψ Iyn 1 1 n=1 max |I|=S Ψ Iyn 1 To show that the averaged sums concentrate around their expectations we use our favourite tool Hoeffding s inequality. Set Xn = max|I|=S Φ Iyn 1 max|I|=S Ψ Iyn 1, then we have |Xn| εS B + 1 , resp. 
To show that the averaged sums concentrate around their expectations we use our favourite tool, Hoeffding's inequality. Set $X_n = \max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \max_{|I|=S}\|\Psi_I^\star y_n\|_1$; then we have $|X_n| \leq \varepsilon S\sqrt{B+1}$, resp. $|X_n| \leq \varepsilon S\sqrt{B}$ in the noise-free case, and get the estimate

$$P\bigg(\frac{1}{N}\sum_{n=1}^N\Big[\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \max_{|I|=S}\|\Psi_I^\star y_n\|_1\Big] \leq E\Big[\max_{|I|=S}\|\Phi_I^\star y_1\|_1 - \max_{|I|=S}\|\Psi_I^\star y_1\|_1\Big] - t\bigg) \leq \exp\Big(-\frac{N t^2}{2\,\varepsilon^2 S^2 (B+1)}\Big),$$

again with $B+1$ replaced by $B$ in the noise-free case.

Next we need to choose a $\delta$-net for all perturbations $\Psi$ with $d(\Phi,\Psi) \leq \varepsilon_{\max}$, that is a finite set of perturbations $\mathcal{N}$ such that for every $\Psi$ we can find $\tilde\Psi \in \mathcal{N}$ with $d(\Psi,\tilde\Psi) \leq \delta$. Recalling the parameterization of all $\varepsilon$-perturbations from the proof of Proposition 6, we see that the space we need to cover is included in the product of $K$ balls with radius $\varepsilon_{\max}$ in dimension $d$. For instance from Lemma 2 by Vershynin (2012) we know that for the $d$-dimensional ball of radius $\varepsilon_{\max}$ we can find a $\delta$-net $\mathcal{N}_d$ satisfying $|\mathcal{N}_d| \leq (1 + 2\varepsilon_{\max}/\delta)^d$, so for our space of $\varepsilon$-perturbations we can find a $\delta$-net $\mathcal{N}$ satisfying

$$|\mathcal{N}| \leq \big(1 + 2\varepsilon_{\max}/\delta\big)^{Kd}.$$

Taking a union bound we can now estimate the probability that we have concentration for all perturbations in the net as

$$P\bigg(\exists\,\tilde\Psi\in\mathcal{N}: \frac{1}{N}\sum_{n=1}^N\Big[\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \max_{|I|=S}\|\tilde\Psi_I^\star y_n\|_1\Big] \leq E\Big[\max_{|I|=S}\|\Phi_I^\star y_1\|_1 - \max_{|I|=S}\|\tilde\Psi_I^\star y_1\|_1\Big] - t\bigg) \leq |\mathcal{N}|\,\exp\Big(-\frac{Nt^2}{2\,\varepsilon_{\max}^2 S^2(B+1)}\Big).$$

Finally we are ready for the triangle inequality argument. For any $\Psi$ with $d(\Psi,\Phi) = \varepsilon \leq \varepsilon_{\max}$ we can find $\tilde\Psi \in \mathcal{N}$ with $d(\tilde\Psi,\Psi) \leq \delta$ and $d(\Phi,\tilde\Psi) = \tilde\varepsilon$, and therefore get

$$\frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Psi_I^\star y_n\|_1 = \bigg(\frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - E\max_{|I|=S}\|\Phi_I^\star y_1\|_1\bigg) + \bigg(E\max_{|I|=S}\|\Phi_I^\star y_1\|_1 - E\max_{|I|=S}\|\tilde\Psi_I^\star y_1\|_1\bigg)$$
$$+ \bigg(E\max_{|I|=S}\|\tilde\Psi_I^\star y_1\|_1 - \frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\tilde\Psi_I^\star y_n\|_1\bigg) + \bigg(\frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\tilde\Psi_I^\star y_n\|_1 - \frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Psi_I^\star y_n\|_1\bigg).$$

On the event that the concentration holds for all elements of the net, the first and third bracket are each bounded below by $-t$, while by the Lipschitz property the last bracket is bounded below by $-\delta S\sqrt{B+1}$, resp. $-\delta S\sqrt{B}$. Depending on the signal model we now have to substitute the values for the asymptotic differences $E\max_{|I|=S}\|\Phi_I^\star y_1\|_1 - E\max_{|I|=S}\|\tilde\Psi_I^\star y_1\|_1$ calculated in the previous proofs.

Under the conditions given in Theorem 4 we have

$$\frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Psi_I^\star y_n\|_1 \geq \frac{\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)}{2K} - \tilde\varepsilon S\sqrt{B}\,E_pP(\Sigma_p) - 2t - \delta S\sqrt{B}. \qquad (25)$$

To make sure that the above expression is larger than zero, we split it into two conditions. The first condition, $\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)/(4K) \geq \tilde\varepsilon S\sqrt{B}\,E_pP(\Sigma_p)$, is satisfied as soon as, as in the asymptotic analysis, $\varepsilon$ lies between the asymptotic recovery precision and the threshold determined by $\sqrt{B}$, $\beta$ and $(\bar c_1+\dots+\bar c_S)$. To concretise the second condition, $\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)/(4K) \geq 2t + \delta S\sqrt{B}$, we choose $t = \varepsilon^2(\bar c_1+\dots+\bar c_S)/(16K)$ and $\delta = \varepsilon^2(\bar c_1+\dots+\bar c_S)/(16KS\sqrt{B})$ to arrive at the requirement $\tilde\varepsilon^2 \geq \tfrac{3}{4}\varepsilon^2$. Given that $\tilde\varepsilon$ differs at most by $\delta$ from $\varepsilon$, we see that (25) is larger than zero except with probability

$$\exp\bigg(-\frac{N\,\varepsilon^4(\bar c_1+\dots+\bar c_S)^2}{128\,\varepsilon_{\max}^2 S^2K^2B} + Kd\log\Big(1 + \frac{32\,\varepsilon_{\max} KS\sqrt{B}}{\varepsilon^2(\bar c_1+\dots+\bar c_S)}\Big)\bigg),$$

as long as

$$\varepsilon_{\min} := \varepsilon_\mu + \frac{\varepsilon_\mu^2}{8K} \leq \varepsilon \leq \varepsilon_{\max}, \qquad (26)$$

where $\varepsilon_{\max}$ obeys the same bound in terms of $\beta$, $\sqrt{B}$ and $\beta(\bar c_1+\dots+\bar c_S)$ as in the asymptotic analysis. While for the asymptotic results we tried to make $\varepsilon_{\max}$ as large as possible to indicate how large the basin of attraction could be, for the finite sample size results we want it as small as possible in order to keep the sample complexity small, and therefore choose $\varepsilon_{\max} = \varepsilon_{\min}$. The statement then follows from making sure that the right most inequality in (26) is satisfied and simplifications.

In case of the noisy signal model, that is under the conditions given in Theorem 5, we have

$$\frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Psi_I^\star y_n\|_1 \geq \frac{C_r\,\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)}{2K} - 2t - \delta S\sqrt{B+1} - 2\tilde\varepsilon SK\sqrt{B+1}\Big[\exp\Big(-\frac{\beta^2}{72\rho^2}\Big)+\exp\Big(-\frac{\beta^2}{72\mu^2}\Big)+\exp\Big(-\frac{(\beta-\tfrac{3}{2}\tilde\varepsilon^2)^2}{72\,\tilde\varepsilon^2}\Big)\Big]. \qquad (27)$$

Splitting equally gives us four conditions, comparing $C_r\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)/(8K)$ in turn to $2t$, to $\delta S\sqrt{B+1}$, to the two $\tilde\varepsilon$-independent exponential terms, and to the remaining $\tilde\varepsilon$-dependent exponential term,

$$\frac{C_r\,\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)}{8K} \geq 2\tilde\varepsilon SK\sqrt{B+1}\,\exp\Big(-\frac{(\beta-\tfrac{3}{2}\tilde\varepsilon^2)^2}{72\,\tilde\varepsilon^2}\Big). \qquad (28)$$

Choosing $t = \varepsilon_{\mu,\rho}^2(\bar c_1+\dots+\bar c_S)/(32C_rK)$ and $\delta = \varepsilon_{\mu,\rho}^2(\bar c_1+\dots+\bar c_S)/(16KS\sqrt{B+1})$ we can merge the first three conditions to $\varepsilon \geq \varepsilon_{\mu,\rho}$, while, following the usual argument, Condition (28) is satisfied once $\varepsilon$ stays below the threshold determined by $\beta$ and $C_r(\bar c_1+\dots+\bar c_S)$ from the proof of Proposition 7. Given that $\tilde\varepsilon$ differs at most by $\delta$ from $\varepsilon$, we see that (27) is larger than zero except with probability

$$\exp\bigg(-\frac{N\,\varepsilon_{\mu,\rho}^4(\bar c_1+\dots+\bar c_S)^2}{512\,\varepsilon_{\max}^2 C_r^2 S^2K^2(B+1)} + Kd\log\Big(1 + \frac{32\,\varepsilon_{\max} KS\sqrt{B+1}}{\varepsilon_{\mu,\rho}^2(\bar c_1+\dots+\bar c_S)}\Big)\bigg),$$

as long as

$$\varepsilon_{\min} := \varepsilon_{\mu,\rho} + \frac{\varepsilon_{\mu,\rho}^2}{16K} \leq \varepsilon \leq \varepsilon_{\max}. \qquad (29)$$

Again the statement follows from choosing $\varepsilon_{\max} = \varepsilon_{\min}$, making sure that the right most inequality in (29) is satisfied and simplifications.
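To see how the covering number and the Hoeffding estimate combine into a sample size requirement, the following sketch solves the union bound for $N$. It only mirrors the structure of the argument; the function name, the placeholder parameter values and the constants are illustrative assumptions and not the ones appearing in Theorems 4 and 5.

```python
import numpy as np

def sample_size_bound(d, K, S, B, eps_max, delta, t, failure_prob=1e-3):
    """Smallest N such that |net| * exp(-N t^2 / (2 eps_max^2 S^2 (B + 1))) <= failure_prob,
    using the covering number bound |net| <= (1 + 2 eps_max / delta)^(K d).
    This only mirrors the structure of the union bound; the constants in the
    paper's Theorems 4 and 5 differ."""
    log_net = K * d * np.log(1.0 + 2.0 * eps_max / delta)
    rate = t**2 / (2.0 * eps_max**2 * S**2 * (B + 1.0))
    return int(np.ceil((log_net + np.log(1.0 / failure_prob)) / rate))

# Placeholder parameter values, not taken from the paper's experiments.
print(sample_size_bound(d=64, K=128, S=4, B=2.0, eps_max=0.1, delta=1e-3, t=1e-2))
```

Since $\log|\mathcal{N}|$ scales like $Kd$ and the deviation $t$ must shrink with the target precision, the required $N$ grows with $Kd$ and inversely with $t^2$, which is the mechanism behind the $O(K^3 d S \varepsilon^{-2})$ sample complexity discussed in the main text.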
References

A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries via alternating minimization. In COLT 2014 (arXiv:1310.7991), 2014a.

A. Agarwal, A. Anandkumar, and P. Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. In COLT 2014 (arXiv:1309.1952), 2014b.

M. Aharon, M. Elad, and A.M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.

S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In COLT 2014 (arXiv:1308.6273), 2014.

E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.

O. Christensen. An Introduction to Frames and Riesz Bases. Birkhäuser, 2003.

D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

D.L. Donoho, M. Elad, and V.N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, January 2006.

D.J. Field and B.A. Olshausen. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.

Q. Geng, H. Wang, and J. Wright. On the local correctness of ℓ1-minimization for dictionary learning. arXiv:1101.5672, 2011.

P. Georgiev, F.J. Theis, and A. Cichocki. Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Transactions on Neural Networks, 16(4):992–996, 2005.

A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Springer, 1992.

R. Gribonval and K. Schnass. Dictionary identifiability - sparse matrix-factorisation via ℓ1-minimisation. IEEE Transactions on Information Theory, 56(7):3523–3539, July 2010.

R. Gribonval, R. Jenatton, F. Bach, M. Kleinsteuber, and M. Seibert. Sample complexity of dictionary learning and other matrix factorizations. arXiv:1312.3790, 2013.

D. Hsu, S.M. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability (arXiv:1110.2842), 17(14), 2012.

R. Jenatton, F. Bach, and R. Gribonval. Sparse and spurious: dictionary learning with noise and outliers. arXiv:1407.5155, 2014.

K. Kreutz-Delgado and B.D. Rao. FOCUSS-based dictionary learning algorithms. In SPIE 4119, 2000.

K. Kreutz-Delgado, J.F. Murray, B.D. Rao, K. Engan, T. Lee, and T.J. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computations, 15(2):349–396, 2003.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.

A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, 2010.

N.A. Mehta and A.G. Gray. On the sample complexity of predictive sparse coding. arXiv:1202.4050, 2012.

M.D. Plumbley. Dictionary learning for ℓ1-exact sparse coding. In M.E. Davies, C.J. James, and S.A. Abdallah, editors, International Conference on Independent Component Analysis and Signal Separation, volume 4666, pages 406–413. Springer, 2007.

R. Rubinstein, A. Bruckstein, and M. Elad. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6):1045–1057, 2010.
K. Schnass. On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD. Applied and Computational Harmonic Analysis, 37(3):464–491, 2014.

K. Schnass and P. Vandergheynst. Average performance analysis for thresholding. IEEE Signal Processing Letters, 14(11):828–831, 2007.

K. Schnass and P. Vandergheynst. Dictionary preconditioning for greedy algorithms. IEEE Transactions on Signal Processing, 56(5):1994–2002, 2008.

K. Skretting and K. Engan. Recursive least squares dictionary learning algorithm. IEEE Transactions on Signal Processing, 58(4):2121–2130, April 2010.

D. Spielman, H. Wang, and J. Wright. Exact recovery of sparsely-used dictionaries. In COLT 2012 (arXiv:1206.5882), 2012.

J.A. Tropp. On the conditioning of random subdictionaries. Applied and Computational Harmonic Analysis, 25(1):1–24, 2008.

D. Vainsencher, S. Mannor, and A.M. Bruckstein. The sample complexity of dictionary learning. Journal of Machine Learning Research, 12:3259–3281, 2011.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, chapter 5. Cambridge University Press, 2012.

M. Yaghoobi, T. Blumensath, and M.E. Davies. Dictionary learning for sparse approximations with the majorization method. IEEE Transactions on Signal Processing, 57(6):2178–2191, June 2009.

M. Zibulevsky and B.A. Pearlmutter. Blind source separation by sparse decomposition in a signal dictionary. Neural Computations, 13(4):863–882, 2001.