# Local Identification of Overcomplete Dictionaries

Journal of Machine Learning Research 16 (2015) 1211-1242. Submitted 1/14; Revised 7/14; Published 6/15.

Karin Schnass, karin.schnass@uibk.ac.at, Department of Mathematics, University of Innsbruck, Technikerstraße 13, 6020 Innsbruck, Austria.

Editor: Shie Mannor

This paper presents the first theoretical results showing that stable identification of overcomplete $\mu$-coherent dictionaries $\Phi \in \mathbb{R}^{d \times K}$ is locally possible from training signals with sparsity levels $S$ up to the order $O(\mu^{-2})$ and signal to noise ratios up to $O(\sqrt{d})$. In particular the dictionary is recoverable as the local maximum of a new maximization criterion that generalizes the K-means criterion. For this maximization criterion results for asymptotic exact recovery for sparsity levels up to $O(\mu^{-1})$ and stable recovery for sparsity levels up to $O(\mu^{-2})$ as well as signal to noise ratios up to $O(\sqrt{d})$ are provided. These asymptotic results translate to finite sample size recovery results with high probability as long as the sample size $N$ scales as $O(K^3 d S\, \varepsilon^{-2})$, where the recovery precision $\varepsilon$ can go down to the asymptotically achievable precision. Further, to actually find the local maxima of the new criterion, a very simple Iterative Thresholding and K (signed) Means algorithm (ITKM), which has complexity $O(dKN)$ in each iteration, is presented and its local efficiency is demonstrated in several experiments.

Keywords: dictionary learning, dictionary identification, sparse coding, sparse component analysis, vector quantization, K-means, finite sample size, sample complexity, maximization criterion, sparse representation

1. Introduction

Be it the 300 million photos uploaded to Facebook per day, the 800 GB the Large Hadron Collider records per second or the 320,000 GB per second it cannot record, it is clear that we have reached the age of big data. Indeed, in 2012 the amount of data existing worldwide is estimated to have reached 2.8 ZB, that is 2,800 billion GB, and while 23% of these data are expected to be useful if analyzed, only 1% actually are analyzed. So how do we deal with this big data challenge? The key concept that has driven data processing and data analysis in the past decade is that even high-dimensional data has intrinsically low complexity, meaning that every data point $y$ can be represented as a linear combination of a sparse (small) number of elements or atoms $\phi_i \in \mathbb{R}^d$, $\|\phi_i\|_2 = 1$, of an overcomplete dictionary $\Phi = (\phi_1, \ldots, \phi_K)$, that is,
$$y \approx \Phi_I x_I = \sum_{i \in I} x(i)\,\phi_i,$$
for an index set $I$ of size $|I| = S$, which is small compared to the ambient dimension, $S \ll d \leq K$. These sparse components do not only describe the data, but the representations can also be used for a myriad of efficient sparsity based data processing schemes, ranging from denoising (Donoho et al., 2006) to compressed sensing (Donoho, 2006; Candès et al., 2006). Therefore a promising tool both for data analysis and data processing, which has emerged in the last years, is dictionary learning, also known as sparse coding or sparse component analysis. Dictionary learning addresses the fundamental question of how to automatically learn a dictionary providing sparse representations for a given data class. That is, given $N$ signals $y_n \in \mathbb{R}^d$, stored as columns in a matrix $Y = (y_1, \ldots, y_N)$, find a decomposition $Y \approx \Phi X$ into a $d \times K$ dictionary matrix $\Phi$ with unit norm columns and a $K \times N$ coefficient matrix $X$ with sparse columns.
Until recently the main research focus in dictionary learning has been on the development of algorithms. Thus by now there is an ample choice of learning algorithms that perform well in experiments and are popular in applications (Field and Olshausen, 1996; Kreutz-Delgado and Rao, 2000; Kreutz-Delgado et al., 2003; Aharon et al., 2006; Yaghoobi et al., 2009; Mairal et al., 2010; Skretting and Engan, 2010; Rubinstein et al., 2010). However, slowly the interest is shifting and researchers are starting to investigate also the theoretical aspects of dictionary learning. Following the first theoretical insights, originating in the blind source separation community (Zibulevsky and Pearlmutter, 2001; Georgiev et al., 2005), there is now a set of generalization bounds predicting how well a learned dictionary can be expected to sparsely approximate future data (Maurer and Pontil, 2010; Vainsencher et al., 2011; Mehta and Gray, 2012; Gribonval et al., 2013). These results give a theoretical foundation for dictionary learning as a data processing tool, for example for compression, but unfortunately do not guarantee that an efficient algorithm will find/recover a good dictionary provided that it exists. However, in order to justify the use of dictionary learning as a data analysis tool, for instance in blind source separation, it is important to provide conditions under which an algorithm or scheme can identify the dictionary from a finite number of training signals, that is, the sources from the mixtures. Following the first dictionary identification results for the $\ell_1$-minimization principle (Gribonval and Schnass, 2010; Geng et al., 2011; Jenatton et al., 2014), which was suggested by Zibulevsky and Pearlmutter (2001) and Plumbley (2007), and for the ER-SpUD algorithm for learning a basis (Spielman et al., 2012), 2013 has seen a number of interesting developments. First it was shown that the K-SVD minimization principle suggested by Aharon et al. (2006) can locally identify overcomplete tight dictionaries (Schnass, 2014). Later, algorithms with global identification guarantees for coherent dictionaries were presented (Arora et al., 2014; Agarwal et al., 2014b). Finally it was shown that an alternating minimization method is locally convergent to the correct generating dictionary (Agarwal et al., 2014a). One aspect that all these results have in common is that the sparsity level of the training signals required for successful identification is of order $O(\mu^{-1})$, or $O(\sqrt{d})$ for incoherent dictionaries. Considering that on average sparse recovery in a given dictionary is successful for sparsity levels $O(\mu^{-2})$ (Tropp, 2008; Schnass and Vandergheynst, 2007), and that for dictionary learning we usually have a lot of training signals at our disposal, the same sparsity level should be sufficient for dictionary learning. Indeed, in this paper we provide the first indication that global dictionary identification could be possible for sparsity levels $O(\mu^{-2})$ by proving that it is locally possible. Further we show in experiments that a very simple iterative algorithm, based on thresholding and K signed means, is locally successful. The paper is organized as follows. After introducing all necessary notation in Section 2, we present a new optimization criterion, motivated by the analysis of the K-SVD principle (Schnass, 2014), in Section 3.
In Section 4 we give asymptotic identification results both for exact and stable recovery, which in Section 5 are extended to results for finite sample sizes. Section 6 provides an algorithm for actually finding a local optimum and some experiments confirming the theoretical results. Finally, in the last section, we compare the results for the new criterion to existing identification results, discuss the implications of these local results for global dictionary identification algorithms and point out directions for future research.

2. Notations and Conventions

Before we jump into the fray, we collect some definitions and lose a few words on notations; usually subscripted letters will denote vectors, with the exception of $c$ and $\varepsilon$ where they are numbers, e.g., $(x_1, \ldots, x_K) = X \in \mathbb{R}^{d \times K}$ vs. $c = (c_1, \ldots, c_K) \in \mathbb{R}^K$; however, it should always be clear from the context what we are dealing with.

We consider a dictionary $\Phi$ a collection of $K$ unit norm vectors $\phi_i \in \mathbb{R}^d$, $\|\phi_i\|_2 = 1$. By abuse of notation we will also refer to the $d \times K$ matrix collecting the atoms as its columns as the dictionary, that is $\Phi = (\phi_1, \ldots, \phi_K)$. The maximal absolute inner product between two different atoms is called the coherence $\mu$ of the dictionary, $\mu = \max_{i \neq j} |\langle \phi_i, \phi_j \rangle|$. By $\Phi_I$ we denote the restriction of the dictionary to the atoms indexed by $I$, that is $\Phi_I = (\phi_{i_1}, \ldots, \phi_{i_S})$, $i_j \in I$. We indicate the conjugate transpose of a matrix with a $\star$, for example $\Phi^\star$ would be the transpose of $\Phi$. The set of all dictionaries of a given size ($d \times K$) is denoted by $\mathcal{D}$. For two dictionaries $\Phi, \Psi \in \mathcal{D}$ we define the distance between each other as the maximal distance between two corresponding atoms,
$$d(\Phi, \Psi) := \max_i \|\phi_i - \psi_i\|_2.$$
We consider a frame $F$ a collection of $K \geq d$ vectors $f_i \in \mathbb{R}^d$ for which there exist two positive frame constants $A, B$ such that for all $v \in \mathbb{R}^d$ we have
$$A\,\|v\|_2^2 \leq \sum_{i=1}^K |\langle f_i, v \rangle|^2 \leq B\,\|v\|_2^2. \qquad (1)$$
From (1) it follows that $F$, interpreted as a $d \times K$ matrix, has rank $d$ and that its non-zero singular values are contained in the interval $[\sqrt{A}, \sqrt{B}]$. If $B$ can be chosen equal to $A$, that is $B = A$, the frame is called tight. If all frame elements $f_i$ have unit norm, we call $F$ a unit norm frame. For more details on frames see for instance the introduction by Christensen (2003). Finally we introduce the Landau symbols $O, o$ to characterise the growth of a function. We write $f(t) = O(g(t))$ if $\lim_{t \to 0} f(t)/g(t) = C < \infty$ and $f(t) = o(g(t))$ if $\lim_{t \to 0} f(t)/g(t) = 0$.

3. A Response Maximization Criterion

One of the origins of dictionary learning can be found in the field of vector quantization, where the aim is to find a codebook (dictionary) such that the codewords (atoms) closely represent the data, that is
$$\min_{\Phi, X} \|Y - \Phi X\|_F^2 \quad \text{s.t.} \quad x_n \in \{e_i\}_i.$$
Indeed the vector quantization problem can be seen as an extreme case of dictionary learning, where we do not only want all our signals to be approximately 1-sparse but also want the single non-zero coefficient to be equal to one. On the other hand we allow the atoms (codewords) to have any length. The problem above is usually solved by a K-means algorithm, which alternately separates the training data into $K$ clusters, each assigned to one codeword, and then updates each codeword to be the mean of the associated training signals. For more detailed information about vector quantization or the K-means algorithm see for instance the book by Gersho and Gray (1992) or the introduction by Aharon et al. (2006).
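To make the notions of Section 2 concrete, the following small sketch (Python/NumPy, with purely illustrative dimensions) computes the coherence $\mu$, the frame constants $A, B$ and the distance $d(\Phi, \Psi)$ exactly as defined above.

```python
import numpy as np

def coherence(Phi):
    # maximal absolute inner product between two different atoms
    G = np.abs(Phi.T @ Phi)
    np.fill_diagonal(G, 0)
    return G.max()

def frame_bounds(Phi):
    # A and B are the smallest and largest squared (non-zero) singular values
    s = np.linalg.svd(Phi, compute_uv=False)
    return s[-1] ** 2, s[0] ** 2

def dictionary_distance(Phi, Psi):
    # maximal distance between corresponding atoms
    return np.linalg.norm(Phi - Psi, axis=0).max()

# illustrative example: d = 16, K = 32, random unit norm atoms
rng = np.random.default_rng(0)
Phi = rng.standard_normal((16, 32))
Phi /= np.linalg.norm(Phi, axis=0)
A, B = frame_bounds(Phi)
print(coherence(Phi), A, B)
```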
If we relax the condition that each coefficient has to be positive, but in turn ask for the atoms to have unit norm, we are already getting closer to the concept of 1-sparse dictionary learning,
$$\min_{\Phi \in \mathcal{D}, X} \|Y - \Phi X\|_F^2 \quad \text{s.t.} \quad x_n \in \{\pm e_i\}_i.$$
This minimization problem can be rewritten as
$$\min_{\Phi \in \mathcal{D}} \sum_n \min_{i, \sigma_i = \pm 1} \|y_n - \sigma_i \phi_i\|_2^2 = \min_{\Phi \in \mathcal{D}} \sum_n \min_{i, \sigma_i = \pm 1} \big( \|y_n\|_2^2 - 2\sigma_i \langle y_n, \phi_i \rangle + \|\phi_i\|_2^2 \big) = \|Y\|_F^2 + N - 2 \max_{\Phi \in \mathcal{D}} \sum_n \max_i |\langle y_n, \phi_i \rangle|,$$
and is therefore equivalent to the maximization problem
$$\max_{\Phi \in \mathcal{D}} \sum_n \max_i |\langle y_n, \phi_i \rangle|. \qquad (2)$$
A local maximum of (2) can be found with a signed K-means algorithm, which assigns each training signal to the atom of the current dictionary giving the largest response in absolute value and then updates each atom as the normalized signed mean of the associated training signals, see Section 6 for more details.

The question now is how to go from these 1-sparse dictionary learning formulations to $S$-sparse formulations with $S > 1$. The most common generalization, which provides the starting point for the MOD and the K-SVD algorithms, is to give up all constraints on the coefficients except for $S$-sparsity and to minimize,
$$(\mathrm{P_{P2}}) \qquad \min_{\Phi \in \mathcal{D}, X} \|Y - \Phi X\|_F^2 \quad \text{s.t.} \quad \|x_n\|_0 \leq S. \qquad (3)$$
However, rewriting the problem we see that this formulation does not reduce to the same maximization problem in case $S = 1$. Then the best one term approximation in the given dictionary is simply the largest projection onto one atom and we have
$$\min_{\Phi \in \mathcal{D}} \sum_n \min_{i, x_i} \|y_n - x_i \phi_i\|_2^2 = \min_{\Phi \in \mathcal{D}} \sum_n \min_i \|y_n - \langle \phi_i, y_n \rangle \phi_i\|_2^2 = \|Y\|_F^2 - \max_{\Phi \in \mathcal{D}} \sum_n \max_i |\langle y_n, \phi_i \rangle|^2,$$
leading instead to the maximization problem
$$\max_{\Phi \in \mathcal{D}} \sum_n \max_i |\langle y_n, \phi_i \rangle|^2 \quad \text{vs.} \quad \max_{\Phi \in \mathcal{D}} \sum_n \max_i |\langle y_n, \phi_i \rangle|.$$
A local maximum can now be found using the same partitioning strategy as before, but updating each atom as the largest singular vector rather than the signed mean of the associated training signals, requiring K SVDs as opposed to K means.

While the minimization problem (3) is certainly the most effective generalization for dictionary learning when the goal is compression, it brings with it some complications when used as an analysis tool. Indeed it has been shown (Schnass, 2014) that for $S = 1$ the K-SVD criterion (3) can only identify the underlying dictionary from sparse random mixtures to arbitrary precision (given enough training samples) if this dictionary is tight, and it is conjectured that the same holds for $S \geq 1$. Roughly simplified, the reason for this is that for random sparse signals $\Phi_I x_I$ and an $\varepsilon$-perturbation $\Psi$ of the generating dictionary the average of the largest squared response decreases like $1 - \varepsilon^2$, up to higher order terms and a dictionary dependent contribution $c(\Psi)$. If $\Phi$ is tight the term $c(\Psi)$ is constant over all dictionaries and therefore there is a local maximum at $\Phi$. From the analogous expansion we also see that the average of the largest absolute response again decreases like $1 - \varepsilon^2$, but now without the contribution $c(\Psi)$ being able to spoil the argument, meaning that we should have a maximum at $\Phi$ also if it is non-tight. This suggests as an alternative way to generalize the K-means optimization principle for dictionary identification to simply maximize the sum of the absolute values of the $S$ largest responses,
$$(\mathrm{P_{R1}}) \qquad \max_{\Psi \in \mathcal{D}} \sum_n \max_{|I|=S} \|\Psi_I^\star y_n\|_1. \qquad (4)$$
Unlike for the K-SVD criterion it is not obvious that there should be a local optimum of (4) at $\Phi$ even if all signals $y_n$ are perfectly $S$-sparse in $\Phi$. Therefore it is quite intriguing that we will not only be able to prove local identifiability of any generating dictionary via (4) from randomly sparse signals, but also that these identifiability properties are stable under coherence and noise.
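As an illustration of the new criterion, the following sketch (Python/NumPy, with purely illustrative sizes) evaluates the objective of (4) for a given dictionary: for each signal it keeps the $S$ largest responses in absolute value, which is exactly a thresholding operation, and sums their absolute values.

```python
import numpy as np

def s_response(Psi, Y, S):
    # absolute responses of all atoms to all signals, K x N
    R = np.abs(Psi.T @ Y)
    # for each signal keep the S largest absolute responses and sum them
    topS = np.partition(R, -S, axis=0)[-S:, :]
    return topS.sum()

# toy example with a random unit norm dictionary and unit norm signals
rng = np.random.default_rng(1)
d, K, N, S = 16, 32, 500, 4
Psi = rng.standard_normal((d, K)); Psi /= np.linalg.norm(Psi, axis=0)
Y = rng.standard_normal((d, N));   Y /= np.linalg.norm(Y, axis=0)
print(s_response(Psi, Y, S))
```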
However, before we get to the main result in Theorem 5, we first have to lay the foundation, by providing suitable random signal models and by studying the asymptotic identifiability properties of the new principle.

4. Asymptotic Results

We can get to the asymptotic version of the $S$-response maximization principle in (4) simply by replacing the sum over the training signals with the expectation, leading to
$$\max_{\Psi \in \mathcal{D}} \ \mathbb{E}_y \max_{|I|=S} \|\Psi_I^\star y\|_1. \qquad (5)$$
Next we need a random sparse coefficient model to generate our signals $y$. We make the following definition, see also Schnass (2014).

Definition 1 A probability distribution (measure) $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$ is called symmetric if for all measurable sets $\mathcal{X} \subseteq S^{K-1}$, for all sign sequences $\sigma \in \{-1, 1\}^K$ and all permutations $p$ we have
$$\nu(\sigma \mathcal{X}) = \nu(\mathcal{X}), \quad \text{where } \sigma \mathcal{X} := \{(\sigma_1 x_1, \ldots, \sigma_K x_K) : x \in \mathcal{X}\}, \quad \text{and}$$
$$\nu(p(\mathcal{X})) = \nu(\mathcal{X}), \quad \text{where } p(\mathcal{X}) := \{(x_{p(1)}, \ldots, x_{p(K)}) : x \in \mathcal{X}\}.$$
Setting $y = \Phi x$, where $x$ is drawn from a symmetric probability measure $\nu$ on the unit sphere, has the advantage that for dictionaries which are orthonormal bases the resulting signals have unit norm, and for general dictionaries the signals have unit square norm in expectation, that is $\mathbb{E}(\|y\|_2^2) = 1$. This reflects the situation in practical applications, where we would normalize the signals in order to equally weight their importance. One example of such a probability measure can be constructed from a non-negative, non-increasing sequence $c \in \mathbb{R}^K$ with $\|c\|_2 = 1$, which we permute uniformly at random and provide with random signs. To be precise, for a permutation $p : \{1, \ldots, K\} \to \{1, \ldots, K\}$ and a sign sequence $\sigma$, $\sigma_i = \pm 1$, we define the sequence $c_{p,\sigma}$ component-wise as $c_{p,\sigma}(i) := \sigma_i c_{p(i)}$, and set $\nu(x) = (2^K K!)^{-1}$ if there exist $p, \sigma$ such that $x = c_{p,\sigma}$ and $\nu(x) = 0$ otherwise. While being very simple, this measure exhibits all the necessary structure, and indeed in our proofs we will reduce the general case of a symmetric measure to this simple case.

So far we have not incorporated any sparse structure in our coefficient distribution. To motivate the sparsity requirements on our coefficients we will recycle the simple negative example of a sparse coefficient distribution for which the original generating dictionary is not at a local maximum of (5) with $S = 1$ (Schnass, 2014).

Example 1 Let $U$ be an orthonormal basis and let the signals be constructed as $y = Ux$. If $x$ is randomly 2-sparse with flat coefficients, that is, drawn from the simple symmetric probability measure with base sequence $c$, where $c_1 = c_2 = 1/\sqrt{2}$ and $c_i = 0$ for $i \geq 3$, then $U$ is not a local maximum of (5) with $S = 1$. Indeed, since the signals are all 2-sparse, the maximal inner product with all atoms in $U$ is the same as the maximal inner product with only $d - 1$ atoms. This degree of freedom we can use to construct an ascent direction. Choose $U_\varepsilon = (u_1, \ldots, u_{d-1}, (u_d + \varepsilon u_1)/\|u_d + \varepsilon u_1\|_2)$; a direct computation of the expected maximal responses, going through the $d(d-1)$ possible locations of the two non-zero coefficients, then shows that $\mathbb{E}_y(\|U_\varepsilon^\star y\|_\infty) > \mathbb{E}_y(\|U^\star y\|_\infty)$ for all sufficiently small $\varepsilon > 0$.

From the above example we see that, in order to have a local maximum of (5) with $S = 1$ at the original dictionary, we need our signals to be truly 1-sparse, that is, we need to have a decay between the first and the second largest coefficient. In the following sections we will study how large this decay should be to have a local maximum exactly at or near to the generating dictionary for more general dictionaries and sparsity levels.
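A minimal sketch of this simple symmetric coefficient model: a fixed non-negative, non-increasing base sequence $c$ with $\|c\|_2 = 1$ is permuted uniformly at random and provided with random signs, and the signal is $y = \Phi c_{p,\sigma}$. The concrete decay profile and dimensions below are purely illustrative.

```python
import numpy as np

def sample_coefficients(c, rng):
    # x = c_{p,sigma}: random permutation p and random signs sigma of c
    return c[rng.permutation(len(c))] * rng.choice([-1.0, 1.0], size=len(c))

rng = np.random.default_rng(2)
d, K, S = 16, 32, 4
# base sequence: S equally large coefficients, fast decay afterwards (illustrative)
c = np.concatenate([np.ones(S), 0.1 * 0.5 ** np.arange(K - S)])
c /= np.linalg.norm(c)                      # unit ell_2 norm
Phi = rng.standard_normal((d, K))
Phi /= np.linalg.norm(Phi, axis=0)          # unit norm atoms
x = sample_coefficients(c, rng)
y = Phi @ x                                 # noiseless training signal y = Phi c_{p,sigma}
```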
4.1 Exact Recovery

To warm up we first provide an asymptotic exact dictionary identification result for (5) for incoherent dictionaries in the noiseless setting.

Theorem 2 Let $\Phi$ be a unit norm frame with frame constants $A \leq B$ and coherence $\mu$. Let the coefficients $x$ be drawn from a symmetric probability distribution $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$ and assume that the signals are generated as $y = \Phi x$. If there exists $\beta > 0$ such that for $c_1(x) \geq c_2(x) \geq \ldots \geq c_K(x) \geq 0$, the non-increasing rearrangement of the absolute values of the components of $x$, we have $c_S(x) - c_{S+1}(x) - 2\mu \|x\|_1 \geq \beta$ almost surely, that is
$$\nu\big(c_S(x) - c_{S+1}(x) - 2\mu \|x\|_1 \geq \beta\big) = 1, \qquad (6)$$
then there is a local maximum of (5) at $\Phi$. Moreover for $\Psi \neq \Phi$ we have $\mathbb{E}_y \max_{|I|=S} \|\Psi_I^\star y\|_1 < \mathbb{E}_y \max_{|I|=S} \|\Phi_I^\star y\|_1$ as soon as
$$0 < d(\Phi, \Psi) \lesssim \frac{\beta}{\sqrt{B}} \left( \log \frac{S K^2 \sqrt{B}}{\beta\,(\bar{c}_1 + \ldots + \bar{c}_S)} \right)^{-1}, \qquad (7)$$
where $\bar{c}_i := \mathbb{E}_x(c_i(x))$.

Proof idea We briefly sketch the main ideas of the proof, which are the same as for the corresponding theorem for the K-SVD principle (Schnass, 2014). For self-containedness of the paper the full proof is included in Appendix A.1. Assume that we have the case of a simple probability measure based on one sequence $c$, that is $x = c_{p,\sigma}$. For any fixed permutation $p$ the condition in (6) ensures that for all sign sequences $\sigma$, and consequently all signals, the maximal $S$ responses of the original dictionary $\Phi$ are attained at $I_p = p^{-1}(\{1, \ldots, S\})$ and that there is a gap of size $\beta$ to the remaining responses. For an $\varepsilon$-perturbation of the generating dictionary we have $\psi_i \approx (1 - \varepsilon_i^2/2)\phi_i + \varepsilon_i z_i$ for some unit vectors $z_i$ with $\langle z_i, \phi_i \rangle = 0$ and $\varepsilon_i \leq \varepsilon$. Now for most sign sequences the contribution of $\varepsilon_i z_i$ to the response $\langle \psi_i, \Phi c_{p,\sigma} \rangle$ will be smaller than $\beta/2$, so the maximal $S$ responses will still be attained at $I_p$. Comparing the loss of the perturbed dictionary over the typical sign sequences of all permutations, which scales as $\frac{c_1 + \ldots + c_S}{2K} \sum_i \varepsilon_i^2$, to the maximal gain $S\varepsilon\sqrt{B}$ over the approximately $2\sum_i \exp(-\beta^2/\varepsilon_i^2)$ atypical sign sequences shows that there is a maximum at the original dictionary. The general result follows from an integration argument.

As already mentioned, while for the K-SVD criterion (3) there is always an optimum at the generating dictionary if all training signals are $S$-sparse, this is not obvious for the response principle. Indeed, in the special case where all the training signals are exactly $S$-sparse, $c_{S+1}(x) = 0$ almost surely, we get an additional condition to ensure asymptotic recoverability,
$$c_S(x) - 2\mu \sum_{s=1}^S c_s(x) \geq \beta > 0, \quad \text{almost surely}.$$
To get a better feeling for this constraint we bound the sum over the $S$ largest coefficients by $S$ times the largest one, $\sum_{s=1}^S c_s(x) \leq S c_1(x)$, and arrive at the condition
$$c_S(x) \geq 2S\mu\, c_1(x), \qquad (8)$$
which is the classical condition under which simple thresholding will find the support of an exactly $S$-sparse signal (Schnass and Vandergheynst, 2008).

4.2 Stability under Coherence and Noise

While giving a first insight into the identification properties of the response principle, Theorem 2 suffers from two main limitations. First, the required condition on the coherence of the dictionary with respect to the decay of the coefficients, $c_S(x) - c_{S+1}(x) - 2\mu \|c(x)\|_1 > 0$, is unfortunately quite strict. In the most favourable case of exactly $S$-sparse signals with equally sized coefficients, $c_1(x) = \ldots = c_S(x) = 1/\sqrt{S}$, we see from (8) that we can only identify dictionaries from very sparse signals, where $S \lesssim \mu^{-1}$. In the case of very incoherent dictionaries with $\mu = O(1/\sqrt{d})$ this means that $S \lesssim \sqrt{d}$.
However, for most sign sequences $\sigma$ we have
$$|\langle \phi_i, \Phi c_{p,\sigma} \rangle| = \Big| \sigma_i c_{p(i)} + \sum_{j \neq i} \sigma_j c_{p(j)} \langle \phi_i, \phi_j \rangle \Big| \approx c_{p(i)} \pm O\bigg( \Big( \sum_{j \neq i} c_{p(j)}^2 |\langle \phi_i, \phi_j \rangle|^2 \Big)^{1/2} \bigg),$$
which indicates that a condition of the form $\mu \lesssim c_S - c_{S+1}$ may be strong enough to guarantee (approximate) recoverability of the dictionary. Assuming again the most favourable case of equally sized coefficients, we could therefore identify dictionaries from signals with sparsity levels of the order $S \lesssim \mu^{-2}$, which, in the case of incoherent dictionaries, means of the order of the ambient dimension, $S \lesssim d$.

The second limitation of Theorem 2 is that, even if it allows for not exactly $S$-sparse signals, it does not take into account noise. Our next goal is therefore to extend the exact identification result in Theorem 2 to a stable identification result for less sparse (larger $S$) and noisy signals. For this task we first need to amend our signal model to incorporate noise. We would like to consider unbounded white noise, but also keep the property that in expectation the signals have unit square norm. Further, for the next section, where we want to transform our asymptotic identification results into results for finite sample sizes, it will be convenient if our signals are bounded. These considerations lead to the following model:
$$y = \frac{\Phi x + r}{\sqrt{1 + \|r\|_2^2}}, \qquad (9)$$
where $r = (r(1), \ldots, r(d))$ is a centred random subgaussian vector with parameter $\rho$. That is, the entries $r(i)$ are independent and satisfy $\mathbb{E}(e^{t r(i)}) \leq e^{t^2 \rho^2 / 2}$. Employing this noisy signal model and formalizing the ideas about the typical gap size between responses of the generating dictionary inside and outside the true support leads to the following theorem.

Theorem 3 Let $\Phi$ be a unit norm frame with frame constants $A \leq B$ and coherence $\mu$. Let the coefficients $x$ be drawn from a symmetric probability distribution $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$. Further let $r = (r(1), \ldots, r(d))$ be a centred random subgaussian noise vector with parameter $\rho$ and assume that the signals are generated according to the noisy signal model in (9). If there exists $\beta > 0$ such that for $c_1(x) \geq c_2(x) \geq \ldots \geq c_K(x) \geq 0$, the non-increasing rearrangement of the absolute values of the components of $x$, we have $c_S(x) - c_{S+1}(x) \geq \beta$ almost surely and
$$\max\{\mu, \rho\} \leq \frac{\beta}{\sqrt{72(\log a + \log\log a)}} \quad \text{for} \quad a = \frac{112 K^2 S (\sqrt{B} + 1)}{C_r \beta\,(\bar{c}_1 + \ldots + \bar{c}_S)}, \qquad (10)$$
where $C_r = \mathbb{E}_r\big((1 + \|r\|_2^2)^{-1/2}\big)$ and $\bar{c}_i := \mathbb{E}_x(c_i(x))$, then there is a local maximum of (5) at some $\tilde{\Psi}$ satisfying
$$d(\tilde{\Psi}, \Phi) \leq \frac{12 S K^2 \sqrt{B}}{C_r (\bar{c}_1 + \ldots + \bar{c}_S)} \exp\left( -\frac{\beta^2}{72 \max\{\mu^2, \rho^2\}} \right).$$

Proof idea As outlined at the beginning of the section, the main ingredient we have to add to the proof idea of Theorem 2 is a probabilistic argument to substitute the condition guaranteeing that the $S$ largest responses of the generating dictionary are attained at $I_p$. Due to concentration of measure we get that for most sign sequences, and therefore most signals, the maximum is still attained at $I_p$. Moreover the gap to the remaining responses is actually large enough to accommodate relatively high levels of noise and/or perturbation. The detailed proof can be found in Appendix A.2.

Let us make some observations about the last result. First, we want to point out that for subgaussian noise with parameter $\rho$ the quantity $C_r = \mathbb{E}_r\big((1 + \|r\|_2^2)^{-1/2}\big)$ in the statement above is well behaved. If for example the $r(i)$ are i.i.d. Bernoulli variables, that is $\mathbb{P}(r(i) = \pm\rho) = \frac{1}{2}$, we have $C_r = (1 + d\rho^2)^{-1/2}$. In general we have the following estimate, due for instance to Theorem 1 by Hsu et al. (2012).
Since we have $\mathbb{P}\big(\|r\|_2^2 \geq \rho^2 (d + 2\sqrt{dt} + 2t)\big) \leq e^{-t}$, setting $t = d$ we get $\mathbb{P}\big(\|r\|_2^2 \geq 5d\rho^2\big) \leq e^{-d}$, which leads to $C_r \geq (1 - e^{-d})(1 + 5d\rho^2)^{-1/2}$.

Also, to illustrate the result, we again specialize it to the most favourable case of exactly $S$-sparse signals with balanced coefficients, that is $c_S(x) = S^{-1/2}$. Assuming white Gaussian noise with variance $\rho_G^2$ we see that identification is possible even for expected signal to noise ratios of the order $O(S/d)$, that is
$$\frac{\mathbb{E}(\|\Phi x\|_2^2)}{\mathbb{E}(\|r\|_2^2)} \approx \frac{S}{d}.$$
Similarly, by specializing Theorem 3 to the case of exactly $S$-sparse and noiseless signals we get - to the best of our knowledge - the first result establishing that locally it is possible to stably identify dictionaries from signals with sparsity levels beyond the spark of the generating dictionary. Indeed, even if some of the $S$-sparse signals could have representations in $\Phi$ that require fewer than $S$ atoms, there will still be a local maximum of the asymptotic criterion close to the original dictionary as long as the smallest coefficient of each signal is of the order $O(\mu)$, which in the most favourable case means that we can have $S \sim \mu^{-2}$ or $S \sim d$. The quality of this result is on a par with the best results for finding sparse approximations in a given dictionary, which say that on average Basis Pursuit or thresholding can find the correct sparse support even for signals with sparsity levels of the order of the ambient dimension (Tropp, 2008; Schnass and Vandergheynst, 2007). Next note that with the available tools it would be possible to also consider a signal model where a small fraction of the coefficients violates the decay condition $c_S(x) - c_{S+1}(x) \geq \beta$ and still have stability. However, we leave explorations in that direction to the interested reader and instead turn to the study of the criterion for a finite number of training samples.

5. Finite Sample Size Results

In this section we will transform the two asymptotic results from the last section into results for finite sample sizes, that is, we will study when $\Phi$ is close to a local maximum of
$$\max_{\Psi \in \mathcal{D}} \frac{1}{N} \sum_{n=1}^N \max_{|I|=S} \|\Psi_I^\star y_n\|_1, \qquad (12)$$
assuming that the $y_n$ follow either the noise-free or the noisy signal model. For convenience we will do the analysis for the normalized version (12) of the $S$-response criterion (4).

Theorem 4 Let $\Phi$ be a unit norm frame with frame constants $A \leq B$ and coherence $\mu$. Let the coefficients $x_n$ be drawn from a symmetric probability distribution $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$ and assume that the signals are generated as $y_n = \Phi x_n$. If there exists $\beta > 0$ such that for $c_1(x_n) \geq c_2(x_n) \geq \ldots \geq c_K(x_n) \geq 0$, the non-increasing rearrangement of the absolute values of the components of $x_n$, we have $c_S(x_n) - c_{S+1}(x_n) - 2\mu \|x_n\|_1 \geq \beta$ almost surely, and the target precision $\varepsilon$ satisfies $\varepsilon \lesssim \frac{\beta}{\sqrt{B}} \big( \log \frac{S K^2 \sqrt{B}}{\beta(\bar{c}_1 + \ldots + \bar{c}_S)} \big)^{-1}$, where $\bar{c}_i := \mathbb{E}_{x_n}(c_i(x_n))$, then except with probability
$$\exp\left( -\frac{N \varepsilon^2 (\bar{c}_1 + \ldots + \bar{c}_S)^2}{129\, S^2 K^2 B} + Kd \log \frac{\sqrt{B}}{\varepsilon\,(\bar{c}_1 + \ldots + \bar{c}_S)} \right)$$
there is a local maximum of (12), respectively (4), at some $\tilde{\Psi}$ satisfying $d(\tilde{\Psi}, \Phi) \leq \varepsilon + \frac{\varepsilon^2}{8K}$.

Theorem 5 Let $\Phi$ be a unit norm frame with frame constants $A \leq B$ and coherence $\mu$. Let the coefficients $x_n$ be drawn from a symmetric probability distribution $\nu$ on the unit sphere $S^{K-1} \subset \mathbb{R}^K$. Further let $r_n = (r_n(1), \ldots, r_n(d))$ be i.i.d. centred random subgaussian noise vectors with parameter $\rho$ and assume that the signals are generated according to the noisy signal model in (9). If there exists $\beta > 0$ such that for $c_1(x_n) \geq c_2(x_n) \geq \ldots \geq c_K(x_n) \geq 0$, the non-increasing rearrangement of the absolute values of the components of $x_n$, we have $c_S(x_n) - c_{S+1}(x_n) \geq \beta$ almost surely, and if the target precision $\varepsilon$, the noise parameter $\rho$ and the coherence $\mu$ satisfy
$$\varepsilon \leq \frac{\beta}{9\sqrt{4 + 9\log a}} \quad \text{and} \quad \max\{\mu, \rho\} \leq \frac{\beta}{\sqrt{72(\log a + \log\log a)}} \quad \text{for} \quad a = \frac{150 K^2 S (\sqrt{B} + 1)}{C_r \beta\,(\bar{c}_1 + \ldots + \bar{c}_S)}, \qquad (13)$$
where $C_r = \mathbb{E}_{r_n}\big((1 + \|r_n\|_2^2)^{-1/2}\big)$ and $\bar{c}_i := \mathbb{E}_{x_n}(c_i(x_n))$, then except with probability
$$\exp\left( -\frac{N \varepsilon_{\mu,\rho}^2 (\bar{c}_1 + \ldots + \bar{c}_S)^2}{513\, C_r^2 S^2 K^2 (\sqrt{B} + 1)^2} + Kd \log \frac{\sqrt{B} + 1}{\varepsilon_{\mu,\rho}\,(\bar{c}_1 + \ldots + \bar{c}_S)} \right),$$
where
$$\varepsilon_{\mu,\rho} = \max\left\{ \varepsilon,\ \frac{12 S K^2 \sqrt{B}}{C_r (\bar{c}_1 + \ldots + \bar{c}_S)} \exp\left( -\frac{\beta^2}{72 \max\{\mu^2, \rho^2\}} \right) \right\},$$
there is a local maximum of (12), respectively (4), at some $\tilde{\Psi}$ satisfying $d(\tilde{\Psi}, \Phi) \leq \varepsilon_{\mu,\rho} + \frac{\varepsilon_{\mu,\rho}^2}{16K}$.

Proof idea The proofs, which can be found in Appendix A.3, are based on three ingredients: a Lipschitz property of the mapping $\Psi \mapsto \frac{1}{N} \sum_{n=1}^N \max_{|I|=S} \|\Psi_I^\star y_n\|_1$ for the respective signal model, the concentration of the sum around its expectation for a $\delta$-net covering the space of all admissible dictionaries close to $\Phi$, and a triangle inequality argument to show that the finite sample response differences are close to the expected response differences and therefore larger than zero for all $\varepsilon \geq \varepsilon_{\mu,\rho}$.

To see better how the sample complexity behaves, we simplify the two theorems to the special case of noiseless, exactly $S$-sparse signals with balanced coefficients for various orders of magnitude of $S$. If we have $S = O(1)$, Theorem 4 implies that in order to have a maximum within radius $\varepsilon$ of the original dictionary $\Phi$ with probability at least $1 - e^{-Kd}$ we need $N = O(K^3 d\, \varepsilon^{-2})$ samples. Conversely, given $N$ training signals we can expect the distance between the generating dictionary and the closest local maximum to be of the order $O(K^2 N^{-1/2})$. If we assume a very incoherent dictionary where $\mu = O(d^{-1/2})$ and thus let the sparsity level be of the order $O(\sqrt{d})$, the sample complexity rises to $N = O(K^3 d^{3/2} \varepsilon^{-2})$. Taking into account that by (13) the target precision $\varepsilon$ needs to be of order $O(S^{-1/2}) = O(d^{-1/4})$, this means that we need at least $N = O(K^3 d^2)$ training signals, and once this initial level is reached, the error goes to zero at rate $N^{-1/2}$. For an even lower sparsity level, $S = O(d)$, again assuming a very incoherent dictionary, the sample complexity for target precision $\varepsilon$, implied now by Theorem 5, rises to $N = O(K^3 d^2 \varepsilon^{-2})$. In this regime, however, we cannot reach arbitrarily small errors by choosing $N$ large enough but only approach the asymptotic precision $\varepsilon_\mu = 16 K^2 \sqrt{SB} \exp(-d/(72 S))$. Following these promising theoretical results, in the next section we will finally see how the theory translates into practice.

6. Experiments

After showing that the optimization criterion in (4) is locally suitable for dictionary identification, in this section we present an iterative thresholding and K means type algorithm (ITKM) to actually find the local maxima of (4) and conduct some experiments to illustrate the theoretical results. We recall that given the input signals $Y = (y_1, \ldots, y_N)$ and a fixed sparsity parameter $S$ we want to solve
$$\max_{\Psi \in \mathcal{D}} \sum_n \max_{|I|=S} \|\Psi_I^\star y_n\|_1.$$
Using Lagrange multipliers for the unit norm constraints and noting that
$$\frac{\partial}{\partial \psi_k} \sum_n \max_{|I|=S} \|\Psi_I^\star y_n\|_1 = \sum_{n : k \in I(\Psi, y_n)} \operatorname{sign}(\langle \psi_k, y_n \rangle)\, y_n \quad \text{and} \quad \frac{\partial}{\partial \psi_k} \|\psi_k\|_2^2 = 2\psi_k,$$
where $I(\Psi, y_n) := \arg\max_{|I|=S} \|\Psi_I^\star y_n\|_1$, we arrive at the following update rule,
$$\psi_k^{new} = \lambda_k \sum_{n : k \in I(\Psi^{old}, y_n)} \operatorname{sign}(\langle \psi_k^{old}, y_n \rangle)\, y_n, \qquad (14)$$
where $\lambda_k$ is a scaling parameter ensuring that $\|\psi_k^{new}\|_2 = 1$. In practice, when we do not have an oracle giving us the generating dictionary as initialization, we also need to safeguard against bad initializations resulting in a zero-update $\psi_k^{new} = 0$.
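The update rule (14) translates directly into a few lines of code. The following sketch is one possible batch implementation of ITKM in Python/NumPy (the author's reference implementation is a Matlab toolbox, see the footnote below); it guards against zero updates by re-initializing the affected atoms at random, which is one of the options discussed next.

```python
import numpy as np

def itkm(Y, K, S, iters=100, Psi0=None, rng=None):
    """Iterative Thresholding and K signed Means for criterion (4)."""
    rng = rng or np.random.default_rng()
    d, N = Y.shape
    Psi = Psi0.copy() if Psi0 is not None else rng.standard_normal((d, K))
    Psi /= np.linalg.norm(Psi, axis=0)
    for _ in range(iters):
        R = Psi.T @ Y                                    # responses, K x N
        # thresholding: for each signal the S atoms with largest |response|
        idx = np.argpartition(np.abs(R), -S, axis=0)[-S:, :]
        mask = np.zeros((K, N), dtype=bool)
        mask[idx, np.arange(N)] = True
        # K signed means: sum the selected signals with the sign of their response
        Psi_new = Y @ (np.sign(R) * mask).T              # d x K, update rule (14)
        norms = np.linalg.norm(Psi_new, axis=0)
        dead = norms < 1e-12                             # atoms with zero update
        if dead.any():                                   # safeguard: random re-init
            Psi_new[:, dead] = rng.standard_normal((d, dead.sum()))
            norms[dead] = np.linalg.norm(Psi_new[:, dead], axis=0)
        Psi = Psi_new / norms                            # lambda_k normalization
    return Psi
```

Each iteration is dominated by the two matrix products with $Y$, matching the $O(dKN)$ per-iteration complexity discussed next.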
For example we can choose the zero-updated atom uniformly at random from the unit sphere or from the input signals. Note that finding the sets I(Ψold, yn) corresponds to N thresholding operations while updating according to (14) corresponds to K signed means. Altogether this means that each iteration of ITKM has computational complexity determined by the matrix multiplication Ψ Y , meaning O(d KN). This is light in comparison to K-SVD, which even when using thresholding instead of OMP as sparse approximation procedure still requires the calculation of the maximal singular vector of K on average d N K matrices. It is also more computationally efficient than the algorithm for local dictionary refinement, proposed by Arora et al. (2014), which is also based on averaging. Furthermore it is straightforward to derive online or parallelized versions of ITKM. In an online version for each newly arriving signal yn we calculate I(Ψold, yn) using thresholding and update ψnew k = ψnew k + sign( ψold k , yn )yn for k I(Ψold, yn). After N signals have been processed we renormalize the atoms ψnew k to have unit norm and set Ψold = Ψnew. Similarly, to parallelize we divide the training samples into m sets of size N m. Then on each node m we learn a dictionary Ψnew m according to (14) with λk = 1. We then calculate the sum of these dictionaries Ψnew 0 = P m Ψnew m and renormalize the atoms in Ψnew 0 to have unit norm. Armed with this very simple algorithm we will now conduct four experiments to illustrate our theoretical findings1. 6.1 ITKM vs. K-SVD In our first experiment we compare the local recovery error of ITKM and K-SVD for 3dimensional bases with increasing condition numbers. The bases are perturbations of the canonical basis Φ = (e1, e2, e3) with the vector v = (1, 1, 1). That is, Φt = (et 1, et 2, et 3), where et i = (ei + tv)/ (ei + tv) 2 and t varying from 0 to 0.5 in steps of 0.1, which corresponds to condition numbers κ(Φt) varying from 1 to 2.5. We generate N = 4096 approximately 1-sparse noiseless signals from the signal model described in Table 1 with S = 1, T = 2, ρ = 0 and b = 0.1/0.2 and run both ITKM and K-SVD with 1000 iterations, sparsity parameter S = 1 and the true dictionary (basis) as initialization. Figure 1(a) shows the recovery error d(Φt, Ψ) between the original dictionary and the output of the respective algorithm averaged over 10 runs. As predicted by the theoretical results on the corresponding underlying minimization principles, the recovery errors of ITKM and K-SVD are roughly the same for Φ0, which is an orthogonal basis and therefore tight. However, while for ITKM the recovery error stays 1. A Matlab penknife (mini-toolbox) for playing around with ITKM and reproducing the experiments can be found at http://homepage.uibk.ac.at/~c7021041/ITKM.zip. Signal Model Given the generating dictionary Φ our signal model further depends on four coefficient parameters, S - the effective sparsity or number of comparatively large coefficients, b - deciding the decay factor of these sparse coefficients, T - the total number of non-zero coefficients (T S) and ρ - the noise level. Given these parameters we choose a decay factor cb uniformly at random in the interval [1 b, 1]. We set ci = ci b/ S for 1 i S and ci = 0 for T < i K. If T = S we renormalize the sequence to have unit norm, while if T > S we choose the vector (c S+1, . . . , c T ) uniformly at random on the sphere of radius R, where R is chosen such that the resulting sequence c has unit norm. 
We then choose a permutation $p$ and a sign sequence $\sigma$ uniformly at random and set $y = \Phi c_{p,\sigma}$, respectively $y = (\Phi c_{p,\sigma} + r)/\sqrt{1 + \|r\|_2^2}$, where $r$ is a Gaussian noise vector with variance $\rho^2$ if $\rho > 0$.

Table 1: Signal Model

constantly low over all condition numbers, for K-SVD it increases with increasing condition number or non-tightness.

6.2 Recovery Error and Sample Size

The next experiment is designed to show how fast the maximizer $\tilde{\Psi}$ near the original dictionary $\Phi$ converges to $\Phi$ with increasing sample size $N$. The generating dictionaries consist of the canonical basis in $\mathbb{R}^d$ for $d = 4, 8, 16$ and the first $d/2$ elements of the Hadamard basis, and as such are not tight. For every set of parameters $d$, $S(T)$, $b$ we generate $N$ noiseless signals with $N$ varying from $2^7 = 128$ to $2^{14} = 16384$ and run ITKM with 1000 iterations, sparsity parameter $S$ equal to the coefficient parameter $S$ and the true dictionary as initialization. Figure 1(b) shows the recovery error $d(\Phi, \tilde{\Psi})$ between the original dictionary $\Phi$ and the output of ITKM $\tilde{\Psi}$, averaged over 10 runs. As predicted by Theorem 4 the recovery error decays as $N^{-1/2}$. However, the separation of the curves for $d = 4, 8, 16$ and almost exactly sparse signals ($b = 0.01$) by a factor around 2 instead of 4, as suggested by the estimate $\varepsilon \approx K^2 N^{-1/2}$, indicates that the cubic dependence of the sampling complexity on the number of atoms $K$ may be too pessimistic and could be lowered.

6.3 Stability of Recovery Error under Coherence and Noise

With the last two experiments we illustrate the stability of the maximization criterion under coherence and noise. As generating dictionaries we use again the canonical basis plus half Hadamard dictionaries described in the last experiment, which have coherence $\mu = d^{-1/2}$. To test the stability under coherence we use a large enough number of noiseless training signals, $N = 16384$, such that the distance between the local maximum of the criterion near the generating dictionary, that is the output of ITKM with oracle initialization, and the generating dictionary is mainly determined by the ratio between the gap size $\beta$ and the coherence.

Figure 1: (a) Local recovery error of K-SVD and ITKM for two different types of decaying coefficients and bases with varying condition numbers in $\mathbb{R}^3$. (b) Decay of the recovery error of ITKM with increasing number of training signals.

For each set of parameters $d$, $S(T)$ we create $N$ training signals with decreasing gap sizes $\beta$ by increasing $b$ from 0 to 0.1 in steps of 0.01 and run ITKM with oracle initialization, parameter $S$ and 1000 iterations. Figure 2(a) shows the recovery error $d(\Phi, \tilde{\Psi})$ between the original dictionary $\Phi$ and the output of ITKM $\tilde{\Psi}$, again averaged over 10 trials. Again the experiments reflect our theoretical results. For $d = 8, 16$ with $S = 1$ or $d = 16$ with $S = 2$ the gap size is large enough that over the whole range of parameters the recovery error stays constantly low at the level defined by the number of samples. Note that this is quite good, since for $b = 0.1$ we are already far beyond the gap size to coherence ratios for which the stable theoretical results hold.
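All experiments in this section draw their training signals from the model summarized in Table 1. For concreteness, the following sketch shows one possible implementation of that generator (the parameter names mirror Table 1; the concrete handling of the tail coefficients is our reading of the description and not the author's code).

```python
import numpy as np

def table1_signal(Phi, S, T, b, rho, rng):
    """Draw one training signal according to the model of Table 1."""
    d, K = Phi.shape
    cb = rng.uniform(1 - b, 1)                 # decay factor c_b in [1-b, 1]
    c = np.zeros(K)
    c[:S] = cb ** np.arange(1, S + 1) / np.sqrt(S)
    if T > S:
        # remaining T-S coefficients on a sphere of radius R such that ||c||_2 = 1
        tail = rng.standard_normal(T - S)
        R = np.sqrt(max(1.0 - np.sum(c[:S] ** 2), 0.0))
        c[S:T] = R * np.abs(tail) / np.linalg.norm(tail)
    c /= np.linalg.norm(c)                     # renormalize (relevant if T == S)
    # random permutation and random signs: x = c_{p,sigma}
    x = c[rng.permutation(K)] * rng.choice([-1.0, 1.0], size=K)
    y = Phi @ x
    if rho > 0:                                # Gaussian noise with variance rho^2
        r = rng.normal(0.0, rho, size=d)
        y = (y + r) / np.sqrt(1.0 + r @ r)
    return y
```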
On the other hand, for $d = 8$ with $S = 2$ or $d = 16$ with $S = 3$ the gap early on decreases enough to become the error determining factor, and so we see an increase in the recovery error as $b$ grows. Conversely, to test the stability under noise we use a large enough number of exactly sparse training signals, such that the recovery error will be mainly determined by the noise level. For each set of parameters $d$, $S(S)$, $b$ we create $N = 16384$ training signals with Gaussian noise of variance (noise level) $\rho^2$ going from 0 to 0.1 in steps of 0.01 and run ITKM with oracle initialization, parameter $S$ and 1000 iterations. Figure 2(b) shows the recovery error $d(\Phi, \tilde{\Psi})$ between the original dictionary $\Phi$ and the output of ITKM $\tilde{\Psi}$, this time averaged over 20 trials. The curves again correspond to the prediction of the theoretical results, that is, the recovery error stays at roughly the same level defined by the number of samples until the noise becomes large enough and then increases.

Figure 2: Increase of the recovery error with (a) decreasing ratio between coefficient gap and coherence and (b) increasing noise level.

What is maybe interesting to observe in both experiments is the dithering effect for $d = 16$ with $S = 3$, which is due to the special structure of the dictionary. Indeed, using almost equally sized, almost exactly sparse coefficients, it is possible to build signals using only the canonical basis that have almost the same response in only half the Hadamard basis, and the other way round. This indicates that slight perturbations of one with the other lead to even better responses and therefore a larger recovery error. After showing that the theoretical results translate into algorithmic practice, we finally turn to a discussion of our results in the context of existing work and point out directions of future research.

7. Discussion

We have introduced a new response maximization principle for dictionary learning and shown that it is locally suitable to identify a generating $\mu$-coherent dictionary from approximately $S$-sparse training signals to arbitrary precision as long as the sparsity level is of the order $O(\mu^{-1})$. We have also presented - to the best of our knowledge - the first results showing that stable dictionary identification is locally possible not only for signal to noise ratios of the order $O(\sqrt{d})$ but also for sparsity levels of the order $O(\mu^{-2})$. The derived sample complexity (omitting log factors) of $O(K^3 d\, \varepsilon^{-2})$ for signals with sparsity levels $S = O(1)$ is roughly the same as for the K-SVD criterion (Schnass, 2014) or the $\ell_1$-minimization criterion (Jenatton et al., 2014), but somewhat large compared to recently developed dictionary learning algorithms that have a sample complexity of $O(K^2)$ (Arora et al., 2014; Agarwal et al., 2014a) or $O(K \varepsilon^{-2})$ (Agarwal et al., 2014b). However, as the sparsity level approaches and goes beyond $\mu^{-1} \approx \sqrt{d}$, the derived sample complexity of $O(K^3 d^2 \varepsilon^{-2})$ compares quite favourably to the sample complexity of $O(K^{1/(4\eta)})$ for a sparsity level $d^{1/2 - \eta}$ as projected by Arora et al. (2014). Given that also our experimental results suggest that $O(K^3 d\, \varepsilon^{-2})$ is quite pessimistic, one future direction of research aims to lower the sample complexity.
In particular ongoing work suggests that for the ITKM algorithm a sample size of order K log K is enough to guarantee local recovery with high probability. Local Identification of Overcomplete Dictionaries Another strong point of the results is that the corresponding maximization algorithm ITKM (Iterative Thresholding and K signed Means) is locally successful, as demonstrated in several experiments, and computationally very efficient. The most complex step in each iteration is the matrix multiplication Φ Y of order O(d KN), which is even lighter than the iterative averaging algorithm described by Arora et al. (2014). However, the serious drawback is that ITKM is only a local algorithm and that all our results are only local. Also while for the K-SVD criterion and the ℓ1-minimization criterion there is reason to believe that all local minima might be equivalent, the response maximization principle has a lot of smaller local maxima, which is confirmed by preliminary experiments with random initializations. There ITKM fails but with grace, that is, it outputs local maximizers that have not all, but only most atoms in common with what seems to be the global maximizer near the generating dictionary. This behaviour is in strong contrast to the algorithms presented by Arora et al. (2014); Agarwal et al. (2014b), that have global success guarantees at a computational cost of the order O(d N2), and leads to several very important research directions. First we want to confirm that ITKM has a convergence radius of the order O(1/ S). This is suggested by the derived radius of the area on which the generating dictionary is the optimal maximizer as well as preliminary experiments. Alternatively, we could investigate how the results for the local iterative algorithms (Arora et al., 2014; Agarwal et al., 2014a) could be extended to larger sparsity levels and convergence radii using our techniques. The associated important question is how to extend the results for the algorithms presented by Arora et al. (2014); Agarwal et al. (2014b) to sparsity levels O(µ 2), if possible at lower cost than O(d N2). Given the conjectured size of the convergence radius for ITKM it would even be sufficient for the output of the algorithm to arrive at a dictionary within distance O(1/ S) to the generating dictionary, since the output could then be used as initialization for ITKM. A parallel approach for getting global identification results for sparsity levels O(µ 2), that we are currently pursuing, is to analyze a version of ITKM using residual instead of pure signal means, which in preliminary experiments exhibits global convergence properties. The last research directions we want to point out are concerned with the realism of the signal model. The fact that for an input sparsity S a gap of order O(µ 2) between the S and S + 1 largest coefficient is sufficient can be interpreted as a relaxed dependence of the algorithm on the sparsity parameter, since a gap of order µ 2 can occur quite frequently. To further decrease this sensitivity to the sparsity parameter in the criterion and the algorithm we would therefore like to extend our results to the case where we can only guarantee a gap of order O(µ 2) between the S largest and the S + T largest coefficient for some T > 1. Last but not least we would like to exactly reflect the practical situation, where we would normalize our training signals to equally weight their importance and analyze the unit norm signal model where y = Φx + r/ Φx + r 2. 
Acknowledgments This work was supported by the Austrian Science Fund (FWF) under Grant no. J3335 and improved thanks to the reviewers comments and suggestions. Thanks also go to the Computer Vision Laboratory of the University of Sassari, Italy, which provided the surroundings, where almost all of the presented work was done, and to Michael Dymond, who looked over the final manuscript with a native speaker s eye. Appendix A. Proofs We now provide the full proofs for all our results. A.1 Proof of Theorem 2 We first reformulate and prove the theorem for the simple case of a symmetric coefficient distribution based on one sequence and then use an integration argument to extend it to the general case. Proposition 6 Let Φ be a unit norm frame with frame constants A B and coherence µ. Let x RK be a random permutation of a sequence c, where c1 c2 c3 . . . c K 0 and c 2 = 1, provided with random signs, that is x = cp,σ with probability P(p, σ) = (2KK!) 1. Assume that the signals are generated as y = Φx. If c satisfies c S > c S+1 + 2µ c 1 then there is a local maximum of (5) at Φ. Moreover for Ψ = Φ we have Ey max|I|=S Ψ Iy 1 < Ey max|I|=S Φ Iy 1 as soon as d(Φ, Ψ) c S c S+1 2µ c 1 B (c S c S+1 2µ c 1)(c1+...+c S) . (15) Proof We start by evaluating the objective function at the original dictionary Φ. max |I|=S Φ Iy 1 max |I|=S Φ IΦcp,σ 1 i I | φi, Φcp,σ | To estimate the sum of the (in absolute value) largest S inner products, we first assume that p is fixed. Setting Ip = p 1 ({1, . . . S}) we have, | φi, Φcp,σ | = σicp(i) + X j =i σjcp(j) φi, φj c S µ c 1 i Ip c S+1 + µ c 1 i / Ip . Together with the condition that c S > c S+1 + 2µ c 1 these estimates ensure that the S maximal inner products in absolute value are attained at Ip and so we get for the expectation, max |I|=S Φ IΦcp,σ 1 = Ep Eσ Φ IpΦcp,σ 1 cp(i) + σi X j =i σjcp(j) φi, φj = c1 + . . . + c S. To compute the expectation for a perturbation of the original dictionary we use the following parameterization of all ε-perturbations Ψ of the original dictionary Φ. If d(Ψ, Φ) = ε then Local Identification of Overcomplete Dictionaries ψi φi 2 = εi with maxi εi = ε and we have zi with φi, zi = 0, zi 2 = 1 and αi := 1 ε2 i /2 and ωi := (ε2 i ε4 i /4) 1 2 , such that ψi = αiφi + ωizi. Expanding the expectation as before we get, max |I|=S Ψ Iy 1 max |I|=S Ψ IΦcp,σ 1 i I | ψi, Φcp,σ | The tried and tested strategy applied now is showing that for small perturbations and most sign patterns σ the maximal inner products are still attained by i Ip. We have i Ip : | ψi, Φcp,σ | αi(c S µ c 1) ωi| zi, Φcp,σ | i / Ip : | ψi, Φcp,σ | αi(c S+1 + µ c 1) + ωi| zi, Φcp,σ |. Using Hoeffding s inequality we can estimate the typical sizes of the terms | zi, Φcp,σ |, P(| zi, Φcp,σ | t) = P(| X j =i σjcp(j) zi, φj | > t) 2 P j =i c2 p(j) zi, φj 2 In case ωi = 0 or equivalently εi = 0, we set t = s/ωi to arrive at P(ωi| zi, Φcp,σ | s) 2 exp s2 where we have used that ω2 i = ε2 i ε4 i /4 ε2 i , while in case εi = 0 we trivially have that P(ωi| zi, Φcp,σ | s) = 0. Summarizing these findings we see that except with probability i:εi =0 exp s2 i Ip : | ψi, Φcp,σ | αi(c S µ c 1) s i / Ip : | ψi, Φcp,σ | αi(c S+1 + µ c 1) + s. This means that as long as mini Ip αi(c S µ c 1) s maxi/ Ip αi(c S+1 +µ c 1)+s, which is for instance implied by setting s := 1 2(c S c S+1 2µ c 1 ε2 2 ), we have max |I|=S Ψ IΦcp,σ 1 = Ψ IpΦcp,σ 1. We now use this result for the calculation of the expectation over σ in (16). For any permutation p we define the set i {σ s.t. ωi| zi, Φcp,σ | s}. 
We then have max |I|=S Ψ IΦcp,σ 1 σ Σp P(σ) max |I|=S Ψ IΦcp,σ 1 + X σ / Σp P(σ) Ψ IpΦcp,σ 1 σ Σp P(σ) max |I|=S Ψ IΦcp,σ 1 Ψ IpΦcp,σ 1 + Eσ Ψ IpΦcp,σ 1 . (17) To estimate the sum over Σp, note that we have the following bounds: | ψi, Φcp,σ | = |αi φi, Φcp,σ + ωi zi, Φcp,σ | 2 )| φi, Φcp,σ | + ε 2 )| φi, Φcp,σ | ε max |I|=S Ψ IΦcp,σ 1 (1 ε2 2 ) max |I|=S Φ IΦcp,σ 1 + S ε 2 ) Φ IpΦcp,σ 1 + S ε Ψ IpΦcp,σ 1 (1 ε2 2 ) Φ IpΦcp,σ 1 S ε Substituting these estimates into (17) we get max |I|=S Ψ IΦcp,σ 1 σ Σp P(σ) 2εS B + Eσ Ψ IpΦcp,σ 1 B + Eσ Ψ IpΦcp,σ 1 . Next we calculate Eσ Ψ IpΦcp,σ 1 : Eσ Ψ IpΦcp,σ 1 = Eσ i Ip | ψi, Φcp,σ | αicp(i) + σi αiφi + ωizi, X j =i σjcp(j)φj i Ip αicp(i), (18) where have used that ε ((1 ε2 2 )c S µ c 1)/ B guarantees the expression within absolute values in (18) to always be positive. Collecting all these results we arrive at the following estimate for the value of the objective function at Ψ: max |I|=S Ψ Iy 1 max |I|=S Ψ IΦcp,σ 1 i:εi =0 exp (c S c S+1 2µ c 1 ε2 i Ip αicp(i) i:εi =0 exp (c S c S+1 2µ c 1 ε2 + c1 + . . . + c S Local Identification of Overcomplete Dictionaries Finally we are able to compare the expectation at the original dictionary to that at an ε-perturbation. Remembering that αi = 1 ε2 i 2 , we get max |I|=S Φ Iy 1 max |I|=S Ψ Iy 1 c1 + . . . + c S i:εi =0 exp (c S c S+1 2µ c 1 ε2 ε2 c1 + . . . + c S (c S c S+1 2µ c 1 ε2 Thus to have a local maximum at the original dictionary we need that B c1 + . . . + c S exp (c S c S+1 2µ c 1 ε2 and all that remains to be shown is that this is implied by (15). Since K 2, (15) implies that ε2 2 < c S c S+1 2µ c 1 2(1+3 log 96)2 c S c S+1 2µ c 1 100 and it suffices to show that (15) further implies B c1 + . . . + c S exp (c S c S+1 2µ c 1)2 992 Applying Lemma A.3 by Schnass (2014), which says that for a, b, ε > 0, 1 + 16 log(a b) implies that a exp b2 to the situation at hand, where a = 8SK2 B c1+...+c S and b = (c S c S+1 2µ c 1) 99 8 100 , we get that (19) is ensured by ε < c S c S+1 2µ c 1 99 e1/16SK2 B (c S c S+1 2µ c 1)(c1+...+c S) which simplifies to (15). Proof [of Theorem 2] Using the symmetry of ν, our strategy is to reduce the general to the simple coefficient model. Let c denote the mapping that assigns to each x SK 1 the non increasing rearrangement of the absolute values of its components, that is ci(x) = |xp(i)| for a permutation p such that c1(x) c2(x) . . . c K(x) 0. Then the mapping c together with the probability measure ν on SK 1 induces a pull-back probability measure νc on c(SK 1), by νc(Ω) := ν(c 1(Ω)) for any measurable set Ω c(SK 1). With the help of this new measure we can rewrite the expectations we need to calculate as max |I|=S Φ Iy 1 max |I|=S Φ IΦx 1 x max |I|=S Φ IΦx 1dν = Z max |I|=S Φ Φcp,σ(x) 1 The expectation inside the integral should seem familiar. Indeed we have calculated it already in the proof of Proposition 6 for c(x) a fixed decaying sequence satisfying c S(x) > c S+1(x) + 2µ x 1. Since this property is satisfied almost surely we have max |I|=S Φ Iy 1 max |I|=S Φ Φcp,σ(x) 1 c(x) c1(x) + . . . + c S(x)dνc := c1 + . . . + c S. For the expectation of a perturbed dictionary Ψ we get in analogy max |I|=S Ψ Iy 1 max |I|=S Ψ Φcp,σ(x) 1 c(x) η(x) + (c1(x) + . . . + c S(x)) 1 η(x) := 4εS i:εi =0 exp (c S(x) c S+1(x) 2µ x 1 ε2 Since c S(x) c S+1(x) 2µ x 1 β almost surely we have i:εi =0 exp almost surely and therefore max |I|=S Ψ Iy 1 ηβ + ( c1 + . . . 
+ c S) 1 Following the same argument as in the proof of Proposition 6 we see that Ey max|I|=S Φ Iy 1 > Ey max|I|=S Ψ Iy 1 as soon as B β( c1+...+ c S) . Local Identification of Overcomplete Dictionaries A.2 Proof of Theorem 3 Again we first reformulate and prove the theorem for the case of a symmetric coefficient distribution based on one sequence and then extend it with an integration argument. Proposition 7 Let Φ be a unit norm frame with frame constants A B and coherence µ. Let x RK be a random permutation of a sequence c, where c1 c2 c3 . . . c K 0 and c 2 = 1, provided with random signs, that is x = cp,σ with probability P(p, σ) = (2KK!) 1. Further let r = (r(1) . . . r(d)) be a centred random subgaussian noise-vector with parameter ρ and assume that the signals are generated according to the noisy signal model in (9). If we have max{µ, ρ} c S c S+1 p 72(log a + log log a) for a = 112K2S( B + 1) Cr(c S c S+1)(c1 + . . . + c S), (20) where Cr = Er (1 + r 2 2) 1/2 , then there is a local maximum of (5) at Ψ satisfying d( Ψ, Φ) 12SK2 B Cr(c1 + . . . + c S) exp (c S c S+1)2 72 max{µ2, ρ2} Proof To prove the proposition we digress from the conventional scheme of first calculating the expectation of our objective function for both the original and a perturbed dictionary and then comparing and instead bound the difference of the expectations directly. max |I|=S Φ Iy 1 max |I|=S Ψ Iy 1 Φ I(Φcp,σ + r) p 1 max |I|=S Ψ I(Φcp,σ + r) p max|I|=S Φ I(Φcp,σ + r) 1 max|I|=S Ψ I(Φcp,σ + r) 1 p := Ep,σ,r( p,σ,r) Again our strategy is to show that for a fixed p for most σ and r the maximal response of both the original dictionary and the perturbation is attained at Ip. The expressions we therefore need to lower (upper) bound for i Ip (i / Ip) are | φi, Φcp,σ + r | = σicp(i) + X j =i σjcp(j) φi, φj + φi, r , | ψi, Φcp,σ + r | = αiσicp(i) + αi X j =i σjcp(j) φi, φj + ωi zi, Φcp,σ + ψi, r . However, instead of using a worst case estimate for the gap between the responses of the original dictionary inside and outside Ip, we now make use of the fact that for most sign sequences we have a gap size of order c S c S+1. This means that as soon as | P j =i σjcp(j) φi, φj |, ωi| zi, Φcp,σ | and the noise related terms | φi, r | and | ψi, r | are of order (c S c S+1) the maximal response of both the original dictionary and the perturbation is attained at Ip. In particular, defining the sets j =i σjcp(j) φi, φj c S c S+1 6 or ωi| zi, Φcp,σ | c S c S+1 3ε2 for a fixed permutation p and r s.t. | φi, r | c S c S+1 3 or | ψi, r | c S c S+1 we see that both maxima are attained at Ip as long as σ / Σp and r / R. Using Hoeffding s inequality we get that j =i σjcp(j) φi, φj > t 2 P j =i c2 p(j)| φi, φj |2 while from the proof of Proposition 6 we know that for εi = 0 we have P(ωi| zi, Φcp,σ | s) 2 exp s2 . Setting t = (c S c S+1)/6, s = (c S c S+1 3ε2 2 )/6 and using a union bound then leads to P(Σp) 2K exp c S c S+1 3ε2 + 2K exp (c S c S+1)2 Since the r(i) are subgaussian with parameter ρ we have for any v = (v1 . . . vd) and t 0, P(| v, r | t) exp t2 2ρ2 v 2 2 (Vershynin, 2012). Taking a union bound over all φi, ψi with the corresponding choice for t then leads to the estimate P(R) 2K exp c S c S+1 2 c S c S+1 2 We now split the expectations over the sign and noise patterns for a fixed p to get EσEr( p,σ,r) = Eσ r/ R p,σ,rdνr r R p,σ,rdνr σ / Σp P(σ) Z r/ R p,σ,rdνr + X σ Σp P(σ) Z r/ R p,σ,rdνr r R p,σ,rdνr Next note that Σp is symmetric in the sense that we either have (σ1, . . . , σi, . . . , σK) Σp or (σ1, . . . , σi, . 
. . , σK) / Σp. Thus we get for the first term in (23), σ / Σp P(σ) Z r/ R p,σ,rdνr = Z σ / Σp P(σ) Φ Ip(Φcp,σ + r) 1 Ψ Ip(Φcp,σ + r) 1 p σ / Σp P(σ) P i Ip cp(i) ε2 i 2 p Local Identification of Overcomplete Dictionaries To bound the last two terms in (23) we first find an upper bound formax|I|=S Ψ I(Φcp,σ + r) 1: max |I|=S Ψ I(Φcp,σ + r) 1 = max |I|=S i I | αiφi + ωizi, Φcp,σ + r) | 1 ε2 i 2 | φi, Φcp,σ + r) | + εi Φcp,σ + r 2 2 | φi, Φcp,σ + r) | + ε 2 max |I|=S Φ I(Φcp,σ + r) 1 + εS This then leads to the following lower bound for p,σ,r: p,σ,r (1 + r 2 2) 1/2 max |I|=S Φ I(Φcp,σ + r) 1 ε2 (1 + r 2 2) 1/2 Φ Ip(Φcp,σ + r) 1 ε2 Using again the symmetry of Σp we have σ Σp P(σ) Z r/ R p,σ,rdνr Z σ Σp P(σ) Φ Ip(Φcp,σ + r) 1 ε2 1 + r 2 2 dνr P i Ip cp(i) ε2 1 + r 2 2 εS P i Ip cp(i) ε2 i 2 p 1 + r 2 2 εS and similarly r R p,σ,rdνr Φ Ip(Φcp,σ + r) 1 ε2 P i Ip cp(i) ε2 i 2 p 1 + r 2 2 εS Resubstituting into (23) we get EσEr( p,σ,r) Z P i Ip cp(i) ε2 i 2 p 1 + r 2 2 εS P i Ip cp(i) ε2 i 2 p 1 + r 2 2 εS σ / Σp P(σ) P i Ip cp(i) ε2 i 2 p P i Ip cp(i) ε2 i 2 p 1 + r 2 2 dνr εS B + 1 P(R) + P(Σp) . (24) Taking the expectation over the permutations then yields Ep,σ,r( p,σ,r) Er Ep P i Ip cp(i) ε2 i 2 p B + 1 P(R) + Ep P(Σp) ! c1 + . . . + c S B + 1 P(R) + Ep P(Σp) . Using the probability estimates from (21)/(22) we see that Ep,σ,r( p,σ,r) > 0 is implied by where we have used the abbreviations γ = c1 + . . . + c S, β = c S c S+1 and Cr = Er (1 + r 2 2) 1/2 . We now proceed by splitting the above condition. We define εmin by asking that 72 max{µ2, ρ2} and εmax implicitly by asking that Following the line of argument in the proof of Proposition 6 we see that the above condition is guaranteed as soon as log 112K2S( B+1) Crβγ := εmax. Local Identification of Overcomplete Dictionaries The statement follows from making sure that εmin < εmax. Proof [of Theorem 3] Using the pull-back probability measure νc we can write max |I|=S Φ Iy 1 max |I|=S Ψ Iy 1 c(x) Ep,σ,r p,σ,r,c(x) dνc, where p,σ,r,c(x) is defined analogue to p,σ,r in the last proof, that is replacing c by c(x). The statement follows from employing the lower estimate for Ep,σ,r p,σ,r,c(x) from (24) and replacing c1 +. . .+c S by c1 +. . .+ c S resp. c S c S+1 by its lower bound β in the proof of Proposition 7. A.3 Proof of Theorems 4 and 5 Since the proofs of Theorems 4 and 5 are conceptually equivalent we will combine them into one and just split the argument for the inevitable juggling of constants. Proof As outlined in the proof idea we need a Lipschitz property for the mapping Ψ 1 N PN n=1 max|I|=S Ψ Iyn 1 for both signal models, the concentration of the sum around its expectation for a δ net covering the space of all admissible dictionaries close to Φ and a triangle inequality argument to get to the final statement. To show the Lipschitz property we use a reverse triangle inequality: max |I|=S Ψ Iyn 1 max |I|=S Ψ Iyn 1 = max |I|=S Ψ Iyn ( Ψ I Ψ I)yn 1 max |I|=S Ψ Iyn 1 max |I|=S ( Ψ I Ψ I)yn 1 S max k ψk ψk 2 yn 2 Note that for the noise-free signal model we can replace B in the last expression. By averaging over n we get that the mapping in question is Lipschitz with constant S B + 1 in the noisy and S B in the noise-free case, that is 1 N n=1 max |I|=S Ψ Iyn 1 1 n=1 max |I|=S Ψ Iyn 1 To show that the averaged sums concentrate around their expectations we use our favourite tool Hoeffding s inequality. Set Xn = max|I|=S Φ Iyn 1 max|I|=S Ψ Iyn 1, then we have |Xn| εS B + 1 , resp. 
To show that the averaged sums concentrate around their expectations we use our favourite tool, Hoeffding's inequality. Set $X_n = \max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \max_{|I|=S}\|\Psi_I^\star y_n\|_1$; then we have $|X_n| \leq \varepsilon S\sqrt{B+1}$, resp. $|X_n| \leq \varepsilon S\sqrt{B}$ in the noise-free case, and get the estimate

$$P\bigg(\frac{1}{N}\sum_{n=1}^N\Big[\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \max_{|I|=S}\|\Psi_I^\star y_n\|_1\Big] \leq E\Big[\max_{|I|=S}\|\Phi_I^\star y_1\|_1 - \max_{|I|=S}\|\Psi_I^\star y_1\|_1\Big] - t\bigg) \leq \exp\Big(-\frac{N t^2}{2\,\varepsilon^2 S^2 (B+1)}\Big),$$

again with $B+1$ replaced by $B$ in the noise-free case.

Next we need to choose a $\delta$-net for all perturbations $\Psi$ with $d(\Phi,\Psi) \leq \varepsilon_{\max}$, that is a finite set of perturbations $\mathcal{N}$ such that for every $\Psi$ we can find $\tilde\Psi \in \mathcal{N}$ with $d(\Psi,\tilde\Psi) \leq \delta$. Recalling the parameterization of all $\varepsilon$-perturbations from the proof of Proposition 6, we see that the space we need to cover is included in the product of $K$ balls with radius $\varepsilon_{\max}$ in dimension $d$. For instance from Lemma 2 by Vershynin (2012) we know that for the $d$-dimensional ball of radius $\varepsilon_{\max}$ we can find a $\delta$-net $\mathcal{N}_d$ satisfying $|\mathcal{N}_d| \leq (1 + 2\varepsilon_{\max}/\delta)^d$, so for our space of $\varepsilon$-perturbations we can find a $\delta$-net $\mathcal{N}$ satisfying

$$|\mathcal{N}| \leq \big(1 + 2\varepsilon_{\max}/\delta\big)^{Kd}.$$

Taking a union bound we can now estimate the probability that we have concentration for all perturbations in the net as

$$P\bigg(\exists\,\tilde\Psi\in\mathcal{N}: \frac{1}{N}\sum_{n=1}^N\Big[\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \max_{|I|=S}\|\tilde\Psi_I^\star y_n\|_1\Big] \leq E\Big[\max_{|I|=S}\|\Phi_I^\star y_1\|_1 - \max_{|I|=S}\|\tilde\Psi_I^\star y_1\|_1\Big] - t\bigg) \leq |\mathcal{N}|\,\exp\Big(-\frac{Nt^2}{2\,\varepsilon_{\max}^2 S^2(B+1)}\Big).$$

Finally we are ready for the triangle inequality argument. For any $\Psi$ with $d(\Psi,\Phi) = \varepsilon \leq \varepsilon_{\max}$ we can find $\tilde\Psi \in \mathcal{N}$ with $d(\tilde\Psi,\Psi) \leq \delta$ and $d(\Phi,\tilde\Psi) = \tilde\varepsilon$, and therefore get

$$\frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Psi_I^\star y_n\|_1 = \bigg(\frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - E\max_{|I|=S}\|\Phi_I^\star y_1\|_1\bigg) + \bigg(E\max_{|I|=S}\|\Phi_I^\star y_1\|_1 - E\max_{|I|=S}\|\tilde\Psi_I^\star y_1\|_1\bigg)$$
$$+ \bigg(E\max_{|I|=S}\|\tilde\Psi_I^\star y_1\|_1 - \frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\tilde\Psi_I^\star y_n\|_1\bigg) + \bigg(\frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\tilde\Psi_I^\star y_n\|_1 - \frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Psi_I^\star y_n\|_1\bigg).$$

On the event that the concentration holds for all elements of the net, the first and third bracket are each bounded below by $-t$, while by the Lipschitz property the last bracket is bounded below by $-\delta S\sqrt{B+1}$, resp. $-\delta S\sqrt{B}$. Depending on the signal model we now have to substitute the values for the asymptotic differences $E\max_{|I|=S}\|\Phi_I^\star y_1\|_1 - E\max_{|I|=S}\|\tilde\Psi_I^\star y_1\|_1$ calculated in the previous proofs.

Under the conditions given in Theorem 4 we have

$$\frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Psi_I^\star y_n\|_1 \geq \frac{\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)}{2K} - \tilde\varepsilon S\sqrt{B}\,E_pP(\Sigma_p) - 2t - \delta S\sqrt{B}. \qquad (25)$$

To make sure that the above expression is larger than zero, we split it into two conditions. The first condition, $\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)/(4K) \geq \tilde\varepsilon S\sqrt{B}\,E_pP(\Sigma_p)$, is satisfied as soon as, as in the asymptotic analysis, $\varepsilon$ lies between the asymptotic recovery precision and the threshold determined by $\sqrt{B}$, $\beta$ and $(\bar c_1+\dots+\bar c_S)$. To concretise the second condition, $\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)/(4K) \geq 2t + \delta S\sqrt{B}$, we choose $t = \varepsilon^2(\bar c_1+\dots+\bar c_S)/(16K)$ and $\delta = \varepsilon^2(\bar c_1+\dots+\bar c_S)/(16KS\sqrt{B})$ to arrive at the requirement $\tilde\varepsilon^2 \geq \tfrac{3}{4}\varepsilon^2$. Given that $\tilde\varepsilon$ differs at most by $\delta$ from $\varepsilon$, we see that (25) is larger than zero except with probability

$$\exp\bigg(-\frac{N\,\varepsilon^4(\bar c_1+\dots+\bar c_S)^2}{128\,\varepsilon_{\max}^2 S^2K^2B} + Kd\log\Big(1 + \frac{32\,\varepsilon_{\max} KS\sqrt{B}}{\varepsilon^2(\bar c_1+\dots+\bar c_S)}\Big)\bigg),$$

as long as

$$\varepsilon_{\min} := \varepsilon_\mu + \frac{\varepsilon_\mu^2}{8K} \leq \varepsilon \leq \varepsilon_{\max}, \qquad (26)$$

where $\varepsilon_{\max}$ obeys the same bound in terms of $\beta$, $\sqrt{B}$ and $\beta(\bar c_1+\dots+\bar c_S)$ as in the asymptotic analysis. While for the asymptotic results we tried to make $\varepsilon_{\max}$ as large as possible to indicate how large the basin of attraction could be, for the finite sample size results we want it as small as possible in order to keep the sample complexity small, and therefore choose $\varepsilon_{\max} = \varepsilon_{\min}$. The statement then follows from making sure that the right most inequality in (26) is satisfied and simplifications.

In case of the noisy signal model, that is under the conditions given in Theorem 5, we have

$$\frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Phi_I^\star y_n\|_1 - \frac{1}{N}\sum_{n=1}^N\max_{|I|=S}\|\Psi_I^\star y_n\|_1 \geq \frac{C_r\,\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)}{2K} - 2t - \delta S\sqrt{B+1} - 2\tilde\varepsilon SK\sqrt{B+1}\Big[\exp\Big(-\frac{\beta^2}{72\rho^2}\Big)+\exp\Big(-\frac{\beta^2}{72\mu^2}\Big)+\exp\Big(-\frac{(\beta-\tfrac{3}{2}\tilde\varepsilon^2)^2}{72\,\tilde\varepsilon^2}\Big)\Big]. \qquad (27)$$

Splitting equally gives us four conditions, comparing $C_r\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)/(8K)$ in turn to $2t$, to $\delta S\sqrt{B+1}$, to the two $\tilde\varepsilon$-independent exponential terms, and to the remaining $\tilde\varepsilon$-dependent exponential term,

$$\frac{C_r\,\tilde\varepsilon^2(\bar c_1+\dots+\bar c_S)}{8K} \geq 2\tilde\varepsilon SK\sqrt{B+1}\,\exp\Big(-\frac{(\beta-\tfrac{3}{2}\tilde\varepsilon^2)^2}{72\,\tilde\varepsilon^2}\Big). \qquad (28)$$

Choosing $t = \varepsilon_{\mu,\rho}^2(\bar c_1+\dots+\bar c_S)/(32C_rK)$ and $\delta = \varepsilon_{\mu,\rho}^2(\bar c_1+\dots+\bar c_S)/(16KS\sqrt{B+1})$ we can merge the first three conditions to $\varepsilon \geq \varepsilon_{\mu,\rho}$, while, following the usual argument, Condition (28) is satisfied once $\varepsilon$ stays below the threshold determined by $\beta$ and $C_r(\bar c_1+\dots+\bar c_S)$ from the proof of Proposition 7. Given that $\tilde\varepsilon$ differs at most by $\delta$ from $\varepsilon$, we see that (27) is larger than zero except with probability

$$\exp\bigg(-\frac{N\,\varepsilon_{\mu,\rho}^4(\bar c_1+\dots+\bar c_S)^2}{512\,\varepsilon_{\max}^2 C_r^2 S^2K^2(B+1)} + Kd\log\Big(1 + \frac{32\,\varepsilon_{\max} KS\sqrt{B+1}}{\varepsilon_{\mu,\rho}^2(\bar c_1+\dots+\bar c_S)}\Big)\bigg),$$

as long as

$$\varepsilon_{\min} := \varepsilon_{\mu,\rho} + \frac{\varepsilon_{\mu,\rho}^2}{16K} \leq \varepsilon \leq \varepsilon_{\max}. \qquad (29)$$

Again the statement follows from choosing $\varepsilon_{\max} = \varepsilon_{\min}$, making sure that the right most inequality in (29) is satisfied and simplifications.
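To see how the covering number and the Hoeffding estimate combine into a sample size requirement, the following sketch solves the union bound for $N$. It only mirrors the structure of the argument; the function name, the placeholder parameter values and the constants are illustrative assumptions and not the ones appearing in Theorems 4 and 5.

```python
import numpy as np

def sample_size_bound(d, K, S, B, eps_max, delta, t, failure_prob=1e-3):
    """Smallest N such that |net| * exp(-N t^2 / (2 eps_max^2 S^2 (B + 1))) <= failure_prob,
    using the covering number bound |net| <= (1 + 2 eps_max / delta)^(K d).
    This only mirrors the structure of the union bound; the constants in the
    paper's Theorems 4 and 5 differ."""
    log_net = K * d * np.log(1.0 + 2.0 * eps_max / delta)
    rate = t**2 / (2.0 * eps_max**2 * S**2 * (B + 1.0))
    return int(np.ceil((log_net + np.log(1.0 / failure_prob)) / rate))

# Placeholder parameter values, not taken from the paper's experiments.
print(sample_size_bound(d=64, K=128, S=4, B=2.0, eps_max=0.1, delta=1e-3, t=1e-2))
```

Since $\log|\mathcal{N}|$ scales like $Kd$ and the deviation $t$ must shrink with the target precision, the required $N$ grows with $Kd$ and inversely with $t^2$, which is the mechanism behind the $O(K^3 d S \varepsilon^{-2})$ sample complexity discussed in the main text.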
References

A. Agarwal, A. Anandkumar, P. Jain, P. Netrapalli, and R. Tandon. Learning sparsely used overcomplete dictionaries via alternating minimization. In COLT 2014 (arXiv:1310.7991), 2014a.

A. Agarwal, A. Anandkumar, and P. Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. In COLT 2014 (arXiv:1309.1952), 2014b.

M. Aharon, M. Elad, and A.M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.

S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In COLT 2014 (arXiv:1308.6273), 2014.

E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.

O. Christensen. An Introduction to Frames and Riesz Bases. Birkhäuser, 2003.

D.L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

D.L. Donoho, M. Elad, and V.N. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, January 2006.

D.J. Field and B.A. Olshausen. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.

Q. Geng, H. Wang, and J. Wright. On the local correctness of ℓ1-minimization for dictionary learning. arXiv:1101.5672, 2011.

P. Georgiev, F.J. Theis, and A. Cichocki. Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Transactions on Neural Networks, 16(4):992–996, 2005.

A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Springer, 1992.

R. Gribonval and K. Schnass. Dictionary identifiability - sparse matrix-factorisation via ℓ1-minimisation. IEEE Transactions on Information Theory, 56(7):3523–3539, July 2010.

R. Gribonval, R. Jenatton, F. Bach, M. Kleinsteuber, and M. Seibert. Sample complexity of dictionary learning and other matrix factorizations. arXiv:1312.3790, 2013.

D. Hsu, S.M. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability (arXiv:1110.2842), 17(14), 2012.

R. Jenatton, F. Bach, and R. Gribonval. Sparse and spurious: dictionary learning with noise and outliers. arXiv:1407.5155, 2014.

K. Kreutz-Delgado and B.D. Rao. FOCUSS-based dictionary learning algorithms. In SPIE 4119, 2000.

K. Kreutz-Delgado, J.F. Murray, B.D. Rao, K. Engan, T. Lee, and T.J. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Computations, 15(2):349–396, 2003.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.

A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, 2010.

N.A. Mehta and A.G. Gray. On the sample complexity of predictive sparse coding. arXiv:1202.4050, 2012.

M.D. Plumbley. Dictionary learning for ℓ1-exact sparse coding. In M.E. Davies, C.J. James, and S.A. Abdallah, editors, International Conference on Independent Component Analysis and Signal Separation, volume 4666, pages 406–413. Springer, 2007.

R. Rubinstein, A. Bruckstein, and M. Elad. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6):1045–1057, 2010.
K. Schnass. On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD. Applied and Computational Harmonic Analysis, 37(3):464–491, 2014.

K. Schnass and P. Vandergheynst. Average performance analysis for thresholding. IEEE Signal Processing Letters, 14(11):828–831, 2007.

K. Schnass and P. Vandergheynst. Dictionary preconditioning for greedy algorithms. IEEE Transactions on Signal Processing, 56(5):1994–2002, 2008.

K. Skretting and K. Engan. Recursive least squares dictionary learning algorithm. IEEE Transactions on Signal Processing, 58(4):2121–2130, April 2010.

D. Spielman, H. Wang, and J. Wright. Exact recovery of sparsely-used dictionaries. In COLT 2012 (arXiv:1206.5882), 2012.

J.A. Tropp. On the conditioning of random subdictionaries. Applied and Computational Harmonic Analysis, 25(1):1–24, 2008.

D. Vainsencher, S. Mannor, and A.M. Bruckstein. The sample complexity of dictionary learning. Journal of Machine Learning Research, 12:3259–3281, 2011.

R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, chapter 5. Cambridge University Press, 2012.

M. Yaghoobi, T. Blumensath, and M.E. Davies. Dictionary learning for sparse approximations with the majorization method. IEEE Transactions on Signal Processing, 57(6):2178–2191, June 2009.

M. Zibulevsky and B.A. Pearlmutter. Blind source separation by sparse decomposition in a signal dictionary. Neural Computations, 13(4):863–882, 2001.