# Dictionary Learning Based on Sparse Distribution Tomography

Pedram Pad*1, Farnood Salehi*2, Elisa Celis2, Patrick Thiran2, Michael Unser1

We propose a new statistical dictionary learning algorithm for sparse signals that is based on an α-stable innovation model. The parameters of the underlying model, that is, the atoms of the dictionary, the sparsity index α, and the dispersion of the transform-domain coefficients, are recovered using a new type of probability distribution tomography. Specifically, we drive our estimator with a series of random projections of the data, which results in an efficient algorithm. Moreover, since the projections are achieved using linear combinations, we can invoke the generalized central limit theorem to justify the use of our method for sparse signals that are not necessarily α-stable. We evaluate our algorithm by performing two types of experiments: image inpainting and image denoising. In both cases, we find that our approach is competitive with state-of-the-art dictionary learning techniques. Beyond the algorithm itself, two aspects of this study are interesting in their own right. The first is our statistical formulation of the problem, which unifies the topics of dictionary learning and independent component analysis. The second is a generalization of a classical theorem about isometries of ℓp-norms that constitutes the foundation of our approach.

## 1. Introduction

The problem of finding the mixing matrix A from a set of observation vectors y in the model

y = Ax    (1)

is only solvable if one can benefit from strong hypotheses on the signal vector x. For instance, one may assume that the entries of x are statistically independent, which results in a class of methods referred to as independent component analysis (ICA) (Hyvärinen et al., 2004). A more recent trend is to assume that the vector x is sparse, so that the recovery can be recast as a deterministic dictionary learning problem, the prototypical example being sparse component analysis (SCA) (Gribonval & Lesage, 2006; Aharon et al., 2006; Spielman et al., 2012). Extensive research has been conducted on these problems in the past three decades.

*Equal contribution. 1Biomedical Imaging Group, EPFL, Lausanne, Switzerland. 2Computer Communications and Applications Laboratory 3, EPFL, Lausanne, Switzerland. Correspondence to: Pedram Pad. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Prior work: In the literature, ICA precedes SCA and can be traced back to (Herault & Jutten, 1986). In fact, ICA constitutes the non-Gaussian generalization of the much older principal component analysis (PCA), which is widely used in classical signal processing. ICA is usually formalized as an optimization problem involving a cost function that measures the independence of the estimated xi (i.e., the entries of the vector x). A common measure of independence, which is inspired by information theory, is the mutual information of the entries of x. However, due to its computational complexity, other measures such as the kurtosis, which measures the non-Gaussianity of the components, are often used (Hyvärinen & Oja, 2000; Naik & Kumar, 2011) (except in special cases such as the analysis of stable AR(1) processes by (Pad & Unser, 2015)).
The main drawback of ICA is that the system (1) needs to be determined; i.e., A should be square. Otherwise, the complexity is so high that the methods can only be implemented for small problems (Lathauwer et al., 2007; Lathauwer & Castaing, 2008). SCA, on the other hand, is usually achieved by putting constraints on the sparsity of the representation or by optimizing a sparsity-promoting cost function. Thanks to the emergence of very efficient algorithms, SCA has found wide use in different applications (see (Mairal et al., 2010; Marvasti et al., 2012)). The underlying framework for SCA is deterministic; this is the primary difference with ICA, which aims to decouple signals that are realizations of stochastic processes.

α-stable distributions: In this paper, we aim to achieve the best of both worlds: the use of a statistical formulation in the spirit of ICA with a restriction to a parametric class of stochastic models that is well adapted to the notion of sparsity. Specifically, we assume that the entries of the vector x are random variables that are i.i.d. symmetric-α-stable.

Figure 1. Illustration of the effect of α on the sparsity of the signal: two realizations of i.i.d. SαS signals, with (a) α = 1 and (b) α = 1.4.

The family of α-stable distributions is a generalization of the Gaussian probability density function (PDF). Since α-stability is preserved through linear transformation, this class of models has a central position in the study of stochastic processes (Samoradnitsky & Taqqu, 1994; Nikias & Shao, 1995; Shao & Nikias, 1993). The family is parametrized by α ∈ (0, 2], which controls the rate of decay of the distribution. The extreme case of α = 2 corresponds to the Gaussian distribution, the only non-sparse member of the family. By contrast, the other members of the SαS family, for α < 2, are heavy-tailed with unbounded variance. This property implies that an i.i.d. sequence of such random variables generates a sparse signal (Amini et al., 2011; Gribonval et al., 2012). By decreasing α, the distribution becomes more heavy-tailed and thus the signal becomes more sparse (the effect of α is illustrated in Figure 1). This class of random variables has also been widely used in practice. Typical applications include: modeling of ultrasound RF signals (Achim et al., 2015), signal detection theory (Kuruoglu et al., 1998), communications (Middleton, 1999), image processing (Achim & Kuruoglu, 2005), audio processing (Georgiou et al., 1999), sea-surface modeling (Gallagher, 2001), network traffic (Resnick, 1997), and finance (Nolan, 2003; Ling, 2005).

Main contributions: Our main contribution in this paper is a new dictionary learning algorithm based on the signal modeling mentioned above. The proposed method has the following advantages: 1. all parameters can be estimated from the data (it is hyperparameter-free), 2. it learns the dictionary without the need to recover the signal x, and 3. it is fast and remarkably robust. Once the matrix A is estimated, it is then possible to efficiently recover x by using standard procedures (Bickson & Guestrin, 2010). We also show that the proposed algorithm provides an efficient estimator of the spectral measure of a stable random vector. An enabling component of our method is a new theorem that generalizes a classical result about isometries of ℓp-norms.
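Returning to Figure 1, the sparsifying effect of decreasing α is easy to reproduce numerically. The following minimal sketch (our own illustration, assuming SciPy's levy_stable with β = 0 for the symmetric case and its parametrization exp(−|scale·ω|^α), so that γ = scale^α) draws two i.i.d. SαS sequences and reports how much of their total magnitude is concentrated in the few largest samples.

```python
import numpy as np
from scipy.stats import levy_stable

np.random.seed(0)
n_samples = 100

for alpha in (1.0, 1.4):
    # i.i.d. SalphaS samples with unit dispersion (beta=0 gives the symmetric case)
    x = levy_stable.rvs(alpha, 0.0, scale=1.0, size=n_samples)
    # the smaller alpha is, the more the total magnitude concentrates on a few samples
    mags = np.sort(np.abs(x))[::-1]
    top5 = mags[:5].sum() / mags.sum()
    print(f"alpha = {alpha}: the 5 largest samples carry {100 * top5:.1f}% of the total magnitude")
```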
Organization: In the next section, we briefly review SαS random variables and present our mathematical model. In Section 3, we establish our main result, which then yields an algorithm for finding the matrix A as well as the sparsity index α. In Section 4, we present simulation results and compare the performance of our algorithm with that of existing algorithms. In Section 5, we summarize the paper and give some suggestions for future work.

## 2. Preliminaries and problem formulation

We begin by recalling some basic properties of symmetric-α-stable random variables. We then proceed with the formulation of the estimation problem that we solve in Section 3. The notation that we use throughout the paper is as follows: we use italic symbols for random variables, capital boldface symbols for matrices, and lowercase boldface symbols for vectors. Thus, X is a deterministic matrix, X is a random matrix, and x is a random variable. Likewise, x and x denote a random and a deterministic vector, respectively.

### 2.1. Symmetric-α-stable random variables

For any α ∈ (0, 2] and γ > 0, a random variable X with characteristic function

p̂_X(ω) = exp(−γ|ω|^α)    (2)

is called a symmetric-α-stable (SαS) random variable with dispersion γ and stability parameter α (Nikias & Shao, 1995). This class of random variables is a generalization of the Gaussian model: for α = 2, X is a Gaussian random variable with zero mean and variance 2γ.

As their name suggests, α-stable variables share the property of stability under linear combination (Nikias & Shao, 1995); i.e., if X1, . . . , Xn are n i.i.d. copies of X and a1, . . . , an ∈ R are n real numbers, then the random variable

Y = a1 X1 + · · · + an Xn    (3)

has the same distribution as

(|a1|^α + · · · + |an|^α)^(1/α) X.    (4)

In other words, the random variable Y is an SαS random variable with dispersion γ ||a||_α^α, where ||a||_α = (|a1|^α + · · · + |an|^α)^(1/α) is the α-(pseudo)norm of the vector a = (a1, . . . , an). This property makes SαS random variables convenient for the study of linear systems.

The other property of SαS random variables with α < 2 is their heavy-tailed PDF. When α < 2, we have

lim_{|x|→∞} |x|^(1+α) p_X(x) = C(α, γ),    (5)

where p_X is the PDF of X and C(α, γ) is a positive constant that depends on α and γ (Nikias & Shao, 1995). This implies that the variance of SαS random variables is unbounded for α < 2. Also, note that a smaller α results in heavier tails. Infinite-variance random variables are considered to be appropriate candidates for sparse signals (Amini et al., 2011; Gribonval et al., 2012). Because an i.i.d. sequence of heavy-tailed random variables has most of its energy concentrated on a small fraction of samples, they are good candidates to model signals that exhibit sparse behavior.

Yet, the truly fundamental aspect of α-stable random variables is their role in the generalized central limit theorem. As we know, the limit distribution of normalized sums of i.i.d. finite-variance random variables is Gaussian. Likewise, any properly normalized sum of heavy-tailed i.i.d. random variables converges to an α-stable random variable, where the α depends on the weight of their tail (Meerschaert & Scheffler, 2001). This implies that a linear combination of a large number of samples of a sparse signal is well represented by α-stable random variables.

### 2.2. Problem formulation

Our underlying signal model is

y = Ax,    (6)

where x is an unknown n × 1 random vector with i.i.d. SαS entries and α < 2, y is an m × 1 observation vector, and A is an m × n dictionary matrix. We are given K realizations of y, namely y1, . . . , yK, and our goal is to estimate A.
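As a quick numerical sanity check of the stability property (3)-(4), which is what the method of the next section exploits, the sketch below (again assuming SciPy's levy_stable under the parametrization used earlier) compares the empirical characteristic function of Y = a1X1 + · · · + anXn with the theoretical SαS characteristic function of dispersion γ||a||_α^α.

```python
import numpy as np
from scipy.stats import levy_stable

alpha, gamma = 1.3, 1.0
a = np.array([0.5, -1.2, 2.0])                 # mixing weights a_1, ..., a_n
K = 100_000                                    # number of realizations

np.random.seed(1)
X = levy_stable.rvs(alpha, 0.0, scale=gamma ** (1 / alpha), size=(K, a.size))
Y = X @ a                                      # realizations of Y = a_1 X_1 + ... + a_n X_n

# empirical characteristic function of Y versus the SalphaS one with dispersion gamma * ||a||_alpha^alpha
omega = np.linspace(0.1, 2.0, 5)
ecf = np.array([np.cos(w * Y).mean() for w in omega])   # imaginary part vanishes by symmetry
theory = np.exp(-gamma * np.sum(np.abs(a) ** alpha) * omega ** alpha)
print(np.round(ecf, 3))
print(np.round(theory, 3))
```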
## 3. Dictionary learning for SαS signals

In the problem of dictionary learning, the maximum information that we can asymptotically try to retrieve from y1, . . . , yK is the exact distribution of y. However, even if we knew the distribution of y, identifying A would still not be tractable in general; for instance, in the case of Gaussian vectors, A is only identifiable up to right-multiplication by a unitary matrix. Moreover, obtaining an acceptable estimate of the distribution of y requires, in general, a vast amount of data and processing power (since it is an m-dimensional function with m possibly large). In this section, we leverage the property of stability under linear combination of SαS random variables explained in Section 2.1 to propose a new algorithm to estimate A for the dictionary learning problem stated in Section 2.2.

### 3.1. New cost function for sparse SαS signals

Recall that the random vector y (see Equations (2) and (6)) is an m-dimensional α-stable vector with characteristic function

p̂_y(ω) = exp(−γ ||A^T ω||_α^α)    (7)

for ω ∈ R^m. Thus, knowing ||A^T u||_α for all u ∈ S^(m−1), where S^(m−1) is the (m−1)-dimensional unit sphere, i.e.,

S^(m−1) = {u ∈ R^m : ||u||_2 = 1},    (8)

is equivalent to knowing the distribution of y. Note that u^T y = u^T Ax (see Equations (3) and (4)) is an SαS random variable with dispersion

γ(u) = γ ||A^T u||_α^α.    (9)

Thus, knowing the dispersion of the marginal distributions of y for all u ∈ S^(m−1) is equivalent to knowing the distribution of y. In other words, in the case of SαS random vectors, knowing their marginal dispersions is equivalent to knowing the Radon transform of their PDFs or, equivalently, their joint characteristic function (Helgason, 2010). Due to the relationship between the Radon transform and the field of tomography, we call our algorithm sparse distribution tomography (SparsDT).

Another interesting fact is that, in the non-Gaussian case (α < 2), knowing the marginal dispersions of y, i.e., γ(u), identifies the matrix A uniquely, up to negation and permutation of the columns. Formally, we have the following theorem, which is proved in Appendix A:

Theorem 1. Let A be an m × n matrix whose columns are pairwise linearly independent. If α ∈ (0, 2) and B is an m × n matrix for which we have

||A^T u||_α^α = ||B^T u||_α^α    (10)

for all u ∈ R^m, then B is equal to A up to negation and permutation of its columns.

Remark 1. This theorem can be seen as a generalization of the result in (Rolewicz, 1985), which states that the isometries of ℓp-norms are generalized permutation matrices (permutation matrices with some of their rows negated). To the best of our knowledge, this result is novel and could be of independent interest.

This theorem suggests that, in order to find A, all we need is to find γ(u) for u ∈ R^m. Intuitively, we can say that, as A has a finite number of parameters (entries), A is identifiable based on the knowledge of γ(u) for an appropriate finite set of vectors u ∈ {u1, . . . , uL} (for some L ≥ mn). We can then solve the set of non-linear equations

γ ||B^T u1||_α^α = γ(u1),
...                                  (11)
γ ||B^T uL||_α^α = γ(uL),

for B to obtain A. Now, the problem is to find γ(u) for a given u ∈ R^m. Recall that γ(u) is the dispersion of the SαS random variable u^T y. As y1, . . . , yK are realizations of y, u^T y1, . . . , u^T yK are realizations of u^T y. There is a rich literature on the estimation of the parameters of α-stable random variables through their realizations; see, e.g., (Nikias & Shao, 1995).
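Before turning to the estimation of γ(u), here is a small numerical illustration of the dispersion map u ↦ γ(u) = γ ||A^T u||_α^α from (9)-(11) and of the invariance stated in Theorem 1; the helper name dispersion_map is ours, not the paper's.

```python
import numpy as np

def dispersion_map(A, U, alpha, gamma=1.0):
    """gamma(u_l) = gamma * ||A^T u_l||_alpha^alpha for every projection direction u_l (rows of U)."""
    return gamma * np.sum(np.abs(U @ A) ** alpha, axis=1)

rng = np.random.default_rng(2)
m, n, L, alpha = 4, 6, 40, 1.2
A = rng.standard_normal((m, n))
U = rng.standard_normal((L, m))

# B equal to A up to column permutation and sign flips: identical dispersions for every u (Theorem 1)
B = A[:, rng.permutation(n)] * rng.choice([-1.0, 1.0], size=n)
print(np.allclose(dispersion_map(A, U, alpha), dispersion_map(B, U, alpha)))   # True

# a genuinely different matrix gives different dispersions
C = rng.standard_normal((m, n))
print(np.allclose(dispersion_map(A, U, alpha), dispersion_map(C, U, alpha)))   # False
```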
We use the estimator from (Achim et al., 2015), given by

log γ̂(u) = α (1/K) Σ_{k=1}^K log |u^T yk| − (α − 1) ψ(1),    (12)

where ψ is the digamma function (ψ(1) is the negative of the Euler-Mascheroni constant, approximately −0.5772), and γ̂(u) denotes the estimate of γ(u). Note that γ̂(u) tends to γ(u) when the number of observations, K, tends to infinity. This means that we can obtain the exact value of γ(u) asymptotically. However, non-exact values of γ(uℓ), for ℓ = 1, . . . , L (which is the case when K is finite), can lead to the nonexistence of a solution for the system of equations (11). To overcome this problem, instead of solving this system of equations exactly, we minimize the objective function

E(B) = Σ_{ℓ=1}^L d(γ ||B^T uℓ||_α^α, γ̂(uℓ)) = (1/α) Σ_{ℓ=1}^L |log(γ ||B^T uℓ||_α^α) − log(γ̂(uℓ))|,    (13)

where log γ̂(u1), . . . , log γ̂(uL) are L numbers calculated from (12). The cost function d(a, b) = (1/α)|log a − log b| is a continuous positive function(1) from R^2 to R, whose global minimum is 0 and is reached over the line a = b. When γ̂(u) = γ(u), B = A minimizes E(B). Thus, if γ̂(u) is close enough to γ(u), due to the continuity of d, we expect the minimizer of E to be close to A. Therefore, our approach to dictionary learning is to solve

Â = argmin_B E(B) = argmin_B (1/α) Σ_{ℓ=1}^L |log(γ ||B^T uℓ||_α^α) − log γ̂(uℓ)|.    (14)

The only parameter that needs to be set now is the stability parameter α. Note that the dispersion parameter γ in Equation (14) does not need to be set, as it will be automatically merged into the learned dictionary.

(1) In our simulations we also implemented other natural candidates for d(a, b) and all of them gave approximately the same performance. Due to the limited space, we do not present results for other cost functions.

Recall that there are well-known methods for estimating α from data, among which we use

α̂(u) = ( (6/π^2) (1/K) Σ_{k=1}^K (log |u^T yk| − log κ̂(u))^2 − 1/2 )^(−1/2)    (15)

from (Achim et al., 2015), where

log κ̂(u) = (1/K) Σ_{k=1}^K log |u^T yk|.    (16)

This gives us an estimate of α for any given u ∈ R^m. Hence, the estimated value of α is the average over all α̂(uℓ) for ℓ = 1, . . . , L, i.e.,

α̂ = (1/L) Σ_{ℓ=1}^L α̂(uℓ).    (17)

Now, using this estimate, Equation (12) becomes

log γ̂(u) = α̂ log κ̂(u) − (α̂ − 1) ψ(1).    (18)

We also replace α with α̂ in Equation (13), which results in a parameter-free cost function. This is in contrast with most existing cost functions, which have parameters that one must set.

### 3.2. Learning algorithm

To solve the minimization problem in Equation (14), we propose a variation on a gradient-descent algorithm with an adaptive step size and a changing cost function. To do so, we first derive the gradient of E at B. Using matrix calculus (see Appendix B), we find that

∇E(B) = (1/α) Σ_{ℓ=1}^L sgn( log(γ ||B^T uℓ||_α^α) − log γ̂(uℓ) ) ∇_B ||B^T uℓ||_α^α / ||B^T uℓ||_α^α,    (19)

where sgn(·) is the sign function (i.e., sgn(e) = 1 if e > 0, sgn(e) = −1 if e < 0, and sgn(0) = 0) and

∇_B ||B^T u||_α^α = α [ sgn(b_1^T u) |b_1^T u|^(α−1) u, . . . , sgn(b_n^T u) |b_n^T u|^(α−1) u ],    (20)

where b_i is the ith column of B.

The cost function in Equation (13) is non-convex in B. In order to avoid getting trapped in local minima, we iteratively change the cost function inside the gradient-descent algorithm. The idea is that, instead of keeping u1, . . . , uL fixed throughout the optimization process, we regenerate them at random (uniformly in direction) after some iterations of steepest descent. We repeat this process until convergence. Note that (11) holds for any choice of u1, . . . , uL, and thus changing this set does not change the sought solution.
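The log-moment estimators (12) and (15)-(16), as reconstructed above, can be implemented in a few lines; the function names below are ours, and the sanity check assumes the SciPy parametrization used earlier (scale = γ^(1/α)).

```python
import numpy as np
from scipy.stats import levy_stable

EULER_GAMMA = 0.5772156649015329       # Euler-Mascheroni constant, equal to -psi(1)

def estimate_alpha(z):
    """alpha_hat(u) from projected samples z_k = u^T y_k, as in (15)-(16)."""
    logz = np.log(np.abs(z))
    inv_alpha_sq = 6.0 * np.var(logz) / np.pi ** 2 - 0.5
    return 1.0 / np.sqrt(inv_alpha_sq)

def estimate_log_gamma(z, alpha):
    """log gamma_hat(u) from projected samples z_k = u^T y_k, as in (12)."""
    return alpha * np.mean(np.log(np.abs(z))) + (alpha - 1.0) * EULER_GAMMA

# sanity check on synthetic SalphaS samples
np.random.seed(3)
alpha_true, gamma_true = 1.4, 2.0
z = levy_stable.rvs(alpha_true, 0.0, scale=gamma_true ** (1 / alpha_true), size=50_000)
alpha_hat = estimate_alpha(z)
print(alpha_hat, np.exp(estimate_log_gamma(z, alpha_hat)))   # should be close to (1.4, 2.0)
```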
Remark 2. Regenerating the projection vectors in this way always resulted in convergence to the global minimum in our computer simulations. A plausible explanation of this phenomenon is that each set of u1, . . . , uL yields a non-convex cost function with different local minima, yet they all have the same global minimum. Therefore, switching between them during the optimization process prevents getting trapped in any of the local minima, which ultimately results in finding the global minimum of the cost function.

The pseudocode of our dictionary learning method is given in Algorithm 1. There, η is the step size of the gradient descent, which increases or decreases by a factor of κ+ or κ− upon taking a good or poor step. The adaptive step size is especially helpful for α ≤ 1, where the cost function is not smooth. The algorithm does not depend on the exact choice of convergence criteria.

Algorithm 1 SparsDT
1: initialize: η > 0
2: initialize: κ+ ≥ 1 and κ− ≤ 1
3: initialize: generate b1, . . . , bn ~ N(0, I_{m×m}) and B ← [b1 | . . . | bn]
4: repeat
5:   initialize: generate u1, . . . , uL ~ N(0, I_{m×m})
6:   estimate α̂ from (15)
7:   E ← E(B)
8:   repeat
9:     Bold ← B
10:    Eold ← E
11:    B ← B − η ∇E(B)
12:    E ← E(B)
13:    if E ≤ Eold then
14:      η ← κ+ η
15:    else
16:      B ← Bold
17:      E ← Eold
18:      η ← κ− η
19:    end if
20:   until B converges (for this choice of u1, . . . , uL)
21: until B converges
return B

Remark 3. Algorithm 1 can also be seen as an efficient method for estimating the spectral measure of stable random vectors. In fact, the problem of estimating A from a set of realizations of y can also be seen as parameter estimation for a stable random vector y with a symmetric distribution around the origin. Such random vectors are parametrized by a measure Γ on S^(m−1) that is called the spectral measure. In our problem, we have Γ(·) = Σ_{i=1}^n ||a_i||_2^α δ_{ã_i}(·), where the δ_{ã_i} are unit point masses at ã_i = a_i / ||a_i||_2 and a_i is the ith column of A. Some methods have been proposed to solve this problem, e.g., (Nolan et al., 2001). However, they tend to be computationally intractable for dimensions greater than 3.

Remark 4. According to the generalized central limit theorem, under some mild conditions, the distribution of the sum of symmetric heavy-tailed random variables tends to an SαS distribution as the number of summands tends to infinity. This means that we can represent u^T y = u^T Ax by an SαS random variable for large enough n, irrespective of the distribution of the xi, provided that the latter are heavy-tailed. Therefore, we expect Algorithm 1 to find applications for other classes of sparse signals, provided that n is sufficiently large.

## 4. Empirical results

In this section, we analyze the performance of the proposed algorithm SparsDT and compare it with existing methods. Recall that the actual dictionary is A and the learned dictionary is Â. We run two types of experiments: we first test the algorithm on synthetic SαS data and then on real images.

### 4.1. Benchmarks

We compare our algorithm with three commonly used algorithms that are available in the Python package SPAMS(2). These constrained optimization problems(3), in which both the dictionary B and the codes xk are optimized, are as follows:

1. ℓ2/ℓ1: maximizing the data fidelity while controlling the sparsity of the representation with parameter λ1:
Â_{ℓ2/ℓ1} = argmin_B Σ_{k=1}^K ||yk − B xk||_2^2   s.t.   ||xk||_1 ≤ λ1.

2. ℓ1/ℓ2: maximizing the sparsity of the representation while controlling the data fidelity with parameter λ2:
Â_{ℓ1/ℓ2} = argmin_B Σ_{k=1}^K ||xk||_1   s.t.   ||yk − B xk||_2 ≤ λ2.

3. ℓ1 + ℓ2: combining sparsity and data fidelity in the cost function using Lagrange multipliers:
Â_{ℓ1+ℓ2} = argmin_B Σ_{k=1}^K ||yk − B xk||_2^2 + λ3 ||xk||_1 + λ4 ||xk||_2^2.

(2) http://spams-devel.gforge.inria.fr/
(3) Other cost functions are also available in the package SPAMS, but those retained here yield the best results in our experiments.
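Before turning to the experimental comparison, here is a compact Python sketch of Algorithm 1 based on the reconstructions above. It is an illustrative reimplementation rather than the authors' code: γ is absorbed into the learned dictionary as described in Section 3.1, the convergence tests are replaced by fixed iteration counts, and all constants are illustrative choices.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329                     # -psi(1)

def estimate_alpha(z):
    """Log-moment estimator of alpha (Eq. (15)-(16)) from projected samples z_k = u^T y_k."""
    logz = np.log(np.abs(z))
    return 1.0 / np.sqrt(6.0 * np.var(logz) / np.pi ** 2 - 0.5)

def estimate_log_gamma(z, alpha):
    """Log-moment estimator of log gamma(u) (Eq. (12))."""
    return alpha * np.mean(np.log(np.abs(z))) + (alpha - 1.0) * EULER_GAMMA

def grad_norm_alpha(B, u, alpha):
    """Gradient of ||B^T u||_alpha^alpha with respect to B (Eq. (20))."""
    z = B.T @ u
    return alpha * np.outer(u, np.sign(z) * np.abs(z) ** (alpha - 1.0))

def cost_and_grad(B, U, log_gamma_hat, alpha):
    """Cost (13) and its gradient (19) for the current projections U (L x m), gamma absorbed into B."""
    E, G = 0.0, np.zeros_like(B)
    for u, lg in zip(U, log_gamma_hat):
        nrm = np.sum(np.abs(B.T @ u) ** alpha)
        r = np.log(nrm) - lg
        E += np.abs(r) / alpha
        G += np.sign(r) / (alpha * nrm) * grad_norm_alpha(B, u, alpha)
    return E, G

def sparsdt(Y, n, L=100, eta=0.1, kappa_plus=1.2, kappa_minus=0.5, outer=50, inner=200, seed=0):
    """Y: m x K data matrix; n: number of atoms to learn; the constants are illustrative."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((Y.shape[0], n))
    for _ in range(outer):                            # regenerate the projections u_1, ..., u_L
        U = rng.standard_normal((L, Y.shape[0]))
        Z = U @ Y                                     # projected samples u_l^T y_k
        alpha = float(np.mean([estimate_alpha(z) for z in Z]))
        log_gamma_hat = np.array([estimate_log_gamma(z, alpha) for z in Z])
        E, _ = cost_and_grad(B, U, log_gamma_hat, alpha)
        for _ in range(inner):                        # adaptive-step gradient descent (steps 8-20)
            _, G = cost_and_grad(B, U, log_gamma_hat, alpha)
            B_new = B - eta * G
            E_new, _ = cost_and_grad(B_new, U, log_gamma_hat, alpha)
            if E_new <= E:
                B, E, eta = B_new, E_new, eta * kappa_plus
            else:
                eta *= kappa_minus                    # reject the step and shrink the step size
    return B
```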
One of the challenges in utilizing these benchmarks is determining the regularization parameters λ1, . . . , λ4. In our experiments, the regularization parameters are optimized (by grid search) in order to maximize the performance of each of the benchmarks above. This is in contrast to our algorithm, which has no regularization parameter to tune.

### 4.2. Experimental results for synthetic data

We first test the algorithms on synthetic data. In order to quantify the performance of the algorithms, we use several metrics. One is the average correlation of the dictionaries. Specifically, we calculate the correlation between all columns of Â and A, and then match each column of Â with one of the columns of A (a one-to-one map) such that the average correlation between the corresponding columns is maximized. Additionally, we say that the dictionary is found if the average correlation of the columns is larger than 0.97.

Effect of K: We demonstrate the effect of the number of samples K on the performance of our proposed algorithm SparsDT. Intuitively, the precision of the estimation increases with the number of samples K and, as K goes to infinity, the estimation error goes to zero, which ultimately gives the exact A. We demonstrate this effect with the following experiment: we take m = 16, n = 24, and α = 1 and 1.5. Then, for each K, we run the experiment for 50 random matrices A and, for each case, we run Algorithm 1 with both the exact and the estimated α (from (17)). The results are depicted in Figure 2, where the vertical axis is the average correlation of the estimated dictionary with the exact one, and the horizontal axis is the number of samples K. Interestingly, the performance of the algorithm is almost the same when using the exact or the estimated value of α, which suggests that the estimator of α is robust and accurate. Moreover, we see that the average correlation is an increasing function of K, as expected. Also note that the convergence is faster for α = 1, which corresponds to the setting with more sparsity.

Figure 2. Impact of the number of samples K (from 50 to 500) on the average correlation (%) for a 16 × 24 matrix A; curves show exact and estimated α for α = 1 and α = 1.5.

| Algorithm | % found | Avg. time (s) |
| --- | --- | --- |
| SparsDT | 100 | 5.19 |
| ℓ2/ℓ1 | 50 | 0.07 |
| ℓ1/ℓ2 | 75 | 95.75 |
| ℓ1 + ℓ2 | 94 | 19.89 |

Table 1. Performance of the algorithms on SαS signals (α = 1.2, 16 × 24 matrix A, K = 500).

Comparison against benchmarks: We compare SparsDT against the ℓ2/ℓ1, ℓ1/ℓ2, and ℓ1 + ℓ2 methods described above. We compare the algorithms with regard to their success rate (i.e., the percentage of the dictionaries found by the algorithm) and the time that they take to find the dictionary (in the cases of success only). We again take m = 16, n = 24 and generate 100 random matrices A. In Table 1, the results for α = 1.2 and K = 500 are given. Finally, in Figure 3 we compare the algorithms' success rates for different α, with m = 16, n = 24, and K = 1000. These results indicate that SparsDT outperforms the other methods in terms of success rate. Also, its average learning time is typically much lower than that of the others, with the exception of ℓ2/ℓ1, which is faster but, even at its best, fails to find the correct dictionary in at least 10% of the cases.
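The average-correlation metric with its one-to-one column matching can be computed with a standard assignment solver; the sketch below is one possible implementation (the helper names and the use of scipy.optimize.linear_sum_assignment are our choices, not specified in the paper), with the 0.97 threshold quoted above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def average_correlation(A, A_hat):
    """Average |correlation| under the best one-to-one column matching (sign-invariant)."""
    An = A / np.linalg.norm(A, axis=0)
    Bn = A_hat / np.linalg.norm(A_hat, axis=0)
    C = np.abs(An.T @ Bn)                        # |correlation| between every pair of columns
    rows, cols = linear_sum_assignment(-C)       # Hungarian algorithm, maximizing total correlation
    return C[rows, cols].mean()

def dictionary_found(A, A_hat, threshold=0.97):
    """'Found' criterion used in Section 4.2."""
    return average_correlation(A, A_hat) > threshold
```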
Figure 3. Impact of α ∈ [1, 1.6] on the success rate (% of found dictionaries) of SparsDT, ℓ2/ℓ1, ℓ1/ℓ2, and ℓ1 + ℓ2.

The range of α that was observed in our experiments is α ∈ [1, 1.6], which is also the range where our algorithm works well and which is interesting for many applications, including image processing. We do not recommend using the method for α > 1.7 because the convergence properties degrade as we get closer to 2 (a larger value of K is then needed to reach high success rates).

### 4.3. Experimental results for real images

Since images often have sparse representations, we apply our algorithm to problems in image-processing applications. Our experiments are missing-pixel recovery (inpainting) and denoising, based on dictionary learning. We use a database of face images provided by AT&T(4) and crop them to size 112 × 91, so that each image can be chopped into 208 patches of size 7 × 7, which correspond to the yi in our model. In this situation, the data is not exactly SαS, so we must adapt our choice of u in Step 5 of Algorithm 1. Specifically, in Equation (17) we eliminate projection vectors u that result in an α greater than 2 (as α is required to be less than 2). In addition, we only select vectors u that result in an α close to our estimated α̂ in (17). The number of iterations is chosen such that all algorithms converge.

(4) www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html

Missing-pixel recovery: In this experiment, we reconstruct an image from 50% of its pixels. We hold out the image shown in Figure 4, remove 50% of its pixels uniformly at random, and learn the dictionary using the patches of the other images in the collection. We assume 248 atoms in the dictionary. Then, using the learned dictionary, we reconstruct the image using orthogonal matching pursuit (for a detailed analysis, see (Sahoo & Makur, 2015)). The results for the different dictionary learning methods are depicted in Figure 4; SparsDT outperforms the existing methods both visually and in terms of PSNR.

Figure 4. Comparing the proposed method with the existing ones for recovering missing pixels: original image; image with 50% missing pixels; SparsDT, PSNR 28.91 dB; ℓ2/ℓ1, PSNR 27.6 dB; ℓ1/ℓ2, PSNR 26.03 dB; ℓ1 + ℓ2, PSNR 26.48 dB.

Image denoising: In this experiment, we use the dictionaries learned in the previous experiment to denoise the image in Figure 4. More precisely, we add Gaussian noise with standard deviation 10 to the original image and use orthogonal matching pursuit to denoise it. The performance of each method in terms of PSNR is given in Table 2. As we see, SparsDT outperforms the other methods by at least 0.6 dB.

| Algorithm | PSNR (dB) |
| --- | --- |
| SparsDT | 29.61 |
| ℓ2/ℓ1 | 28.98 |
| ℓ1/ℓ2 | 28.74 |
| ℓ1 + ℓ2 | 28.98 |

Table 2. Performance of the different methods for denoising images contaminated by Gaussian noise.
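The per-patch reconstruction step described above (recovering a patch from its observed pixels with the learned dictionary and orthogonal matching pursuit) can be sketched as follows, using scikit-learn's OrthogonalMatchingPursuit; the sparsity level is an illustrative choice and this is not necessarily the exact procedure used in the experiments.

```python
# A hedged sketch of per-patch inpainting with a learned dictionary D and OMP.
# D has one row per pixel of a vectorized 7x7 patch, i.e., shape (49, 248) here.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def inpaint_patch(y, mask, D, n_nonzero_coefs=10):
    """y: flattened patch with missing entries; mask: boolean array marking observed pixels."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero_coefs, fit_intercept=False)
    omp.fit(D[mask, :], y[mask])      # sparse-code the patch using only its observed pixels
    x = omp.coef_
    return D @ x                      # re-synthesize the full patch, filling the missing pixels
```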
## 5. Summary and future work

In this paper, we consider a stochastic generation model for sparse signals that involves an SαS innovation. Then, by designing an estimator of the spectral measure of the so-defined stable random vectors, we propose a new dictionary learning algorithm. The proposed algorithm (SparsDT) turns out to be quite robust; it works well on sparse real-world signals, even when these do not rigorously follow the SαS model. This surprising fact can be explained by invoking the generalized central limit theorem. We validated SparsDT on several image-processing tasks and found it to outperform popular dictionary learning methods, often significantly. Moreover, SparsDT has no parameter to tune, contrary to other algorithms.

Extending this work to non-symmetric α-stable random variables is a possible direction for future research. Given the excellent numerical behavior of the algorithm, it is also of interest to get a good handle on the accuracy of the estimation in terms of the number of samples and the dimensionality of the signals.

## A. Proof of Theorem 1

Denote the jth columns of A and B by a_j and b_j, respectively. Also, denote the set of indices j for which b_j ≠ 0 by B ⊆ {1, . . . , n}. Note that, due to the assumption of pairwise linear independence of the columns of A, a_j ≠ 0 for all j ∈ {1, . . . , n}.

Since ||A^T u||_α^α = ||B^T u||_α^α for all u ∈ R^m, the partial derivatives of any order of the two sides of this equation are also equal. In particular, we have

∂^d/∂u_i^d ||A^T u||_α^α = ∂^d/∂u_i^d ||B^T u||_α^α    (21)

for all i = 1, . . . , m and d ∈ N, where u_i is the ith entry of u.

First we prove the theorem for 0 < α ≤ 1. In (21), we set d = 1 and obtain

Σ_{j=1}^n |u^T a_j|^(α−1) sgn(u^T a_j) a_{ij} = Σ_{j∈B} |u^T b_j|^(α−1) sgn(u^T b_j) b_{ij}.    (22)

Exploiting this equation, we prove the following lemma.

Lemma 1. Under the assumptions of Theorem 1, for any j′ ∈ {1, . . . , n}, there exist j ∈ B and t_j ≠ 0 such that t_j a_{j′} = b_j.

Proof: Take i′ such that a_{i′j′} ≠ 0. Also, for all r = 1, . . . , n, define

K_r^a = {u ∈ R^m : u^T a_r = 0},    (23)

which is an (m − 1)-dimensional subspace of R^m. Since, for any j ≠ j′, a_j and a_{j′} are linearly independent, the subspace K_{j′}^a ∩ K_j^a is (m − 2)-dimensional. This implies that its (m − 1)-dimensional Lebesgue measure is zero, and the same holds for the union ∪_{j≠j′} (K_{j′}^a ∩ K_j^a). Since

K_{j′}^a \ ∪_{j≠j′} K_j^a = K_{j′}^a \ ∪_{j≠j′} (K_{j′}^a ∩ K_j^a),    (24)

we conclude that the (m − 1)-dimensional Lebesgue measure of K_{j′}^a \ ∪_{j≠j′} K_j^a is infinite. Note that any u ∈ K_{j′}^a \ ∪_{j≠j′} K_j^a is orthogonal only to a_{j′} and to no other column of A. This yields that, if we set i = i′ in the left-hand side of (22), then for any u ∈ K_{j′}^a \ ∪_{j≠j′} K_j^a the only discontinuous term at u is the j′th one (because the function |x|^(α−1) sgn(x) has a single point of discontinuity, at x = 0). As a result, the sum itself is discontinuous over K_{j′}^a \ ∪_{j≠j′} K_j^a. Hence, the same should hold for the right-hand side of the equation. Similarly to (23), define

K_r^b = {u ∈ R^m : u^T b_r = 0}.    (25)

The set of discontinuity points of the right-hand side of (22) is a subset of ∪_{j∈B} K_j^b. Therefore, we have

K_{j′}^a \ ∪_{j≠j′} K_j^a ⊆ ∪_{j∈B} K_j^b,    (26)

which can also be written as

K_{j′}^a \ ∪_{j≠j′} K_j^a ⊆ ∪_{j∈B} (K_{j′}^a ∩ K_j^b).    (27)

Now, if none of the columns of B is linearly dependent on a_{j′}, then all the K_{j′}^a ∩ K_j^b are (m − 2)-dimensional subspaces and their (m − 1)-dimensional Lebesgue measure is zero. This implies that the (m − 1)-dimensional Lebesgue measure of the right-hand side of (27) is also zero, which contradicts the result after (24). Therefore, there exists j ∈ B such that b_j is linearly dependent on a_{j′}, which completes the proof of the lemma.

The first consequence of Lemma 1 is that none of the columns of B is the zero vector and thus B = {1, . . . , n}. Also, since all pairs of columns of A are linearly independent, the correspondence between a column of A and a column of B that are linearly dependent is one-to-one. Thus, writing b_j = t_j a_j after a suitable permutation, we can simplify (22) to

Σ_{j=1}^n (1 − |t_j|^α) |u^T a_j|^(α−1) sgn(u^T a_j) a_{ij} = 0,    (28)

which holds for all u. This implies that the left-hand side of the above equation is a continuous function. However, as we saw in the proof of the lemma, every u ∈ K_{j′}^a \ ∪_{j≠j′} K_j^a is a discontinuity point of the left-hand side unless 1 − |t_{j′}|^α = 0, which completes the proof for the case of 0 < α ≤ 1.
For the case of 1 < α < 2, we set d = 2 in (21) and obtain

Σ_{j=1}^n |u^T a_j|^(α−2) a_{ij}^2 = Σ_{j∈B} |u^T b_j|^(α−2) b_{ij}^2.    (29)

Replacing (22) by (29), the same reasoning as for 0 < α ≤ 1 proves the theorem for 1 < α < 2.

## B. Derivation of the gradient of E(B)

To calculate the gradient of E(B), we first calculate the gradient of ||B^T u||_α^α using the definition of the gradient, i.e.,

⟨∇_B ||B^T u||_α^α, C⟩ = α Σ_{j=1}^n c_j^T u sgn(b_j^T u) |b_j^T u|^(α−1).

Here, ⟨D, C⟩ = tr(D^T C) is the standard inner product on the space of matrices. Writing the last equation in matrix form, we obtain (20). Now, using the fact that d/dx |log x| = sgn(log x) (1/x) and the chain rule for differentiation yields (19).

## Acknowledgements

The research was partially supported by the Hasler Foundation under Grant 16009, by the European Research Council under Grant 692726 (H2020-ERC Project GlobalBioIm), and by the SNF Project Grant 205121-163385.

## References

Achim, A. and Kuruoglu, E. Image denoising using bivariate α-stable distributions in the complex wavelet domain. IEEE Signal Processing Letters, 12(1):17-20, 2005.

Achim, A., Basarab, A., Tzagkarakis, G., Tsakalides, P., and Kouamé, D. Reconstruction of ultrasound RF echoes modeled as stable random variables. IEEE Transactions on Computational Imaging, 1(2):86-95, June 2015.

Aharon, M., Elad, M., and Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311-4322, 2006.

Amini, A., Unser, M., and Marvasti, F. Compressibility of deterministic and random infinite sequences. IEEE Transactions on Signal Processing, 59(11):5193-5201, November 2011.

Bickson, D. and Guestrin, C. Inference with multivariate heavy-tails in linear models. In Advances in Neural Information Processing Systems, pp. 208-216, 2010.

Gallagher, C. A method for fitting stable autoregressive models using the autocovariation function. Statistics & Probability Letters, 53(4):381-390, 2001.

Georgiou, P., Tsakalides, P., and Kyriakakis, C. Alpha-stable modeling of noise and robust time-delay estimation in the presence of impulsive noise. IEEE Transactions on Multimedia, 1(3):291-301, 1999.

Gribonval, R. and Lesage, S. A survey of sparse component analysis for blind source separation: principles, perspectives, and new challenges. In ESANN'06, Proceedings of the 14th European Symposium on Artificial Neural Networks, pp. 323-330, 2006.

Gribonval, R., Cevher, V., and Davies, M. E. Compressible distributions for high-dimensional statistics. IEEE Transactions on Information Theory, 58(8):5016-5034, August 2012.

Helgason, S. Integral Geometry and Radon Transforms. Springer, 2010.

Herault, J. and Jutten, C. Space or time adaptive signal processing by neural network models. In Neural Networks for Computing, volume 151, pp. 206-211. AIP Publishing, 1986.

Hyvärinen, A. and Oja, E. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411-430, 2000.

Hyvärinen, A., Karhunen, J., and Oja, E. Independent Component Analysis, volume 46. John Wiley & Sons, 2004.

Kuruoglu, E. E., Fitzgerald, W. J., and Rayner, P. J. Near optimal detection of signals in impulsive noise modeled with a symmetric α-stable distribution. IEEE Communications Letters, 2(10):282-284, 1998.

Lathauwer, L. De and Castaing, J. Blind identification of underdetermined mixtures by simultaneous matrix diagonalization. IEEE Transactions on Signal Processing, 56(3):1096-1105, 2008.

Lathauwer, L. De, Castaing, J., and Cardoso, J. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965-2973, 2007.
Ling, S. Self-weighted least absolute deviation estimation for infinite variance autoregressive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(3):381-393, 2005.

Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19-60, 2010.

Marvasti, F., Amini, A., Haddadi, F., Soltanolkotabi, M., Khalaj, B. H., Aldroubi, A., Sanei, S., and Chambers, J. A unified approach to sparse signal processing. EURASIP Journal on Advances in Signal Processing, 2012(1):1, 2012.

Meerschaert, M. and Scheffler, H. Limit Distributions for Sums of Independent Random Vectors: Heavy Tails in Theory and Practice, volume 321. John Wiley & Sons, 2001.

Middleton, D. Non-Gaussian noise models in signal processing for telecommunications: New methods and results for Class A and Class B noise models. IEEE Transactions on Information Theory, 45(4):1129-1149, 1999.

Naik, G. and Kumar, D. An overview of independent component analysis and its applications. Informatica, 35(1), 2011.

Nikias, C. L. and Shao, M. Signal Processing with Alpha-Stable Distributions and Applications. Wiley, New York, 1995.

Nolan, J. P. Modeling financial data with stable distributions. In Handbook of Heavy Tailed Distributions in Finance, volume 1, pp. 105-130, 2003.

Nolan, J. P., Panorska, A. K., and McCulloch, J. H. Estimation of stable spectral measures. Mathematical and Computer Modelling, 34(9):1113-1122, 2001.

Pad, P. and Unser, M. Optimality of operator-like wavelets for representing sparse AR(1) processes. IEEE Transactions on Signal Processing, 63(18):4827-4837, September 2015.

Resnick, S. Heavy tail modeling and teletraffic data: special invited paper. The Annals of Statistics, 25(5):1805-1869, 1997.

Rolewicz, S. Metric Linear Spaces. Mathematics and its Applications (East European Series). D. Reidel, 1985.

Sahoo, S. and Makur, A. Signal recovery from random measurements via extended orthogonal matching pursuit. IEEE Transactions on Signal Processing, 63(10):2572-2581, 2015.

Samoradnitsky, G. and Taqqu, M. Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance, volume 1. CRC Press, 1994.

Shao, M. and Nikias, C. L. Signal processing with fractional lower order moments: stable processes and their applications. Proceedings of the IEEE, 81(7):986-1010, July 1993.

Spielman, D., Wang, H., and Wright, J. Exact recovery of sparsely-used dictionaries. In COLT, pp. 37.1-37.18, 2012.