# Provably Accurate Double-Sparse Coding

Journal of Machine Learning Research 20 (2019) 1-43. Submitted 12/17; Revised 5/19; Published 9/19.

Thanh V. Nguyen (thanhng@iastate.edu), Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA

Raymond K. W. Wong (raywong@stat.tamu.edu), Department of Statistics, Texas A&M University, College Station, TX 77843, USA

Chinmay Hegde (chinmay@iastate.edu), Department of Electrical and Computer Engineering, Iowa State University, Ames, IA 50011, USA

Editor: Animashree Anandkumar

Abstract

Sparse coding is a crucial subroutine in algorithms for various signal processing, deep learning, and other machine learning applications. The central goal is to learn an overcomplete dictionary that can sparsely represent a given input dataset. However, a key challenge is that storage, transmission, and processing of the learned dictionary can be untenably high if the data dimension is high. In this paper, we consider the double-sparsity model introduced by Rubinstein et al. (2010b) where the dictionary itself is the product of a fixed, known basis and a data-adaptive sparse component. First, we introduce a simple algorithm for double-sparse coding that can be amenable to efficient implementation via neural architectures. Second, we theoretically analyze its performance and demonstrate asymptotic sample complexity and running time benefits over existing (provable) approaches for sparse coding. To our knowledge, our work introduces the first computationally efficient algorithm for double-sparse coding that enjoys rigorous statistical guarantees. Finally, we corroborate our theory with several numerical experiments on simulated data, suggesting that our method may be useful for problem sizes encountered in practice.

Keywords: Sparse coding, provable algorithms, unsupervised learning

©2019 Thanh V. Nguyen, Raymond K. W. Wong, and Chinmay Hegde. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-728.html.

1. Introduction

1.1. Motivation

Representing signals as sparse linear combinations of atoms from a dictionary is a popular approach in many domains. In this paper, we study the problem of dictionary learning (also known as sparse coding), where the goal is to learn an efficient basis (dictionary) that represents the underlying class of signals well. In the typical sparse coding setup, the dictionary is overcomplete (i.e., the cardinality of the dictionary exceeds the ambient signal dimension) while the representation is sparse (i.e., each signal is encoded by a combination of only a few dictionary atoms).

Sparse coding has a rich history in diverse fields such as signal processing, machine learning, and computational neuroscience. Discovering optimal basis representations of data is a central focus of image analysis (Krim et al., 1999; Elad and Aharon, 2006; Rubinstein et al., 2010a), and dictionary learning has proven widely successful in imaging problems such as denoising, deconvolution, inpainting, and compressive sensing (Elad and Aharon, 2006; Candes and Tao, 2005; Rubinstein et al., 2010a). Sparse coding approaches have also been used as a core building block of deep learning systems for prediction (Gregor and Le Cun, 2010; Boureau et al., 2010) and associative memory (Mazumdar and Rawat, 2017).
Interestingly, the seminal work by Olshausen and Field (1997) has shown intimate connections between sparse coding and neuroscience: the dictionaries learned from image patches of natural scenes bear remarkable resemblance to spatial receptive fields observed in the mammalian primary visual cortex.

From a mathematical standpoint, the sparse coding problem is formulated as follows. Given p data samples Y = [y^{(1)}, y^{(2)}, . . . , y^{(p)}] ∈ R^{n×p}, the goal is to find a dictionary D ∈ R^{n×m} (m > n) and corresponding sparse code vectors X = [x^{(1)}, x^{(2)}, . . . , x^{(p)}] ∈ R^{m×p} such that the representation DX fits the data samples as well as possible. Typically, one obtains the dictionary and the code vectors as the solution to the following optimization problem:

min_{D,X} L(D, X) = (1/p) Σ_{j=1}^{p} ‖y^{(j)} − D x^{(j)}‖_2^2,   subject to   Σ_{j=1}^{p} S(x^{(j)}) ≤ S,   (1)

where S(·) is some sparsity-inducing penalty function on the code vectors, such as the ℓ1-norm. The objective function L controls the reconstruction error while the constraint enforces the sparsity of the representation. However, even a cursory attempt at solving the optimization problem (1) reveals the following obstacles.

The constrained optimization problem (1) involves a non-convex (in fact, bilinear) objective function, as well as potentially non-convex constraints depending on the choice of the sparsity-promoting function S (for example, the ℓ0 function). Hence, obtaining provably correct algorithms for this problem can be challenging. Indeed, the vast majority of practical approaches for sparse coding have been heuristics (Engan et al., 1999; Aharon et al., 2006; Mairal et al., 2009). Recent works in the theoretical machine learning community have bucked this trend, providing provably accurate algorithms if certain assumptions are satisfied (Spielman et al., 2012; Agarwal et al., 2014; Arora et al., 2014a, 2015; Sun et al., 2015; Błasiok and Nelson, 2016; Adamczak, 2016; Chatterji and Bartlett, 2017). However, relatively few of these newer methods have been shown to provide good empirical performance in actual sparse coding problems.

Even if theoretical correctness issues were to be set aside, and we are somehow able to efficiently learn sparse codes of the input data, we often find that applications using such learned sparse codes encounter memory and running-time issues. Indeed, in the overcomplete case, the storage of the learned dictionary D incurs mn = Ω(n²) memory cost, which is prohibitive when n is large. Therefore, in practical applications (such as image analysis) one typically resorts to chopping the data into smaller blocks (e.g., partitioning image data into patches) to make the problem manageable.

A related line of research has been devoted to learning dictionaries that obey some type of structure. Such structural information can be leveraged to incorporate prior knowledge of underlying signals as well as to resolve computational challenges due to the data dimension. For instance, the dictionary is assumed to be separable, or obey a convolutional structure. One such variant is the double-sparse coding problem (Rubinstein et al., 2010b; Sulam et al., 2016) where the dictionary D itself exhibits a sparse structure. To be specific, the dictionary is expressed as

D = ΦA,

i.e., it is composed of a known base dictionary Φ ∈ R^{n×n} and a learned synthesis matrix A ∈ R^{n×m} whose columns are sparse.
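To make the model concrete, the following minimal sketch (in Python) constructs a doubly-sparse dictionary D = ΦA and contrasts the storage of the dense product D with that of its sparse factor A. The sizes, variable names, and the randomly drawn sparse A are illustrative choices of ours, not taken from the paper.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Illustrative sizes (not from the paper): n = 64, m = 128, r = 4.
rng = np.random.default_rng(0)
n, m, r = 64, 128, 4

Phi = np.eye(n)                                # known base dictionary (here: identity)

# Build a column-wise r-sparse synthesis matrix A with unit-norm columns.
rows, cols, vals = [], [], []
for j in range(m):
    support = rng.choice(n, size=r, replace=False)
    coeffs = rng.standard_normal(r)
    coeffs /= np.linalg.norm(coeffs)
    rows.extend(support)
    cols.extend([j] * r)
    vals.extend(coeffs)
A = csc_matrix((vals, (rows, cols)), shape=(n, m))

D = Phi @ A.toarray()                          # the effective dictionary D = Phi A
# Storing D densely costs n*m floats, whereas A needs only m*r values (plus indices),
# and D can always be applied implicitly as Phi @ (A @ x).
print(D.shape, A.nnz)                          # (64, 128) 512
```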
The base dictionary Φ is typically any fixed basis chosen according to domain knowledge, while the synthesis matrix A is column-wise sparse and is to be learned from the data. The basis Φ is typically orthonormal (such as the canonical or wavelet basis); however, there are cases where the base dictionary Φ is overcomplete (Rubinstein et al., 2010b; Sulam et al., 2016). There are several reasons why such the double-sparsity model can be useful. First, the double-sparsity assumption is rather appealing from a conceptual standpoint, since it lets us combine the knowledge of decades of modeling efforts in harmonic analysis with the flexibility of learning new representations tailored to specific data families. Moreover, such a double-sparsity model has computational benefits. If the columns of A are (say) r-sparse (i.e., each column contains no more than r n non-zeroes) then the overall burden of storing, transmitting, and computing with A is much lower than that for general unstructured dictionaries. Finally, such a model lends itself well to interpretable learned features if the atoms of the base dictionary are semantically meaningful. All the above reasons have spurred researchers to develop a series of algorithms to learn doubly-sparse codes (Rubinstein et al., 2010b; Sulam et al., 2016). However, despite their empirical promise, no theoretical analysis of their performance have been reported in the literature and to date, we are unaware of a provably accurate, polynomial-time algorithm for the double-sparse coding problem. Our goal in this paper is precisely to fill this gap. 1.2. Our Contributions In this paper, we provide a new framework for double-sparse coding. To the best of our knowledge, our approach is the first method that enjoys provable statistical and algorithmic guarantees for this problem. In addition, our approach enjoys two benefits: we demonstrate that the method is neurally plausible (i.e., its execution can plausibly be achieved using a neural network architecture) and robust to noise. Inspired by the aforementioned recent theoretical advances in sparse coding, we assume a learning-theoretic setup where the data samples arise from a ground-truth generative model. Informally, suppose there exists a true (but unknown) synthesis matrix A that is column-wise r-sparse, and the ith data sample is generated as: y(i) = ΦA x (i) + noise, i = 1, 2, . . . , p Nguyen, Wong, and Hegde where the code vector x (i) is independently drawn from a distribution supported on the set of k-sparse vectors. We desire to learn the underlying matrix A . Informally, suppose that the synthesis matrix A is incoherent (the columns of A are sufficiently close to orthogonal) and has bounded spectral norm. Finally, suppose that the number of dictionary elements, m, is at most a constant multiple of n. All of these assumptions are standard1. We will demonstrate that the true synthesis matrix A can be recovered (with small error) in a tractable manner as sufficiently many samples are provided. Specifically, we make the following novel contributions: 1. We propose a new algorithm that produces a coarse estimate of the synthesis matrix that is sufficiently close to the ground truth A . Our algorithm builds upon spectral initialization-based ideas that have recently gained popularity in non-convex machine learning (Zhang et al., 2016; Wang et al., 2016). 2. Given the above coarse estimate of the synthesis matrix A , we propose a descentstyle algorithm to refine the above estimate of A . 
This algorithm is simpler than previously studied double-sparse coding algorithms (such as the Trainlets approach of Sulam et al. (2016)), while still giving good statistical performance. Moreover, this algorithm can be realized in a manner amenable to neural implementations. 3. We provide a rigorous analysis of both algorithms. Put together, our analysis produces the first provably polynomial-time algorithm for double-sparse coding. We show that the algorithm provably returns a good estimate of the ground-truth; in particular, in the absence of noise we prove that Ω(mr polylog n) samples are sufficient for a good enough initialization in the first algorithm, as well as guaranteed linear convergence of the descent phase up to a precise error parameter that can be interpreted as the radius of convergence. Indeed, our analysis shows that employing the double-sparsity model helps in this context, and leads to a strict improvement in sample complexity, as well as running time over previous rigorous methods for (regular) sparse coding such as Arora et al. (2015). 4. We also analyze our approach in a more realistic setting with the presence of additive noise and demonstrate its stability. We prove that Ω(mr polylog n) samples are sufficient to obtain a good enough estimate in the initialization, and also to obtain guaranteed linear convergence during descent to provably recover A . 5. We underline the benefit of the double-sparse structure over the regular model by analyzing the algorithms in Arora et al. (2015) under the noisy setting. As a result, we obtain the sample complexity O (mk + σ2 ε mn2 k )polylog n , which demonstrates a negative effect of noise on this approach. 6. We rigorously develop a hard thresholding intialization that extends the spectral scheme in Arora et al. (2015). Additionally, we provide more results for the case where A is orthonormal, sparse dictionary to relax the condition on r, which may be of independent interest. 1. We clarify both the data and the noise model more concretely in Section 2 below. Provably Accurate Double-Sparse Coding Setting Reference Sample (w/o noise) Sample (w/ noise) Time Expt MOD (Engan et al., 1999) K-SVD (Aharon et al., 2006) Spielman et al. (2012) O(n2 log n) eΩ(n4) Arora et al. (2014b) e O(m2/k2) e O(np2) Gribonval et al. (2015a) O(nm3) O(nm3) Arora et al. (2015) e O(mk) e O(mn2p) Double Sparse Double Sparsity (Rubinstein et al., 2010b) Gribonval et al. (2015b) e O(mr) e O(mr) Trainlets (Sulam et al., 2016) This paper e O(mr) e O(mr + σ2 ε mnr k ) e O(mnp) Table 1: Comparison of various sparse coding techniques. Expt: whether numerical experiments have been conducted. in all other columns indicates no provable guarantees. Here, n is the signal dimension, and m is the number of atoms. The sparsity levels for A and x are r and k respectively, and p is the sample size. 7. While our analysis mainly consists of sufficiency results and involves unknown constants hidden in big-O notation, we demonstrate our findings by reporting a suite of numerical experiments on synthetic test datasets. Overall, our approach results in strict improvement in sample complexity, as well as running time, over previous rigorously analyzed methods for (regular) sparse coding, such as Arora et al. (2015). See Table 1 for a detailed comparison. 1.3. Techniques At a high level, our method is an adaptation of the seminal approach of Arora et al. (2015). 
As is common in the statistical learning literature, we assume a ground-truth generative model for the observed data samples, and attempt to estimate the parameters of the generative model given a sufficient number of samples. In our case, the parameters correspond to the synthesis matrix A , which is column-wise r-sparse. The natural approach is to formulate a loss function in terms of A such as Equation (1), and perform gradient descent with respect to the surface of the loss function to learn A . The key challenge in sparse coding is that the gradient is inherently coupled with the codes of the training samples (i.e., the columns of X ), which are unknown a priori. However, the main insight of Arora et al. (2015) is that within a small enough neighborhood of A , a noisy version of X can be estimated, and therefore the overall method is similar to performing approximate gradient descent. Formulating the actual algorithm as a noisy variant of approximate gradient descent allows us to overcome the finite-sample variability Nguyen, Wong, and Hegde of the loss, and obtain a descent property directly related to (the population parameter) A . The second stage of our approach (i.e., our descent-style algorithm) leverages this intuition. However, instead of standard gradient descent, we perform approximate projected gradient descent, such that the column-wise r-sparsity property is enforced in each new estimate of A . Indeed, such an extra projection step is critical in showing a sample complexity improvement over the existing approach of Arora et al. (2015).The key novelty is in figuring out how to perform the projection in each gradient iteration. For this purpose, we develop a novel initialization algorithm that identifies the locations of the non-zeroes in A even before commencing the descent phase. This is nontrivially different from initialization schemes used in previous rigorous methods for sparse coding, and the analysis is somewhat more involved. In Arora et al. (2015), (the principal eigenvector of) a weighted covariance matrix of y (estimated by the weighted average of outer products yiy T i ) is shown to provide a coarse estimate of a dictionary atom. We extend this idea and rigoriously show that the diagonal of the weighted covariance matrix serves as a good indicator of the support of a column in A . The success relies on the concentration of the diagonal vector with dimension n, instead of the covariance matrix with dimensions n n. With the support selected, our scheme only utilizes a reduced weighted covariance matrix with dimensions at most r r. This initialization scheme enables us to effectively reduce the dimension of the problem, and therefore leads to significant improvement in sample complexity and running time over previous (provable) sparse coding methods when the data representation sparsity k is much smaller than m. Further, we rigorously analyze the proposed algorithms in the presence of noise with a bounded expected norm. Our analysis shows that our method is stable, and in the case of i.i.d. Gaussian noise with bounded expected ℓ2-norms, is at least a polynomial factor better than previous polynomial time algorithms for sparse coding. The empirical performance of our proposed method is demonstrated by a suite of numerical experiments on synthetic datasets. In particular, we show that our proposed methods are simple and practical, and improve upon previous provable algorithms for sparse coding. 1.4. 
Paper Organization The remainder of this paper is organized as follows. Section 2 introduces notation, key model assumptions, and informal statements of our main theoretical results. Section 3 outlines our initialization algorithm (along with supporting theoretical results) while Section 4 presents our descent algorithm (along with supporting theoretical results). Section 5 provides a numerical study of the efficiency of our proposed algorithms, and compares it with previously proposed methods. Finally, Section 6 concludes with a short discussion. All technical proofs are relegated to the appendix. Provably Accurate Double-Sparse Coding 2. Setup and Main Results 2.1. Notation We define [m] {1, . . . , m} for any integer m > 1. For any vector x = [x1, x2, . . . , xm]T Rm, we write supp(x) {i [m] : xi = 0} as the support set of x. Given any subset S [m], x S corresponds to the sub-vector of x indexed by the elements of S. For any matrix A Rn m, we use A i and AT j to represent the i-th column and the j-th row respectively. For some appropriate sets R and S, let AR (respectively, A S) be the submatrix of A with rows (respectively columns) indexed by the elements in R (respectively S). In addition, for the i-th column A i, we use AR,i to denote the sub-vector indexed by the elements of R. For notational simplicity, we use AT R to indicate (AR )T , the tranpose of A after a row selection. Besides, we use and sgn( ) to represent the element-wise Hadamard operator and the element-wise sign function respectively. Further, threshold K(x) is a thresholding operator that replaces any elements of x with magnitude less than K by zero. The ℓ2-norm x for a vector x and the spectral norm A for a matrix A appear several times. In some cases, we also utilize the Frobenius norm A F and the operator norm A 1,2 max x 1 1 Ax . The norm A 1,2 is essentially the maximal Euclidean norm of any column of A. For clarity, we adopt asymptotic notations extensively. We write f(n) = O(g(n)) (or f(n) = Ω(g(n))) if f(n) is upper bounded (respectively, lower bounded) by g(n) up to some positive constant. Next, f(n) = Θ(g(n)) if and only if f(n) = O(g(n)) and f(n) = Ω(g(n)). Also eΩand e O represent Ωand O up to a multiplicative poly-logarithmic factor respectively. Finally f(n) = o(g(n)) (or f(n) = ω(g(n))) if limn |f(n)/g(n)| = 0 (limn |f(n)/g(n)| = ). Throughout the paper, we use the phrase with high probability (abbreviated to w.h.p.) to describe an event with failure probability of order at most n ω(1). In addition, g(n) = O (f(n)) means g(n) Kf(n) for some small enough constant K. 2.2. Generative Model of Data Suppose that the observed samples are given by y(i) = Dx (i) + ε, i = 1, . . . , p, i.e., we are given p samples of y generated from a fixed (but unknown) dictionary D where the sparse code x and the error ε are drawn from a joint distribution D specified below. In the double-sparse setting, the dictionary is assumed to follow a decomposition D = ΦA , where Φ Rn n is a known orthonormal basis matrix and A is an unknown, ground truth synthesis matrix. An alternative (and interesting) setting is an overcomplete Φ with a square A , which our analysis below does not cover; we defer this to future work. Our approach relies upon the following assumptions on the synthesis dictionary A : A1 A is overcomplete (i.e., m n) with m = O(n). A2 A is µ-incoherent, i.e., for all i = j, | A i, A j | µ/ n. A3 A i has at most r non-zero elements, and is normalized such that A i = 1 for all i. 
Moreover, |A ij| τ for A ij = 0 and τ = Ω(1/ r). Nguyen, Wong, and Hegde A4 A has bounded spectral norm such that A O( p All these assumptions are standard. In Assumption A2, the incoherence µ is typically of order O(log n) with high probability for a normal random matrix (Arora et al., 2014b). Assumption A3 is a common assumption in sparse signal recovery. The bounded spectral norm assumption is also standard (Arora et al., 2015). In addition to Assumptions A1-A4, we make the following distributional assumptions on D: B1 Support S = supp(x ) is of size at most k and uniformly drawn without replacement from [m] such that P[i S] = Θ(k/m) and P[i, j S] = Θ(k2/m2) for some i, j [m] and i = j. B2 The nonzero entries x S are pairwise independent and sub-Gaussian given the support S with E[x i |i S] = 0 and E[x 2 i |i S] = 1. B3 For i S, |x i | C where 0 < C 1. B4 The additive noise ε has i.i.d. Gaussian entries with variance σ2 ε with σε = O(1/ n). For the rest of the paper, we set Φ = In, the identity matrix of size n. This only simplifies the arguments but does not change the problem because one can study an equivalent model: y = Ax + ε , where y = ΦT y and ε = ΦT ε, as ΦT Φ = In. Due to the Gaussianity of ε, ε also has independent entries. Although this property is specific to Gaussian noise, all the analysis carried out below can be extended to sub-Gaussian noise with minor (but rather tedious) changes in concentration arguments. Our goal is to devise an algorithm that produces a provably good estimate of A . For this, we need to define a suitable measure of goodness . We use the following notion of distance that measures the maximal column-wise difference in ℓ2-norm under some suitable transformation. Definition 1 ((δ, κ)-nearness) A is said to be δ-close to A if there is a permutation π : [m] [m] and a sign flip σ : [m] : { 1} such that σ(i)A π(i) A i δ for every i. In addition, A is said to be (δ, κ)-near to A if A π A κ A also holds. For notational simplicity, in our theorems we simply replace π and σ in Definition 1 with the identity permutation π(i) = i and the positive sign σ( ) = +1 while keeping in mind that in reality we are referring to one element of the equivalence class of all permutations and sign flip transforms of A . We will also need some technical tools from Arora et al. (2015) to analyze our gradient descent-style method. Consider any iterative algorithm that looks for a desired solution z Rn to optimize some function f(z). Suppose that the algorithm produces a sequence of estimates z1, . . . , zs via the update rule: zs+1 = zs ηgs, for some vector gs and scalar step size η. The goal is to characterize good directions gs such that the sequence converges to z under the Euclidean distance. The following gives one such sufficient condition for gs. Provably Accurate Double-Sparse Coding Definition 2 A vector gs at the sth iteration is (α, β, γs)-correlated with a desired solution z if gs, zs z α zs z 2 + β gs 2 γs. We know from convex optimization that if f is 2α-strongly convex and 1/2β-smooth, and gs is chosen as the gradient zf(z), then gs is (α, β, 0)-correlated with z . In our setting, the desired solution corresponds to A , the ground-truth synthesis matrix. In Arora et al. (2015), it is shown that gs = Ey[(Asx y)sgn(x)T ], where x = threshold C/2((As)T y) indeed satisfies Definition 2. This gs is a population quantity and not explicitly available, but one can estimate such gs using an empirical average. 
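For concreteness, such an empirical average can be sketched as follows (Python; the function and variable names are ours). It mirrors the decoding rule x = threshold_{C/2}((A^s)ᵀy) and the outer-product form of g^s above, averaged over p samples.

```python
import numpy as np

def hard_threshold(Z, K):
    """threshold_K: zero out entries of Z with magnitude below K."""
    return Z * (np.abs(Z) >= K)

def empirical_gradient(A, Y, C):
    """Empirical average estimating g = E[(A x - y) sgn(x)^T].

    A : (n, m) current dictionary estimate A^s.
    Y : (n, p) data matrix with one sample y^(i) per column.
    C : lower bound on non-zero code magnitudes; decoding uses threshold C/2.
    """
    X = hard_threshold(A.T @ Y, C / 2.0)      # decode: x^(i) = threshold_{C/2}(A^T y^(i))
    residual = A @ X - Y                      # A x^(i) - y^(i), stacked column-wise
    return (residual @ np.sign(X).T) / Y.shape[1]
```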
The corresponding estimator bgs is a random variable, so we also need a related correlated-with-high-probability condition: Definition 3 A direction bgs at the sth iteration is (α, β, γs)-correlated-w.h.p. with a desired solution z if, w.h.p., bgs, zs z α zs z 2 + β bgs 2 γs. From Definition 2, one can establish a form of descent property in each update step, as shown in Theorem 1. Theorem 1 Suppose that gs satisfies the condition described in Definition 2 for s = 1, 2, . . . , T. Moreover, 0 < η 2β and γ = max T s=1 γs. Then, the following holds for all s: zs+1 z 2 (1 2αη) zs z 2 + 2ηγs. In particular, the above update converges geometrically to z with an error γ/α. That is, zs+1 z 2 (1 2αη)s z0 z 2 + 2γ/α. We can obtain a similar result for Definition 3 except that zs+1 z 2 is replaced with its expectation. Armed with the above tools, we now state some informal versions of our main results: Theorem 2 (Provably correct initialization, informal) There exists a neurally plausible algorithm to produce an initial estimate A0 that has the correct support and is (δ, 2)- near to A with high probability. Its running time and sample complexity are e O(mnp) and e O(mr) respectively. This algorithm works when the sparsity level satisfies r = O (log n). Our algorithm can be regarded as an extension of Arora et al. (2015) to the double-sparse setting. It reconstructs the support of one single column and then estimates its direction in the subspace defined by the support. Our proposed algorithm enjoys neural plausibility by implementing a thresholding non-linearity and Oja s update rule. We provide a neural implementation of our algorithm in Appendix G. The adaption to the sparse structure results in a strict improvement upon the original algorithm both in running time and sample complexity. However, our algorithm is limited to the sparsity level r = O (log n), which is rather small but plausible from the modeling standpoint. For comparison, we analyze a natural extension of the algorithm of Arora et al. (2015) with an extra hard-thresholding Nguyen, Wong, and Hegde step for every learned atom. We obtain the same order restriction on r, but somewhat worse bounds on sample complexity and running time. The details are found in Appendix F. We hypothesize that a stronger incoherence assumption can lead to provably correct initialization for a much wider range of r. For purposes of theoretical analysis, we consider the special case of a perfectly incoherent synthesis matrix A such that µ = 0 and m = n. In this case, we can indeed improve the sparsity parameter to r = O min( n log2 n, n k2 log2 n) , which is an exponential improvement. This analysis is given in Appendix E. The next theorem summarizes our result for the descent algorithm: Theorem 3 (Provably correct descent, informal) There exists a neurally plausible algorithm for double-sparse coding that converges to A with geometric rate when the initial estimate A0 has the correct support and (δ, 2)-near to A . The running time per iteration is O(mkp + mrp) and the sample complexity is e O(m + σ2 ε mnr Similar to Arora et al. (2015), our proposed algorithm enjoys neural plausibility. Moreover, we can achieve a better running time and sample complexity per iteration than previous methods, particularly in the noisy case. We show in Appendix F that in this regime the sample complexity of Arora et al. (2015) is e O(m + σ2 ε mn2 k ). For instance, when σε n 1/2, the sample complexity bound is significantly worse than e O(m) in the noiseless case. 
In contrast, our proposed method leverages the sparse structure to overcome this problem and obtain improved results.

We are now ready to introduce our methods in detail. As discussed above, our approach consists of two stages: an initialization algorithm that produces a coarse estimate of A*, and a descent-style algorithm that refines this estimate to accurately recover A*.

3. Stage 1: Initialization

In this section, we present a neurally plausible algorithm that can produce a coarse initial estimate of the ground truth A*. We give a neural implementation of the algorithm in Appendix G. Our algorithm is an adaptation of the algorithm in Arora et al. (2015). The idea is to estimate dictionary atoms in a greedy fashion by iteratively re-weighting the given samples. The samples are re-scaled in such a way that the first singular value of the weighted (sample) covariance matrix is dominant, and the corresponding eigenvector is close to one particular atom with high probability. However, while this algorithm is conceptually very appealing, it incurs severe computational costs in practice. More precisely, the overall running time is Õ(mn²p) in expectation, which is unrealistic for large-scale problems.

To overcome this burden, we leverage the double-sparsity assumption in our generative model to obtain a more efficient approach. The high-level idea is to first estimate the support of each column in the synthesis matrix A*, and then obtain a coarse estimate of the nonzero coefficients of each column based on knowledge of its support. The key ingredient of our method is a novel spectral procedure that gives us an estimate of the column supports purely from the observed samples. The full algorithm, which we call Truncated Pairwise Reweighting, is listed in pseudocode form as Algorithm 1.

Algorithm 1: Truncated Pairwise Reweighting
  Initialize L = ∅.
  Randomly divide the p samples into two disjoint sets P1 and P2 of sizes p1 and p2 respectively.
  While |L| < m:
    Pick u and v from P1 at random.
    For every l = 1, 2, . . . , n, compute ê_l = (1/p2) Σ_{i=1}^{p2} ⟨y^{(i)}, u⟩⟨y^{(i)}, v⟩ (y_l^{(i)})².
    Sort (ê_1, ê_2, . . . , ê_n) in descending order.
    If there exists r′ ≤ r such that ê_{(r′)} ≥ Ω∗(k/mr) and ê_{(r′+1)}/ê_{(r′)} < O∗(r/log² n):
      Let R̂ be the set of the r largest entries of ê.
      Form M̂_{u,v} = (1/p2) Σ_{i=1}^{p2} ⟨y^{(i)}, u⟩⟨y^{(i)}, v⟩ y^{(i)}_{R̂} (y^{(i)}_{R̂})ᵀ.
      δ1, δ2 ← top two singular values of M̂_{u,v}.
      z_{R̂} ← top singular vector of M̂_{u,v} (z is zero outside R̂).
      If δ1 ≥ Ω(k/m) and δ2 < O∗(k/(m log n)):
        If dist(±z, l) > 1/log n for all l ∈ L: update L = L ∪ {z}.
  Return A0 = (L1, . . . , Lm).

Let us provide some intuition for our algorithm. Fix a sample y = A*x* + ε from the available training set, and consider two other samples u = A*α + ε_u and v = A*α′ + ε_v. Now, consider the (very coarse) estimate for the sparse code of u with respect to A*:

β = A*ᵀu = A*ᵀA*α + A*ᵀε_u.

As long as A* is incoherent enough and ε_u is small, the estimate β behaves just like α, in the sense that for each sample y:

⟨y, u⟩ ≈ ⟨x*, β⟩ ≈ ⟨x*, α⟩.

Moreover, the above inner products are large only if α and x* share some elements in their supports; otherwise, they are likely to be small. Likewise, the weight ⟨y, u⟩⟨y, v⟩ depends on whether or not x* shares its support with both α and α′. Now, suppose that we have a mechanism to isolate pairs u and v that share exactly one atom among their sparse representations. Then, by scaling each sample y with an increasing function of ⟨y, u⟩⟨y, v⟩ and linearly adding the samples, we magnify the importance of the samples that are aligned with that atom, and diminish the rest.
The final direction can be obtained via the top principal component of the reweighted samples and hence can be used as a coarse estimate of the atom. This is exactly the approach adopted in Arora et al. (2015). However, in our double-sparse coding setting, we know that the estimated atom Nguyen, Wong, and Hegde should be sparse as well. Therefore, we can naturally perform an extra sparsification step of the output. An extended algorithm and its correctness are provided in Appendix F. However, as we discussed above, the computational complexity of the re-weighting step still remains. We overcome this obstacle by first identifying the locations of the nonzero entries in each atom. Specifically, define the matrix: i=1 y(i), u y(i), v y(i)y(i)T . Then, the diagonal entries of Mu,v reveals the support of the atom of A shared among u and v: the r-largest entries of Mu,v will correspond to the support we seek. Since the desired direction remains unchanged in the r-dimensional subspace of its nonzero elements, we can restrict our attention to this subspace, construct a reduced covariance matrix c Mu,v, and proceed as before. This truncation step alleviates the computational burden by a significant amount; the running time is now e O(mnp), which improves the original by a factor of n. The success of the above procedure relies upon whether or not we can isolate pairs u and v that share one dictionary atom. Fortunately, this can be done via checking the decay of the singular values of the (reduced) covariance matrix. Here too, we show via our analysis that the truncation step plays an important role. Overall, our proposed algorithm not only accelerates the initialization in terms of running time, but also improves the sample complexity over Arora et al. (2015). The performance of Algorithm 1 is described in the following theorem, whose formal proof is deferred to Appendix B. Theorem 4 Suppose that Assumptions B1-B4 hold and Assumptions A1-A3 satify with µ = O n k log3 n and r = O (log n). When p1 = eΩ(m) and p2 = eΩ(mr), then with high probability Algorithm 1 returns an initial estimate A0 whose columns share the same support as A and with (δ, 2)-nearness to A with δ = O (1/ log n). The limit on r arises from the minimum non-zero coefficient τ of A . Since the columns of A are standardized, τ should degenerate as r grows. In other words, it is getting harder to distinguish the signal coefficients from zero as r grows with n. However, this limitation can be relaxed when a better incoherence available, for example the orthonormal case. We study this in Appendix E. To provide some intuition about the working of the algorithm (and its proof), let us analyze it in the case where we have access to infinite number of samples. This setting, of course, is unrealistic. However, the analysis is much simpler and more transparent since we can focus on expected values rather than empirical averages. Moreover, the analysis reveals several key lemmas, which we will reuse extensively for proving Theorem 4. First, we give some intuition behind the definition of the scores , bel. Lemma 1 Fix samples u and v and suppose that y = A x + ε is a random sample independent of u, v. The expected value of the score for the ℓth component of y is given by: el E[ y, u y, v y2 l ] = X i U V qiciβiβ i A 2 li + perturbation terms Provably Accurate Double-Sparse Coding where qi = P[i S], qij = P[i, j S] and ci = E[x4 i |i S]. Moreover, the perturbation terms have absolute value at most O (k/m log n). 
From Assumption B1, we know that qi = Θ(k/m), qij = Θ(k2/m2) and ci = Θ(1). Besides, we will show later that |βi| |αi| = Ω(1) for i U, and |βi| = o(1) for i / U. Consider the first term E0 = P i U V qiciβiβ i A 2 li . Clearly, E0 = 0 if U V = or that l does not belong to support of any atom in U V . On the contrary, as E0 = 0 and U V = {i} , then E0 = |qiciβiβ i A 2 li | Ω(τ 2k/m) = Ω(k/mr) since |qiciβiβ i| Ω(k/m) and |A li| τ. Therefore, Lemma 1 suggests that if u and v share a unique atom among their sparse representations, and r is not too large, then we can indeed recover the correct support of the shared atom. When this is the case, the expected scores corresponding to the nonzero elements of the shared atom will dominate the remaining of the scores. Now, given that we can isolate the support R of the corresponding atom, the remaining questions are how best we can estimate its non-zero coefficients, and when u and v share a unique elements in their supports. These issues are handled in the following lemmas. Lemma 2 Suppose that u = A α + εu and v = A α + εv are two random samples. Let U and V denote the supports of α and α respectively. R is the support of some atom of interest. The truncated re-weighting matrix is formulated as Mu,v E[ y, u y, v y Ry T R] = X i U V qiciβiβ i A R,i A T R,i + perturbation terms where the perturbation terms have norms at most O (k/m log n). Using the same argument for bounding E0 in Lemma 1, we can see that M0 qiciβiβ i A R,i A T R,i has norm at least Ω(k/m) when u and v share a unique element i ( A R,i = 1). According to this lemma, the spectral norm of M0 dominates those of the other perturbation terms. Thus, given R we can use the first singular vector of Mu,v as an estimate of A i. Lemma 3 Under the setup of Theorem 4, suppose u = A α + εu and v = A α + εv are two random samples with supports U and V respectively. R = supp(A i ). If u and v share the unique atom i, the first r largest entries of el is at least Ω(k/mr) and belong to R. Moreover, the top singular vector of Mu,v is δ-close to A R,i for O (1/ log n). Proof The recovery of A i s support directly follows Lemma 1. For the latter part, recall from Lemma 2 that Mu,v = qiciβiβ i A R,i A T R,i + perturbation terms The perturbation terms have norms bounded by O (k/m log n). On the other hand, the first term is has norm at least Ω(k/m) since A R,i = 1 for the correct support R and |qiciβiβ i| Ω(k/m). Then using Wedin s Theorem to Mu,v, we can conclude that the top singular vector must be O (k/m log n)/Ω(k/m) = O (1/ log n) -close to A R,i. Nguyen, Wong, and Hegde Lemma 4 Under the setup of Theorem 4, suppose u = A α + εu and v = A α + εv are two random samples with supports U and V respectively. If the top singular value of Mu,v is at least Ω(k/m) and the second largest one is at most O (k/m log n), then u and v share a unique dictionary element with high probability. Proof The proof follows from that of Lemma 37 in Arora et al. (2015). The main idea is to separate the possible cases of how u and v share support and to use Lemma 2 with the bounded perturbation terms to conclude when u and v share exactly one. We note that due to the condition where be(s) Ω(k/mr) and be(s+1)/be(s) O (r/ log n), it must be the case that u and v share only one atom or share more than one atoms with the same support. When their supports overlap more than one, then the first singular value cannot dominate the second one, and hence it must not be the case. 
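Putting Lemmas 1-4 together, the inner step of Algorithm 1 for a fixed pair (u, v) can be sketched as follows (Python; the threshold arguments thr_* stand in for the Ω(·) and O∗(·) constants above, and all names are ours).

```python
import numpy as np

def candidate_atom(Y2, u, v, r, thr_score, thr_gap, thr_sv1, thr_sv2):
    """One inner iteration of truncated pairwise reweighting for a fixed pair (u, v).

    Y2 : (n, p2) samples used for the reweighted statistics.
    u, v : two samples drawn from the other split P1.
    r  : column sparsity of the synthesis matrix.
    Returns a candidate sparse atom of length n, or None if the checks fail.
    """
    n, p2 = Y2.shape
    w = (Y2.T @ u) * (Y2.T @ v)                    # weights <y^(i), u><y^(i), v>
    e_hat = (Y2 ** 2) @ w / p2                     # scores e_hat_l, l = 1, ..., n
    order = np.argsort(e_hat)[::-1]
    e_sorted = e_hat[order]
    # Check for a large leading score followed by a sharp drop within the first r entries.
    gap_found = any(
        e_sorted[t] >= thr_score and e_sorted[t + 1] / e_sorted[t] < thr_gap
        for t in range(r)
    )
    if not gap_found:
        return None
    R = order[:r]                                  # estimated support of the shared atom
    YR = Y2[R, :]
    M = (YR * w) @ YR.T / p2                       # reduced weighted covariance (r x r)
    U, s, _ = np.linalg.svd(M)
    s2 = s[1] if s.size > 1 else 0.0
    if s[0] >= thr_sv1 and s2 < thr_sv2:           # u and v likely share exactly one atom
        z = np.zeros(n)
        z[R] = U[:, 0]
        return z
    return None
```

The outer loop of Algorithm 1 simply repeats this with fresh pairs (u, v) from P1 and keeps a candidate only if it is at least 1/log n away from every atom already in the list L.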
Similar to (Arora et al., 2015), our initialization algorithm requires e O(m) iterations in expectation to estimate all the atoms, hence the expected running time is e O(mnp). All the proofs of Lemma 1 and 2 are deferred to Appendix B. 4. Stage 2: Descent We now adapt the neural sparse coding approach of Arora et al. (2015) to obtain an improved estimate of A . As mentioned earlier, at a high level the algorithm is akin to performing approximate gradient descent. The insight is that within a small enough neighborhood (in the sense of δ-closeness) of the true A , an estimate of the ground-truth code vectors, X , can be constructed using a neurally plausible algorithm. The innovation, in our case, is the double-sparsity model since we know a priori that A is itself sparse. Under sufficiently many samples, the support of A can be deduced from the initialization stage; therefore we perform an extra projection step in each iteration of gradient descent. In this sense, our method is non-trivially different from Arora et al. (2015). The full algorithm is presented as Algorithm 2. As discussed in Section 2, convergence of noisy approximate gradient descent can be achieved as long as bgs is correlated-w.h.p. with the true solution. However, an analogous convergence result for projected gradient descent does not exist in the literature. We fill this gap via a careful analysis. Due to the projection, we only require the correlatedw.h.p. property for part of bgs (i.e., when it is restricted to some support set) with A . The descent property is still achieved via Theorem 5. Due to various perturbation terms, bg is only a biased estimate of AL(A, X); therefore, we can only refine the estimate of A until the column-wise error is of order O( p k/n). The performance of Algorithm 2 can be characterized via the following theorem. Theorem 5 Suppose that the initial estimate A0 has the correct column supports and is (δ, 2)-near to A with δ = O (1/ log n). If Algorithm 2 is provided with p = eΩ(mr) fresh samples at each step and η = Θ(m/k), then E[ As i A i 2] (1 ρ)s A0 i A i 2 + O( p for some 0 < ρ < 1/2 and for s = 1, 2, . . . , T. Consequently, As converges to A geometrically until column-wise error O( p Provably Accurate Double-Sparse Coding Algorithm 2 Double-Sparse Coding Descent Algorithm Initialize A0 is (δ, 2)-near to A . H = (hij)n m where hij = 1 if i supp(A0 j) and 0 otherwise. Repeat for s = 0, 1, . . . , T Decode: x(i) = threshold C/2((As)T y(i)) for i = 1, 2, . . . , p Update: As+1 = PH(As ηbgs) = As ηPH(bgs) where bgs = 1 p Pp i=1(Asx(i) y(i))sgn(x(i))T and PH(G) = H G We defer the full proof of Theorem 5 to Section D. In this section, we take a step towards understanding the algorithm by analyzing bgs in the infinite sample case, which is equivalent to its expectation gs E[(Asx y)sgn(x)T ]. We establish the (α, β, γs)-correlation of a truncated version of gs i with A i to obtain the descent in Theorem 6 for the infinite sample case. Theorem 6 Suppose that the initial estimate A0 has the correct column supports and is (δ, 2)-near to A . If Algorithm 2 is provided with infinite number of samples at each step and η = Θ(m/k), then As+1 i A i 2 (1 ρ) As i A i 2 + O k2/n2 for some 0 < ρ < 1/2 and for s = 1, 2, . . . , T. Consequently, it converges to A geometrically until column-wise error is O(k/n). Note that the better error O(k2/n2) is due to the fact that infinitely many samples are given. 
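For reference, one update of the descent stage (Algorithm 2) can be written compactly as follows (Python; a sketch with names of our choosing). The mask H fixes the column supports obtained from the initialization, so the update A^{s+1} = P_H(A^s − η ĝ^s) never touches entries outside supp(A0).

```python
import numpy as np

def descent_step(A, Y, H, C, eta):
    """One step of the projected approximate gradient descent in Algorithm 2.

    A   : (n, m) current estimate A^s with the correct column supports.
    Y   : (n, p) fresh samples drawn for this step.
    H   : (n, m) 0/1 mask with h_ij = 1 iff i is in supp(A0_j).
    C   : lower bound on non-zero code magnitudes (decode threshold is C/2).
    eta : step size, of order m/k in the analysis.
    """
    X = A.T @ Y
    X[np.abs(X) < C / 2.0] = 0.0                    # decode: threshold_{C/2}((A^s)^T y)
    g_hat = (A @ X - Y) @ np.sign(X).T / Y.shape[1] # empirical gradient estimate
    return A - eta * (H * g_hat)                    # A^{s+1} = P_H(A^s - eta * g_hat)
```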
The term O( p k/n) in Theorem 5 is a trade-offbetween the accuracy and the sample complexity of the algorithm. The proof of this theorem composes of two steps with two main results: 1) an explicit form of gs (Lemma 5); 2) (α, β, γs)-correlation of column-wise gs with A (Lemma 6). The proof of those lemmas are deferred to Appendix C. Since the correlation primarily relies on the (δ, 2)-nearness of As to A that is provided initially and maintained at each step, then we need to argue that the nearness is preserved after each step. Lemma 5 Suppose that the initial estimate A0 has the correct column supports and is (δ, 2)-near to A . The column-wise update has the form gs R,i = piqi(λs i As R,i A R,i + ξs i ζ) where R = supp(As i), λs i = As i, A i and ξs i = As R, idiag(qij)(As i)T A i/qi. Moreover, ξi has norm bounded by O(k/n) for δ = O (1/ log n) and ζ is negligible. We underline that the correct support of As allows us to obtain the closed-form expression of gs Ri,i in terms of As i and A i. Likewise, the gradient form suggests that gs i is almost equal to piqi(As i A i) (since λs i 1), which directs to the desired solution A i. With Lemma 5, we will prove the (α, β, γs)-correlation of the approximate gradient to each column A i and the nearness of each new update to the true solution A . Nguyen, Wong, and Hegde 4.1. (α, β, γs)-Correlation Lemma 6 Suppose that As to be (δ, 2)-near to A and R = supp(A i), then 2gs R,i is (α, 1/2α, ϵ2/α)-correlated with A R,i; that is 2gs R,i, As R,i A R,i α As R,i A R,i 2 + 1/(2α) gs R,i 2 ϵ2/α. where δ = O (1/ log n) and ϵ = O k2 mn . Futhermore, the descent is achieved by As+1 i A i 2 (1 2αη)s A0 i A i 2 + ηϵ2/α. Proof Throughout the proof, we omit the superscript s for simplicity and denote 2α = piqi. First, we rewrite gs i as a combination of the true direction As i A i and a term with small norm: g R,i = 2α(AR,i A R,i) + v, (2) where v = 2α[(λi 1)A i + ϵi] with norm bounded. In fact, since A i is δ-close to A i, and both have unit norm, then 2α(λi 1)A i = α A i A i 2 α A i A i and ξi O(k/n) from the inequality (9). Therefore, v = 2α(λi 1)AR,i + 2αξi α AR,i A R,i + ϵ where ϵ = O(k2/mn). Now, we make use of (2) to show the first part of Lemma 6: 2g R,i, AR,i A R,i = 4α AR,i A R,i 2 + 2v, AR,i A R,i . (3) We want to lower bound the inner product term with respect to g Ri,i 2 and AR,i A R,i 2. Effectively, from (2) 4α v, A i A i = g R,i 2 4α2 AR,i A R,i 2 v 2 g R,i 2 6α2 AR,i A R,i 2 2ϵ2, (4) where the last step is due to Cauchy-Schwarz inequality: v 2 2(α2 AR,i A R,i 2 + ϵ2). Substitute 2 v, A i A i in (3) for the right hand side of (4), we get the first result: 2g R,i, AR,i A R,i α AR,i A R,i 2 + 1 2α g R,i 2 ϵ2 The second part is directly followed from Theorem 1. Moreover, we have pi = Θ(k/m) and qi = Θ(1), then α = Θ(k/m), β = Θ(m/k) and γs = O(k3/mn2). Then gs R,i is (Ω(k/m), Ω(m/k), O(k3/mn2))-correlated with the true solution A R,i. Proof [Proof of Theorem 6] The descent in Theorem 6 directly follows from the above lemma. 
Next, we will establish that nearness is maintained for the update at step s.

Figure 1: (top) The performance of the four methods on three metrics (recovery rate, reconstruction error, and running time in seconds) as a function of sample size in the noiseless case. (bottom) The same metrics measured for the noisy case.

4.2. Nearness

Lemma 7 Suppose that A^s is (δ, 2)-near to A*. Then ‖A^{s+1} − A*‖ ≤ 2‖A*‖.

Proof From Lemma 5 we have g^s_i = p_i q_i (λ^s_i A^s_i − A*_i) + A^s_{−i} diag(q_{ij}) (A^s_{−i})ᵀ A*_i ± ζ. Denote R̄ = [n]\R; then g^s_{R̄,i} = A^s_{R̄,−i} diag(q_{ij}) (A^s_{−i})ᵀ A*_i ± ζ, which is bounded in norm by O(k²/m²). We then follow the proof of Lemma 24 in Arora et al. (2015) for the nearness, applied to the full gradient g^s combining the components on R and R̄, to finish the proof of this lemma.

In sum, we have shown the descent property of Algorithm 2 in the infinite-sample case. The study of the concentration of ĝ^s around its mean, which yields the sample complexity, is provided in Appendix D. In the next section, we corroborate our theory with numerical results on synthetic data.

5. Empirical Study

We compare our method with three other methods for both standard sparse and double-sparse coding. As a baseline, we implement the algorithm proposed in Arora et al. (2015), which currently is the most theoretically sound approach for provable sparse coding. However, since their approach is not directly designed for the double-sparsity model, we also implement a modified version that performs a hard-thresholding (HT)-based post-processing step in the initialization and learning procedures (which we dub Arora + HT). The final comparison is with Trainlets, the approach proposed by Sulam et al. (2016).

Figure 2: The performance of our method in the noiseless case as the sparsity k varies (k = 5, 6, 7, 8, 9).

Figure 3: The performance of our method in the noiseless case as the thresholding parameter C varies (C = 0.4, 0.6, 0.8, 1.0, 1.2).

We generate a synthetic training dataset according to the model described in Section 2. The base dictionary Φ is the identity matrix of size n = 64, and the square synthesis matrix A* has a special block structure with 32 blocks. Each block is of size 2×2 and of the form [1 1; 1 1] (i.e., the column sparsity of A* is r = 2). The support of x* is drawn uniformly over all 6-dimensional subsets of [m], and the nonzero coefficients are randomly set to ±1 with equal probability. In our simulations with noise, we add Gaussian noise ε with entrywise variance σ²_ε = 0.01 to each of the above samples. For all the approaches except Trainlets, we use T = 2000 iterations for the initialization procedure, and set the number of steps in the descent stage to 25. Since Trainlets does not have a specified initialization procedure, we initialize it with a random Gaussian matrix upon which column-wise sparse thresholding is then performed. The learning step of Trainlets² is executed for 50 iterations, which compensates for its initialization deficiency.
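For reproducibility, the synthetic data described above can be generated roughly as follows (a Python sketch). The paper specifies 2×2 blocks of the form [1 1; 1 1]; the sign pattern and the 1/√2 scaling used below are our own choices to make the columns unit-norm and distinct, so they should be read as assumptions rather than the paper's exact construction.

```python
import numpy as np

def make_synthetic_data(p, n=64, k=6, sigma_eps=0.1, seed=0):
    """Draw p samples y = Phi A* x* + eps with Phi = I_n, following Section 5.

    A* is block-diagonal with 32 blocks of size 2x2 (column sparsity r = 2); the
    exact block entries below are illustrative. Codes x* are k-sparse with +/-1
    non-zeros; eps is i.i.d. Gaussian with entrywise variance sigma_eps**2.
    """
    rng = np.random.default_rng(seed)
    m = n
    block = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)   # unit-norm columns
    A_star = np.zeros((n, m))
    for b in range(n // 2):
        A_star[2 * b:2 * b + 2, 2 * b:2 * b + 2] = block
    X_star = np.zeros((m, p))
    for j in range(p):
        support = rng.choice(m, size=k, replace=False)
        X_star[support, j] = rng.choice([-1.0, 1.0], size=k)
    noise = sigma_eps * rng.standard_normal((n, p))
    return A_star @ X_star + noise, A_star, X_star
```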
For each Monte Carlo trial, we uniformly draw p samples, feed these samples to the four different algorithms, and observe their ability to reconstruct A*. A Matlab implementation of our algorithms is available online³.

2. We utilize the Trainlets implementation provided at http://jsulam.cswp.cs.technion.ac.il/home/software/.
3. https://github.com/thanh-isu/double-sparse-coding

5.1. Comparison with Other Approaches

We evaluate these approaches on three metrics as a function of the number of available samples: (i) the fraction of trials in which each algorithm successfully recovers the ground truth A*; (ii) the reconstruction error; and (iii) the running time (in seconds). The synthesis matrix is said to be successfully recovered if the Frobenius norm of the difference between the estimate Â and the ground truth A* is smaller than a threshold, which is set to 10⁻⁴ in the noiseless case and to 0.5 in the noisy case. All three metrics are averaged over 100 Monte Carlo simulations.

As discussed above, the Frobenius norm is only meaningful under a suitable permutation and sign-flip transformation linking Â and A*. We estimate this transformation using a simple maximum-weight matching algorithm. Specifically, we construct a weighted bipartite graph with nodes representing columns of A* and Â and adjacency matrix defined as G = |A*ᵀÂ|, where |·| is taken element-wise. We compute the optimal matching using the Hungarian algorithm, and then estimate the sign flips by looking at the sign of the inner products between the matched columns.

The results of our experiments are shown in Figure 1, with the top and bottom rows respectively for the noiseless and noisy cases. The two leftmost plots suggest that all algorithms exhibit a phase transition in sample complexity that occurs in the range of 500-2000 samples. In the noiseless case, our method achieves the phase transition with the fewest number of samples. In the noisy case, our method nearly matches the best sample complexity performance (next to Trainlets, which is a heuristic and computationally expensive). Our method achieves the best performance in terms of (wall-clock) running time in all cases.

5.2. Robustness to Data Assumptions

In this last experiment, we show that our approach is robust to the data assumptions. We numerically study how the initialization and descent algorithms behave when the sparsity k and the thresholding parameter C vary slightly around the ground-truth values. Since our focus is on the recovery property of our approach, we assume that the dictionary size m and the sparsity r are known a priori and do not experiment with them. The results are shown in Figures 2 and 3. When the sparsity and the minimum coefficient are close to the true setting, k_model = 6 and C_model = 1.0, our algorithm is still able to recover the dictionary perfectly. When these parameters are set to more extreme values, the phase transition is less pronounced but is gradually achieved with more and more samples.

6. Conclusion

In this paper, we have addressed an open theoretical question on learning sparse dictionaries under a special type of generative model. Our proposed algorithm consists of a novel initialization step followed by a descent-style step, both of which take advantage of the sparse structure. We rigorously demonstrate its efficacy in both sample and computational complexity over existing heuristics as well as provable approaches for double-sparse and regular sparse coding.
This results in the first known provable approach for double-sparse coding problem with statistical and algorithmic guarantees. Besides, we also show three Nguyen, Wong, and Hegde benefits of our approach: neural plausibility, robustness to noise and practical usefulness via the numerical experiments. Nevertheless, several fundamental questions regarding our approach remain. First, our initialization method (in the overcomplete case) achieves its theoretical guarantees under fairly stringent limitations on the sparsity level r. This arises due to our reweighted spectral initialization strategy, and it is an open question whether a better initialization strategy exists (or whether these types of initialization are required at all). Second, our analysis holds for complete (fixed) bases Φ, and it remains open to study the setting where Φ is over-complete. Finally, understanding the reasons behind the very promising practical performance of methods based on heuristics, such as Trainlets, on real-world data remains a very challenging open problem. Acknowledgments We would like to thank the anonymous reviewers for the constructive feedback and suggestions. This work is supported in part by the National Science Foundation under the grants CCF-1566281 and DMS-1612985. Provably Accurate Double-Sparse Coding Appendix Organization We organize the appendix as follows: we prove the two key lemmas for Theorem 4 of the initialization algorithm 1 in Appendix B. In Appendix C, we prove the result stated in Theorem 5 for the infinite-sample case. The sample complexity results for both stages are proved in Appendix D. Additionally, we prove some extended results from Arora et al. (2015) and for some special cases in Appendices E and F. The final section details the neural implementation of our approach. Appendix A. Useful Result We start our proof with the following claim, which we will use throughout. Claim 1 (Maximal row ℓ1-norm) Given that A 2 F = m and A = O( p m/n), then A T 1,2 = Θ( p Proof Recall the definition of the operator norm: A T 1,2 = sup x =0 x 1 sup x =0 x = A T = O( p Since A 2 F = m, A T 1,2 A F / n = p m/n. Combining with the above, we have A T 1,2 = Θ( p m/n). Along with Assumptions A1 and A3, the above claim implies the number of nonzero entries in each row is O(r). This Claim is an important ingredient in our analysis of our initialization algorithm shown in Section 3. Appendix B. Analysis of Initialization Algorithm B.1. Proof of Lemma 1 Recall some important notations: y = A x + ε and two samples u = A α + εu, v = A α + εv. Also, recall the very coarse estimate for the sparse code of u with respect to A : β = A T u = A T A α + A T εu. We split the proof of Lemma 1 into three steps: 1) we first establish useful properties of β with respect to α; 2) we then explicitly derive el in terms of the generative model parameters and β; and 3) we finally bound the error terms in E based on the first result and appropriate assumptions. Claim 2 In the generative model, x e O( k) and ε e O(σε n) with high probability. Proof The claim directly follows from the fact that x is a k-sparse random vector whose nonzero entries are independent sub-Gaussian with variance 1. Meanwhile, ε has n independent Gaussian entries of variance σ2 ε. Despite its simplicity, this claim will be used in many proofs throughout the paper. Note also that in this section we will calculate the expectation over y and often refer probabilistic bounds (w.h.p.) under the randomness of u and v. 
Nguyen, Wong, and Hegde Claim 3 Suppose that u = A α+εu is a random sample and U = supp(α). Let β = A T u, then, w.h.p., we have (a) |βi αi| µk log n n +σε log n for each i and (b) β e O( Proof The proof mostly follows from Claim 36 of Arora et al. (2015), with an additional consideration of the error εu. Write W = U\{i} and observe that |βi αi| = |A T i A W αW + A T i εu| | A T W A i, αW | + | A i, εu | Since A is µ-incoherence, then A T i A W µ p k/n. Moreover, αW has k 1 independent sub-Gaussian entries of variance 1, therefore | A T W A i, αW | µk log n n with high probability. Also recall that εu has independent Gaussian entries of variance σ2 ε, then A T i εu is Gaussian with the same variance ( A i = 1). Hence |A T i ε| σε log n with high probability. Consequently, |βi αi| µk log n n + σε log n, which is the first part of the claim. Next, in order to bound β , we express β as β = A T A UαU + A T εu A A U αU + A εu Using Claim 2 to get αU e O( k) and εu e O(σε n) w.h.p., and further noticing that A U A O(1) , we complete the proof for the second part. Claim 3 suggests that the difference between βi and αi is bounded above by O (1/ log2 n) w.h.p. if µ = O ( n k log3 n). Therefore, w.h.p., C o(1) |βi| |αi| + o(1) O(log m) for i U and |βi| O (1/ log2 n) otherwise. On the other hand, under Assumption B4, β e O( k) w.h.p. We will use these results multiple times in the next few proofs. Proof [Proof of Lemma 1] We decompose dl into small parts so that the stochastic model D is made use. el = E[ y, u y, v y2 l ] = E[ A x + ε, u A x + ε, v ( A l , x + εl)2] = E x , β x , β + x T (βv T + β u T )ε + u T εεT v A l , x 2 + 2 A l , x εl + ε2 l = E1 + E2 + + E9 where the terms are E1 = E[ x , β x , β A l , x 2] E2 = 2E[ x , β x , β A l , x εl] E3 = E[ x , β x , β ε2 l ] E4 = E A l , x 2x T (βv T + β u T )ε E5 = E A l , x x T (βv T + β u T )εεl E6 = E (βv T + β u T )εε2 l E7 = E[u T εεT v A l , x 2] E8 = 2E[u T εεT v A l , x εl] E9 = E[u T εεT vε2 l ] Provably Accurate Double-Sparse Coding Because x and ε are independent and zero-mean, E2 and E4 are clearly zero. Moreover, E6 = (βv T + β u T )E[εε2 l ] = 0 due to the fact that E[εjε2 l ] = 0, for j = l, and E[ε3 l ] = 0. Also, E8 = A T l E[x ]E u T εεT vεl = 0. We bound the remaining terms separately in the following claims. Claim 4 In the decomposition (5), E1 is of the form i U V qiciβiβ i A 2 li + X i/ U V qiciβiβ i A 2 li + X j =i qij(βiβ i A 2 lj + 2βiβ j A li A lj) where all those terms except P i U V qiciβiβ i A 2 li have magnitude at most O (k/m log2 n) w.h.p. Proof Using the generative model in Assumptions B1-B4, we have E1 = E[ x , β x , β A l , x 2] = ES Ex |S[ X i S βix i X i S β ix i X i S A lix i 2] i [m] qiciβiβ i A 2 li + X i,j [m],j =i qij(βiβ i A 2 lj + 2βiβ j A li A lj) i U V qiciβiβ i A 2 li + X i/ U V qiciβiβ i A 2 li + X j =i qij(βiβ i A 2 lj + 2βiβ j A li A lj), where we have used the qi = P[i S], qij = P[i, j S] and ci = E[x4 i |i S] and Assumptions B1-B4. We now prove that the last three terms are upper bounded by O (k/m log n). The key observation is that all these terms typically involve a quadratic form of the l-th row A l whose norm is bounded by O(1) (by Claim 1 and Assumption A4). Moreover, |βiβ i| is relatively small for i / U V while qij = Θ(k2/m2). For the second term, we apply the Claim 3 for i [m]\(U V ) to bound |βiβ i| . 
Assume αi = 0 and α i = 0, then with high probability |βiβ i| |(βi αi)(β i α i)| + |βiα i| O (1/ log n) Using the bound qici = Θ(k/m), we have w.h.p., X i/ U V qiciβiβ i A 2 li max i |qiciβiβ i| X i/ U V A 2 li max i |qiciβiβ i| A 2 1,2 O (k/m log n). For the third term, we make use of the bounds on β and β from the previous claim where β β e O(k) w.h.p., and on qij = Θ(k2/m2). More precisely, w.h.p., X j =i qijβiβ i A 2 lj = X j =i qij A 2 lj X i |βiβ i| X j =i qij A 2 lj (max i =j qij) X i |βiβ i| X j A 2 lj (max i =j qij) β β A 2 1,2 e O(k3/m2), Nguyen, Wong, and Hegde where the second last inequality follows from the Cauchy-Schwarz inequality. For the last term, we write it in a matrix form as P j =i qijβiβ j A li A lj = A T l QβA l where (Qβ)ij = qijβiβ j for i = j and (Qβ)ij = 0 for i = j. Then |A T l QβA l | Qβ A l 2 Qβ F A 2 1,2, where Qβ 2 F = P i =j q2 ijβ2 i (β j)2 (maxi =j q2 ij) P i β2 i P j(β j)2 (maxi =j q2 ij) β 2 β 2. Ultimately, X j =i qijβiβ j A li A lj (max i =j qij) β β A 2 1,2 e O(k3/m2). Under Assumption k = O ( n log n), then e O(k3/m2) O (k/m log2 n). As a result, the two terms above are bounded by the same amount O (k/m log n) w.h.p., so we complete the proof of the claim. Claim 5 In the decomposition (5), |E3|, |E5|, |E7| and |E9| are at most O (k/m log2 n). Proof Recall that E[x2 i |S] = 1 and qi = P[i S] = Θ(k/m) for S = supp(x ), then E3 = E[ x , β x , β ε2 l ] = σ2 εES Ex |S[ X i,j S βiβ jx i x j] = σ2 εES[ X i S βiβ i] = X i σ2 εqiβiβ i Denote Q = diag(q1, q2, . . . , qm), then |E3| = |σ2 ε Qβ, β | σ2 ε Q β β e O(σ2 εk2/m) = e O(k3/mn) where we have used β e O( k) w.h.p. and σε O(1/ n). For convenience, we handle the seventh term before E5: E7 = E[u T εεT v A l , x 2] = E[ A l , x 2]u T E[εεT ]v = X i σ2 ε u, v qi A2 li = σ2 ε u, v AT l QAl To bound this term, we use Claim 9 in Appendix D to have u = A α + εu e O( k) w.h.p. and u, v e O( k) w.h.p. Consequently, |E7| σ2 ε Q Al 2| u, v | e O(k2/mn) because Al 2 O(m/n) and σε O(1/ n). Now, the firth term E5 is expressed as follows E5 = E A l , x x T (βv T + β u T )εεl = A T l E x x T (βv T + β u T )E[εεl] = σ2 εA T l Q(vlβ + ulβ ) Observe that |E5| σ2 ε A T l Q(vlβ + ulβ ) σ2 ε A T l Q vlβ + ulβ and that vlβ + ulβ 2 u β e O(k) w.h.p. using the result u e O(k) and β e O(k) from Claim 3, then E5 bounded by e O(k2/mn). The last term E9 = E[u T εεT vε2 l ] = u T E εεT ε2 l v = 9σ4 ε u, v Provably Accurate Double-Sparse Coding because the independent entries of ε and E[ε4 l ] = 9σ4 ε. Therefore, |E9| 9σ4 ε u v e O(k2/n2). Since m = O(n) and k O ( n log n), we obtain the same bound O (k/m log2 n) for |E3|, |E5|, |E7| and |E9|, and conclude the proof of the claim. Combining the bounds from Claim 4, 5 for every single term in (5), we finish the proof for Lemma 1. B.2. Proof of Lemma 2 We prove this lemma by using the same strategy used to prove Lemma 1. 
Mu,v E[ y, u y, v y Ry T R] = E[ A x + ε, u A x + ε, v (A R x + εR)(A R x + εR)T ] = E x , β x , β + x T (βv T + β u T )ε + u T εεT v A R x x T A T R + A R x εT R + εRx T A T R + εRεT R = M1 + + M8, in which only nontrivial terms are kept in place, including M1 = E[ x , β x , β A R x x T A T R ] M2 = E[ x , β x , β εRεT R] M3 = E[x T (βv T + β u T )εA R x εT R] M4 = E[x T (βv T + β u T )εεRx T A T R ] M5 = E[u T εεT v A R x x T A T R ] M6 = E[u T εεT v A R x εT R] M7 = E[u T εεT vεT Rx T A T R ] M8 = E[u T εεT vεRεT R] By swapping inner product terms and taking advantage of the independence, we can show that M6 = E[A R x u T εεT vεT R] = 0 and M7 = E[u T εεT vεT Rx T A T R ] = 0. The remaining are bounded in the next claims. Claim 6 In the decomposition (6), i U V qiciβiβ i A R,i A T R,i + E 1 + E 2 + E 3 where E 1 = P i/ U V qiciβiβ i A R,i A T R,i, E 2 = P i =j qijβiβ i A R,j A T R,j and E 3 = P i =j qij(βi A R,iβ j A T R,j+ β i A R,iβj A T R,j) have norms bounded by O (k/m log n). Proof The expression of M1 is obtained in the same way as E1 is derived in the proof of Lemma 1. To prove the claim, we bound all the terms with respect to the spectral norm of A R and make use of Assumption A4 to find the exact upper bound. For the first term E 1, rewrite E 1 = A R,SD1A T R,S where S = [m]\(U V ) and D1 is a diagonal matrix whose entries are qiciβiβ i. Clearly, D1 maxi S|qiciβiβ i| O (k/m log n) as shown in Claim 4, then E 1 max i S |qiciβiβ i| A R,S 2 max i S |qiciβiβ i| A R 2 O (k/m log n) Nguyen, Wong, and Hegde where A R,S A R O(1). The second term E 2 is a sum of positive semidefinite matrices, and β O(k log n), then i =j qijβiβ i A R,j A T R,j max i =j qij X j A R,j A T R,j (max i =j qij) β β A R A T R which implies that E 2 (maxi =j qij) β β A R 2 e O(k3/m2). Observe that E 3 has the same form as the last term in Claim 4, which is E 3 = A T R QβA R . Then E 3 Qβ A R 2 (max i =j qij) β β A R 2 e O(k3/m2) By Claim 3, we have β and β are bounded by O( k log n), and note that k O ( n/ log n), then we complete the proof for Lemma 6. Claim 7 In the decomposition (6), M2, M3, M4, M5 and M8 have norms bounded by O (k/m log n). Proof Recall the definition of Q in Claim 5 and use the fact that E[x x T ] = Q, we can get M2 = E[ x , β x , β εRεT R] = P i σ2 εqiβiβ i Ir. Then, M2 σ2 ε maxi qi β β O(σ2 εk2 log2 n/m). The next three terms all involve A R whose norm is bounded according to Assumption A4. Specifically, M3 = E[x T (βv T + β u T )εA R x εT R] = E[A R x x T (βv T + β u T )εεT R] = A R E[x x T ](βv T + β u T )E[εεT R] = A R Q(βv T + β u T )E[εεT R], M4 = E[x T (βv T + β u T )εεRx T A T R ] = E[εRεT (vβT + uβ T )x x T A T R ] = E[εRεT ](vβT + uβ T )E[x x T ]A T R = E[εRεT ](vβT + uβ T )QA T R , and the fifth term M5 = E[u T εεT v A R x x T A T R ] = σ2 εu T v A R E[x x T ]A T R = σ2 εu T v A R QA T R . We already have E[εεT R] = σ2 ε, Q O(k/m) and |u T v| e O(k) (proof of Claim 9), then the remaining work is to bound βv T +β u T , then the bound of vβT +uβ T directly follows. We have βv T = A uv T A u v e O(k). Therefore, all three terms M3, M4 and M5 are bounded in norm by e O(σ2 εk2/m) e O(k3/mn). The remaining term is M8 = E[u T εεT vεRεT R] = E[ X i,j uivjεiεj εRεT R] i R uiviε2 i εRεT R ] + E[ X i =j uivjεiεj εRεT R] = σ4 εu Rv T R Provably Accurate Double-Sparse Coding where u R = A R α + (εu)R and v R = A R α + (εv)R. We can see that u R A R α + (εu)R e O( k). Therefore, M8 e O(σ4 εk) = e O(k3/n2). 
Since m = O(n) and k O ( n log n), then we can bound all the above terms by O (k/m log n) and finish the proof of Claim 7. Combine the results of Claim 6 and 7, we complete the proof of Lemma 2. Appendix C. Analysis of Main Algorithm C.1. Simple Encoding We can see that (Asx y)sgn(x)T is random over y and x that is obtained from the encoding step. We follow (Arora et al., 2015) to derive the closed form of gs = E[(Asx y)sgn(x)T ] by proving that the encoding recovers the sign of x with high probability as long as As is close enough to A . Lemma 8 Assume that As is δ-close to A for δ = O(r/n log n) and µ n 2k , and k Ω(log m) then with high probability over random samples y = A x + ε sgn(threshold C/2 (As)T y = sgn(x ) (7) Proof [Proof of Lemma 8] We follow the same proof strategy from (Arora et al., 2015) (Lemmas 16 and 17) to prove a more general version in which the noise ε is taken into account. Write S = supp(x ) and skip the superscript s on As for the readability. What we need is to show S = {i [m] : A i, y C/2} and then sgn( As i, y ) = sgn(x i ) for each i S with high probability. Following the same argument of (Arora et al., 2015), we prove in below a stronger statement that, even conditioned on the support S, S = {i [m] : | A i, y | C/2} with high probability. Rewrite A i, y = A i, A x + ε = A i, A i x i + X j =i A i, A j x j + A i, ε , and observe that, due to the closeness of A i and A i, the first term is either close to x i or equal to 0 depending on whether or not i S. Meanwhile, the rest are small due to the incoherence and the concentration in the weighted average of noise. We will show that both Zi = P S\{i} A i, A j x j and A i, ε are bounded by C/8 with high probability. The cross-term Zi = P S\{i} A i, A j x j is a sum of zero-mean independent sub-Gaussian random variables, which is another sub-Gaussian random variable with variance σ2 Zi = P S\{i} A i, A j 2. Note that A i, A j 2 2 A i, A j 2 + A i A i, A j 2 2µ2/n + 2 A i A i, A j 2, where we use Cauchy-Schwarz inequality and the µ-incoherence of A . Therefore, σ2 Zi 2µ2k/n + 2 A T S(A i A i) 2 F 2µ2k/n + 2 A S 2 A i A i 2 O(1/ log n), 2k , to conclude 2µ2k/n O(1/ log n) we need 1/k = O(1/ log n), i.e. k = Ω(log n). Applying Bernstein s inequality, we get |Zi| C/8 with high probability. What Nguyen, Wong, and Hegde remains is to bound the noise term A i, ε . In fact, A i, ε is sum of n Gaussian random variables, which is a sub-Gaussian with variance σ2 ε. It is easy to see that | A i, ε | σε log n with high probability. Notice that σε = O(1/ n). Finally, we combine these bounds to have |Zi + A i, ε | C/4. Therefore, for i S, then | A i, y | C/2 and negligible otherwise. Using union bound for every i = 1, 2, . . . , m, we finish the proof of the Lemma. Lemma 8 enables us to derive the expected update direction gs = E[(Asx y)sgn(x)T ] explicitly. C.2. Approximate Gradient in Expectation Proof [Proof of Lemma 5] Having the result from Lemma 8, we are now able to study the expected update direction gs = E[(Asx y)sgn(x)T ]. Recall that As is the update at the s-th iteration and x threshold C/2((As)T y). Based on the generative model, denote pi = E[x i sgn(x i )|i S], qi = P[i S] and qij = P[i, j S]. Throughout this section, we will use ζ to denote any vector whose norm is negligible although they can be different across their appearances. A i denotes the sub-matrix of A whose i-th column is removed. To avoid overwhelming appearance of the superscript s, we skip it from As for neatness. 
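Since the derivation below repeatedly conditions on the encoding step recovering sgn(x*) (Lemma 8), the following minimal sketch of that step may be helpful. The sizes, the threshold constant C, the closeness level of the estimate, and the data generation are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the hard-threshold encoding step analyzed in Lemma 8:
#   x = threshold_{C/2}(A^T y), which w.h.p. shares the sign pattern of x*.
# All constants and sizes below are illustrative assumptions.
import numpy as np

def threshold(z, t):
    """Keep entries with |z_i| > t, zero out the rest."""
    return np.where(np.abs(z) > t, z, 0.0)

rng = np.random.default_rng(0)
n, m, k, C, sigma_eps = 1024, 2048, 8, 1.0, 0.01

A_true = rng.standard_normal((n, m))
A_true /= np.linalg.norm(A_true, axis=0)
# A column-wise close estimate A^s of A* (small per-column perturbation).
A_est = A_true + (0.1 / np.sqrt(n)) * rng.standard_normal((n, m))

x_true = np.zeros(m)
S = rng.choice(m, size=k, replace=False)
x_true[S] = rng.choice([-1.0, 1.0], size=k)      # nonzeros of magnitude C, as in the model
y = A_true @ x_true + sigma_eps * rng.standard_normal(n)

x_enc = threshold(A_est.T @ y, C / 2)
# Expected to print True when the incoherence/closeness conditions of Lemma 8 hold.
print(np.array_equal(np.sign(x_enc), np.sign(x_true)))
```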
Denote Fx is the event under which the support of x is the same as that of x , and Fx is its complement. In other words, 1Fx = 1[sgn(x) = sgn(x )] and 1Fx + 1 Fx = 1. gs i = E[(Ax y)sgn(xi)] = E[(Ax y)sgn(xi)1Fx ] ζ Using the fact that y = A x + ε and that under Fx we have Ax = A Sx S = A SAT Sy = A SAT SA x + A SAT Sε. Using the independence of ε and x to get rid of the noise term, we get gs i = E[(A SAT S In)A x 1Fx ] + E[(A SAT S In)εsgn(xi)1Fx ] ζ = E[(A SAT S In)A x sgn(xi)1Fx ] ζ (Independence of ε and x s) = E[(A SAT S In)A x sgn(x i )(1 1 Fx )] ζ (Under Fx event) = E[(A SAT S In)A x sgn(x i )] ζ Recall from the generative model assumptions that S = supp(x ) is random and the entries of x are pairwise independent given the support, so gs i = ESEx |S[(A SAT S In)A x sgn(x i )] ζ = pi ES,i S[(A SAT S In)A i] ζ = pi ES,i S[(A i AT i In)A i] + pi ES,i S[ X l S,l =i A l AT l A i] ζ = piqi(A i AT i In)A i + pi X l [m],l =i qil A l AT l A i ζ = piqi(λi A i A i) + pi A idiag(qij)AT i A i ζ where λs i = As i, A i . Let ξs i = AR, idiag(qij)AT i A i/qi for j = 1, . . . , m, we now have the full expression of the expected approximate gradient at iteration s: gs R,i = piqi(λi As R,i A R,i + ξs i ) ζR. (8) Provably Accurate Double-Sparse Coding What remains is to bound norms of ξs and ζ. We have As R, i As i O( p m/n) w.h.p. Then, along with the fact that A i = 1, we can bound ξs i ξs i As Ri, i max j =i qij qi As i O(k/n). (9) Next, we show that norm of ζ is negligible. In fact, Fx happens with very high probability, then it suffices to bound norm of (Ax y)sgn(xi) which will be done using Lemma 12 and Lemma 11 in Section D. This concludes the proof for Lemma 5. Appendix D. Sample Complexity In previous sections, we rigorously analyzed both initialization and learning algorithms as if the expectations gs, e and Mu,v were given. Here we show that corresponding estimates based on empirical means are sufficient for the algorithms to succeed, and identify how may samples are required. Technically, this requires the study of their concentrations around their expectations. Having had these concentrations, we are ready to prove Theorems 4 and 5. The entire section involves a variety of concentration bounds. Here we make heavy use of Bernstein s inequality for different types of random variables (including scalar, vector and matrix). The Bernstein s inequality is stated as follows. Lemma 9 (Bernstein s Inequality) Suppose that Z(1), Z(2), . . . , Z(p) are p i.i.d. samples from some distribution D. If E[Z] = 0, Z(j) R almost surely and E[Z(j)(Z(j))T σ2 for each j, then j=1 Z(j) e O R holds with probability 1 n ω(1). Since all random variables (or their norms) are not bounded almost surely in our model setting, we make use of a technical lemma that is used in Arora et al. (2015) to handle the issue. Lemma 10 (Arora et al. (2015)) Suppose a random variable Z satisfies P[ Z R(log(1/ρ))C] ρ for some constant C > 0, then (a) If p = n O(1), it holds that Z(j) e O(R) for each j with probability 1 n ω(1). (b) E[Z1 Z eΩ(R)] = n ω(1). This lemma suggests that if 1 p Pp i=1 Z(j)(1 1 Z(j) eΩ(R)) concentrates around its mean with high probability, then so does 1 p Pp i=1 Z(j) because the part outside the truncation level can be ignored. Since all random variables of our interest are sub-Gaussian or a product of sub-Gaussian that satisfy this lemma, we can apply Lemma 9 to the corresponding truncated random variables with carefully chosen truncation levels. Then the original random variables concentrate likewise. 
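The truncation device of Lemma 10 can be summarized by the following sketch: instead of averaging the raw samples, one averages samples clipped at a polylogarithmic multiple of R, and the discarded tail contributes only a negligible amount. The distribution and truncation level here are illustrative assumptions.

```python
# Sketch of the truncated-empirical-mean device used with Bernstein's inequality:
# average Z^(j) * 1{|Z^(j)| <= trunc_level} instead of the raw samples.
import numpy as np

rng = np.random.default_rng(0)
p = 10_000
# A product of two sub-Gaussians is sub-exponential; used here as a stand-in for Z.
Z = rng.standard_normal(p) * rng.standard_normal(p)

R = 1.0                                  # scale such that |Z| <= R * polylog w.h.p.
trunc_level = R * np.log(p)              # a (poly)log(p) multiple of R

Z_trunc = Z * (np.abs(Z) <= trunc_level)  # truncated samples, bounded almost surely
print(np.mean(Z_trunc), np.mean(Z))       # the two empirical means essentially coincide
```

Bernstein's inequality is then applied to the bounded variables Z_trunc, and the untruncated average concentrates likewise.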
In the next proofs, we define suitable random variables and identify good bounds of R and σ2 for them. Note that in this section, the expectations are taken over y by conditioning Nguyen, Wong, and Hegde on u and v. This aligns with the construction that the estimators of e and Mu,v are empirical averages over i.i.d. samples of y, while u and v are kept fixed. Due to the dependency on u and v, these (conditional) expectations inherit randomness from u and v, and we will formulate probabilistic bounds for them. The application of Bernstein s inequality requires a bound on E[ZZT (1 1 Z eΩ(R))] . We achieve that by the following technical lemma, where Z is a standardized version of Z. Lemma 11 Suppose a random variable Z ZT = a T where a 0 and T is positive semidefinite. They are both random. Suppose P[a A] = n ω(1) and B > 0 is a constant. Then, E[ Z ZT (1 1 Z B)] A E[T] + O(n ω(1)) Proof To show this, we make use of the decomposition Z ZT = a T and a truncation for a. Specifically, E[ Z ZT (1 1 Z B)] = E[a T(1 1 Z B)] E[a(1 1a A)T(1 1 Z B)] + E[a1a AT(1 1 Z B)] E[a(1 1a A)T] + E[a1a A T (1 1 Z B)] A E[T] + E[ a T 2(1 1 Z B)]E[1a A] 1/2 A E[T] + E[ Z 4(1 1 Z B)]P[a A] 1/2 A E[T] + B2 P[a A] 1/2 A E[T] + O(n ω(1)), where at the third step we used T(1 1 Z B)] T because of the fact that T is the positive semi-definite and 1 1 Z B {0, 1} . Then, we finish the proof of the lemma. D.1. Sample Complexity of Algorithm 1 In Algorithm 1, we empirically compute the scores be and the reduced weighted covariance matrix c Mu,v to produce an estimate for each column of A . Since the construction of c Mu,v depends upon the support estimate b R given by ranking be, we denote it by c M b R u,v. We will show that we only need p = e O(m) samples to be able to recover the support of one particular atom and up to some specified level of column-wise error with high probability. Lemma 12 Consider Algorithm 1 in which p is the given number of samples. For any pair u and v, then with high probability a) be e O (k/m log2 n) when p = eΩ(m) and b) c M b R u,v MR u,v O (k/m log n) when p = eΩ(mr) where b R and R are respectively the estimated and correct support sets of one particular atom. D.1.1. Proof of Theorem 4 Using Lemma 12, we are ready to prove the Theorem 4. According to Lemma 1 when U V = {i}, we can write be as be = qiciβiβ i A R,i A R,i + perturbation terms + (be e), Provably Accurate Double-Sparse Coding and consider be e as an additional perturbation with the same magnitude O (k/m log2 n) in the sense of w.h.p. The first part of Lemma 3 suggests that when u and v share exactly one atom i, then the set b R including r largest elements of be is the same as supp(A i ) with high probability. Once we have b R, we again write c M b R u,v using Lemma 2 as c M b R u,v = qiciβiβ i A R,i A T R,i + perturbation terms + (c M b R u,v MR u,v), and consider c M b R u,v MR u,v as an additional perturbation with the same magnitude O (k/m log n) in the sense of the spectral norm w.h.p. Using the second part of Lemma 3, we have the top singular vectors of c M b R u,v is O (1/ log n) -close to A R,i with high probability. Since every vector added to the list L in Algorithm 1 is close to one of the dictionary, then A0 must be δ-close to A . In addition, the nearness of A0 to A is guaranteed via an appropriate projection onto the convex set B = {A|A close to A0 and A 2 A }. Finally, we finish the proof of Theorem 4. D.1.2. Proof of Lemma 12, Part a For some fixed l [n], consider p i.i.d. realizations Z(1), Z(2), . . . 
, Z(p) of the random variable Z y, u y, v y2 l , then bel = 1 p Pp i=1 Z(i) and el = E[Z]. To show that be e O (k/m log2 n) holds with high probability, we first study the concentration for the l-th entry of be e and then take the union bound over all l = 1, 2, . . . , n. We derive upper bounds for |Z| and its variance E[Z2] in order to apply Bernstein s inequality in (12) to the truncated version of Z. Claim 8 |Z| e O(k) and E[Z2] e O(k2/m) with high probability. Again, the expectation is taken over y by conditioning on u and v, and therefore is still random due to the randomness of u and v. To show Claim 8, we begin with proving the following auxiliary claim. Claim 9 y e O( k) and | y, u | e O( k) with high probability. Proof From the generative model, we have y = A Sx S + ε A Sx S + ε A S x S + ε , where S = supp(x ). From Claim 2, x S e O( k) and ε e O(σε n) w.h.p. In addition, A is overcomplete and has bounded spectral norm, then A S A O(1). Therefore, y e O( k) w.h.p., which is the first part of the proof. To bound the second term, we write it as | y, u | = | A Sx S + ε, u | | x S, A T Su | + | ε, u |. Similar to y, we have u e O( k) w.h.p. and hence A T Su A T S u O( k) with high probability. Since u and x are independent sub-Gaussian and x S, A T Su are subexponential with variance at most O( k), | x S, A T Su | e O(k) w.h.p. Similarly, | ε, u | e O( k) w.h.p. Consequently, | y, u | e O( k) w.h.p., and we conclude the proof of the claim. Nguyen, Wong, and Hegde Proof [Proof of Claim 8] We have Z = y, u y, v y2 l = y, u y, v ( A l , x + εl)2 with y, u y, v e O(k) w.h.p. according to Claim 9. What remains is to bound y2 l = ( A l , x + εl)2. Because A l , x is sub-Gaussian with variance ES(P i S A 2 li ) A T 2 1,2 = O(1), then | A l , x | O(log n) w.h.p. Similarly for εl, |εl| O(σε log n) w.h.p. Ultimately, | A l , x +εl| O(log n), and hence we obtain with high probability the bound |Z| e O(k). To bound the variance term, we write Z2 = y, v 2y2 l y, u 2y2 l . Note that, from the first part, we get y, v 2y2 l e O(k) and |Z| e O(k) w.h.p.. We apply Lemma 11 with some appropriate scaling to both terms, then E[Z2(1 1|Z| eΩ(k))] e O(k)E[ y, u 2y2 l ] + O(n ω(1)), where E[ y, u 2y2 l ] is equal to el for pair u, v with v = u. From Lemma 1 and its proof in Appendix Section Analysis of Initialization Algorithm , E[ y, u 2y2 l ] = i=1 qiciβ2 i A 2 li + perturbation terms, in which the perturbation terms are bounded by O (k/m log2 n) w.h.p. (following Claims 4 and 5). The dominant term P i qiciβ2 i A 2 li (max qiciβ2 i ) A l 2 e O(k/m) w.h.p. because |βi| O(log m) (Claim 3). Then we complete the proof of the second part. Proof [Proof of Lemma 12, Part a] We are now ready to prove Part a of Lemma 12. We apply Bernstein s inequality in Lemma 9 for the truncated random variable Z(i)(1 1|Z(i)| eΩ(R)) with R = e O(k) and variance σ2 = e O(k2/m) from Claim 8, then i=1 Z(i)(1 1|Z(i)| eΩ(R)) E[Z(1 1|Z| eΩ(R))] e O(k) p O (k/m log n), (11) w.h.p. for p = eΩ(m). Then bel = 1 p Pp i=1 Z(i) also concentrates with high probability. Take the union bound over l = 1, 2, . . . , n, we get be e O (k/m log n) with high probability and complete the proof of 12, Part a. D.1.3. Proof of Lemma 12, Part b Next, we will prove that c M b R u,v MR u,v O (k/m log n) with high probability. We only need to prove the concentration inequalities for the case when conditioned on the event that b R is equivalent to R w.h.p. 
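Before turning to the matrix concentration argument, we record a small numerical sketch of the two empirical quantities controlled by Lemma 12: the score vector and the reduced weighted covariance matrix. The data generation and sizes are illustrative assumptions, and this is not Algorithm 1 verbatim (in particular, Algorithm 1 additionally checks, via the top two singular values, that u and v share exactly one atom).

```python
# Sketch of the empirical score vector e_hat and the reduced weighted covariance
# M_hat restricted to the estimated support R_hat. Sizes are illustrative.
import numpy as np

def sample_y(A, k, sigma_eps, rng):
    m = A.shape[1]
    x = np.zeros(m)
    S = rng.choice(m, size=k, replace=False)
    x[S] = rng.choice([-1.0, 1.0], size=k)
    return A @ x + sigma_eps * rng.standard_normal(A.shape[0])

rng = np.random.default_rng(0)
n, m, k, r, p, sigma_eps = 128, 256, 5, 16, 4000, 0.01
A = rng.standard_normal((n, m)); A /= np.linalg.norm(A, axis=0)

u = sample_y(A, k, sigma_eps, rng)
v = sample_y(A, k, sigma_eps, rng)
Y = np.column_stack([sample_y(A, k, sigma_eps, rng) for _ in range(p)])  # fresh samples

w = (Y.T @ u) * (Y.T @ v)                # weights <y^(i),u><y^(i),v>
e_hat = (Y ** 2) @ w / p                 # e_hat_l = (1/p) sum_i w_i (y_l^(i))^2
R_hat = np.argsort(-np.abs(e_hat))[:r]   # r largest scores: estimated row support

Y_R = Y[R_hat, :]
M_hat = (Y_R * w) @ Y_R.T / p            # (1/p) sum_i w_i y_R^(i) (y_R^(i))^T
U_svd, s, _ = np.linalg.svd(M_hat)
z = U_svd[:, 0]                          # top singular vector: candidate column estimate
```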
Again, what we need to derive are an upper norm bound R of the matrix random variable Z y, u y, v y Ry T R and its variance. Claim 10 Z e O(kr) and E[ZZT ] e O(k2r/m) hold with high probability. Proof We have Z | y, u y, v | y R 2 with | y, u y, v | e O(k) w.h.p. (according to Claim 9) whereas y R 2 = P i R y2 l O(r log2 n) w.h.p. because yl O(log n) w.h.p. (proof of Claim 8). This implies Z e O(kr) w.h.p. The second part is handled similarly as in the proof of Claim 8. We take advantage of the bounds of c Mu,v in Lemma 2. Specifically, using the first part Z e O(kr) and y, v 2 y R 2 e O(kr), and applying Lemma 11, then E[ZZT (1 1 Z eΩ(kr))] e O(kr) E[ y, u 2y Ry T R] + e O(kr)O(n ω(1)) e O(kr) Mu,u , Provably Accurate Double-Sparse Coding where Mu,u arises from the application of Lemma 2. Recall that i qiciβ2 i A R,i A T R,i + perturbation terms, where the perturbation terms are all bounded by O (k/m log n) w.h.p. by Claims 6 and 7. In addition, i qiciβ2 i A R,i A T R,i (max i qiciβ2 i ) A R 2 e O(k/m) A 2 e O(k/m) w.h.p. Finally, the variance bound is e O(k2r/m) w.h.p. Then, applying Bernstein s inequality in Lemma 9 to the truncated version of Z with R = e O(kr) and variance σ2 = e O(k2r/m) and obtain the concentration for the full Z to get c MR u,v MR u,v e O(kr) p O (k/m log n) w.h.p. when the number of samples is p = eΩ(mr) under Assumption A4.1. We have proved that c MR u,v MR u,v O (k/m log n) as conditioned on the support consistency event holds w.h.p. c M b R u,v MR u,v O (k/m log n) is easily followed by the law of total probability through the tail bounds on the conditional and marginal probabilities (i.e. P[ c MR u,v MR u,v O (k/m log n)| b R = R]) and P[ b R = R]. We finish the proof of Lemma 12, Part b for both cases of the spectral bounds. D.2. Proof of Theorem 5 and Sample Complexity of Algorithm 2 In this section, we prove Theorem 5 and identify sample complexity per iteration of Algorithm 2. We divide the proof into two steps: 1) show that when As is (δs, 2)-near to A for δs = O (1/ log n), the approximate gradient estimate bgs is (α, β, γs)-correlated-whp with A with γs O(k2/mn) + αo(δ2 s) , and 2) show that the nearness is preserved at each iteration. These correspond to showing the following lemmas: Lemma 13 At iteration s of Algorithm 2, suppose that As has each column correctly supported and is (δs, 2)-near to A and that η = O(m/k). Denote R = supp(As i), then the update bgs R,i is (α, β, γs)-correlated-whp with A R,i where α = Ω(k/m), β = Ω(m/k) and γs O(k2/mn) + αo(δ2 s) for δs = O (1/ log n). Note that this is a finite-sample version of Lemma 6. Lemma 14 If As is (δs, 2)-near to A and number of samples used in step s is p = eΩ(m), then with high probability As+1 A 2 A . Proof [Proof of Theorem 5] The correlation of bgi with A i , described in Lemma 13, implies the descent of column-wise error according to Theorem 1. Along with Lemma 14, the theorem follows directly. Nguyen, Wong, and Hegde D.2.1. Proof of Lemma 13 We prove Lemma 13 by obtaining a tail bound on the difference between bgs R,i and gs R,i using the Bernstein s inequality in Lemma 9. Lemma 15 At iteration s of Algorithm 2, suppose that As has each column correctly supported and is (δs, 2)-near to A . For R = supp(As i) = supp(A i ), then bgs R,i gs R,i O(k/m) (o(δs)+O(ϵs)) with high probability for δs = O (1/ log n) and ϵs = O( p k/n) when p = eΩ(m + σ2 ε mnr To prove this lemma, we study the concentration of bgs R,i, which is a sum of random vector of the form (y Ax)Rsgn(xi). 
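Concretely, a non-authoritative sketch of this estimator, under assumed shapes and an assumed threshold constant, is the following.

```python
# Sketch of the empirical gradient estimate studied in Lemma 15:
#   g_hat_{R,i} = (1/p) * sum_j (A x^(j) - y^(j))_R * sgn(x_i^(j)),
# with x^(j) = threshold_{C/2}(A^T y^(j)). Shapes and C are illustrative assumptions.
import numpy as np

def empirical_gradient_column(A, Y, i, R, C=1.0):
    """Restricted empirical gradient for column i over its support R (Y is n-by-p)."""
    X = A.T @ Y
    X = np.where(np.abs(X) > C / 2, X, 0.0)           # encoding step
    residual = A @ X - Y                              # (A x - y) for every sample
    return (residual[R, :] @ np.sign(X[i, :])) / Y.shape[1]
```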
We consider random variable Z (y Ax)Rsgn(xi)|i S, with S = supp(x ) and x = threshold C/2(AT y). Then, using the following technical lemma to bridge the gap in concentration of the two variables. We adopt this strategy from Arora et al. (2015) for our purpose. Claim 11 Suppose that Z(1), Z(2), . . . , Z(N) are i.i.d. samples of the random variable Z = (y Ax)Rsgn(xi)|i S. Then, j=1 Z(j) E[Z] o(δs) + O(ϵs) (12) holds with probability when N = eΩ(k + σ2 εnr), δs = O (1/ log n) and ϵs = O( p Proof [Proof of Lemma 15] Once we have done the proof of Claim 11, we can easily prove Lemma 15. We recycle the proof of Lemma 43 in Arora et al. (2015). Write W = {j : i supp(x (j))} and N = |W|, then express bg R,i as j (y(j) Ax(j))Rsgn(x(j) i ), where 1 |W| P j(y(j) Ax(j))Rsgn(x(j) i ) is distributed as 1 N PN j=1 Z(j) with N = |W|. Note that E[(y Ax)Rsgn(xi)] = E[(y Ax)Rsgn(xi)1i S] = E[Z]P[i S] = qi E[Z] with qi = Θ(k/m). Following Claim 11, we have bgs R,i gs R,i O(k/m) 1 j=1 Z(j) E[Z] O(k/m) (o(δs) + O(ϵs)), holds with high probability as p = Ω(m N/k). Substituting N in Claim 11, we obtain the results in Lemma 15. Proof [Proof of Claim 11] We are now ready to prove the claim. What we need are good bounds for Z and its variance, then we can apply Bernstein s inequality in Lemma 9 for the truncated version of Z, then Z is also concentrates likewise. Claim 12 Z R holds with high probability for R = e O(δs k + µk/ n + σε r) with δs = O (1/ log n). Provably Accurate Double-Sparse Coding Proof From the generative model and the support consistency of the encoding step, we have y = A x + ε = A Sx S + ε and x S = AT Sy = AT SA Sx S + AT Sε. Then, (y Ax)R = (A R,Sx S + εR) AR,SAT SA Sx S AR,SAT Sε = (A R,S AR,S)x S + AR,S(Ik AT SA S)x S + (In A SAT S)R ε. Using the fact that x S and ε are sub-Gaussian and that Mw e O(σw M F ) holds with high probability for a fixed M and a sub-Gaussian w of variance σ2 w, we have (y Ax)Rsgn(xi) e O( A R,S AR,S F + AR,S(Ik AT SA S) F +σε (In A SAT S)R F ). Now, we need to bound those Frobenius norms. The first quantity is easily bounded as A R,S AR,S F A S A S F δs since A is δs-close to A . To handle the other two, we use the fact that UV F U V F . Using this fact for the second term, we have AR,S(Ik AT SA S) F AR,S (Ik AT SA S) F , where AR,S AR O(1) due to the nearness. The second part is rearranged to take advantage of the closeness and incoherence properties: Ik AT SA S F Ik A T SA S (A S A S)T A S F Ik A T SA S F + (A S A S)T A S F Ik A T SA S F + A S A S A S F µk/ n + O(δs where we have used Ik A T SA S F µk/ n because of the µ-incoherence of A , A S A S F δs k in (13) and A S A O(1). Accordingly, the second Frobenius norm is bounded by AR,S(Ik AT SA S) F O µk/ n + δs The noise term is handled using the eigen-decomposition UΛUT of A SAT S, then with high probability (In A SAT S)R F = (UUT UΛUT )R F = UR (In Λ) F In Λ UR F O( r), (15) where the last inequality In Λ O(1) follows by A S A A A + A 3 A O(1) due to the nearness. Putting (13), (14) and (15) together, we obtain the bounds in Claim 12. Next, we determine a bound for the variance of Z. Claim 13 E[ Z 2] = E[ (y Ax)Rsgn(xi) 2|i S] σ2 holds with high probability for σ2 = O(δ2 sk + k2/n + σ2 εr) with δs = O (1/ log n). Nguyen, Wong, and Hegde Proof We explicitly calculate the variance using the fact that x S is conditionally independent given S, and so is ε. x S and ε are also independent and have zero mean. 
Then we can decompose the norm into three terms in which the dot product is zero in expectation and the others can be shortened using the fact that E[x Sx T S ] = Ik, E[εεT ] = σεIn. E[ (y Ax)Rsgn(xi) 2|i S] = E[ (A R,S AR,SAT SA S)x S + (In A SAT S)R ε 2|i S]] = E[ A R,S AR,SAT SA S 2 F |i S] + σ2 εE[ In A SAT S)R 2 F |i S]. Then, by re-writing A R,S AR,SAT SA S as before, we get the form (A R,S AR,S)+AR,S(Ik AT SA S) in which the first term has norm bounded by δs k. The second is further decomposed as E[ AR,S(Ik AT SA S) 2 F |i S] sup S AR,S 2E[ Ik AT SA S 2 F |i S], (16) where sup S AR,S AR O(1). We will bound E[ Ik AT SA S 2 F |i S] O(kδ2 s) + O(k2/n) using the proof from Arora et al. (2015): E[ Ik AT SA S 2 F |i S] = E[ X j S (1 AT j A j)2 + X j S AT j A , j 2|i S] 1 4 A j A j 2] + qij X j =i AT j A , j 2 + qi AT i A , i 2 + qi AT , i A i 2, where A , i is the matrix A with the i-th column removed, qij O(k2/m2) and qi O(k/m). For any j = 1, 2, . . . , m, AT j A , j 2 = A T j A , j + (A j A j)T A , j 2 l =j A j, A l 2 + (A j A j)T A , j 2 l =j A j, A l 2 + A j A j 2 A , j 2 µ2 + δ2 s. The last inequality invokes the µ-incoherence, δ-closeness and the spectral norm of A . Similarly, we come up with the same bound for AT i A , i 2 and AT , i A i 2. Consequently, E[ Ik AT SA S 2 F |i S] O(kδ2 s) + O(k2/n). (17) For the last term, we invoke the inequality (15) (Claim 12) to get E[ (In A SAT S)R 2 F |i S] r (18) Putting (16), (17) and (18) together and using AR 1, we obtain the variance bound of Z: σ2 = O(δ2 sk + k2/n + σ2 εr) with δs = O (1/ log n) . Finally, we complete the proof. We now apply truncated Bernstein s inequality to the random variable Z(j)(1 1 Z(j) Ω(R)) with R and σ2 in Claims 12 and 13, which are R = e O(δs k + µk/ n + σε r) and σ2 = O(δ2 sk + k2/n + σ2 εr). Then, (1/N) PN j= Z(j) also concentrates: i=1 Z(j) E[Z] e O R = o(δs) + O( p Provably Accurate Double-Sparse Coding holds with high probability when N = eΩ(k + σ2 εnr). Then, we finally finish the proof of Claim 11. Proof [Proof of Lemma 13] With Claim 11, we study the concentration of bgs R,i around its mean gs R,i. Now, we consider this difference as an error term of the expectation gs R,i and using Lemma 6 to show the correlation of bgs R,i. Using the expression in Lemma 5 with high probability, we can write bgs R,i = gs R,i + (gs R,i bgs R,i) = 2α(AR,i A R,i) + v, where v α AR,i A R,i + O(k/m) (o(δs) + O(ϵs)). By Lemma 6, we have bgs R,i is (α, β, γs)-correlated-whp with A R,i where α = Ω(k/m), β = Ω(m/k) and γs O(k/m) (o(δs) + O( p k/n)) , then we have done the proof Lemma 13. D.2.2. Proof of Lemma 14 We have shown the correlation of bgs with A w.h.p. and established the descent property of Algorithm 2. The next step is to show that the nearness is preserved at each iteration. To prove As+1 A 2 A holds with high probability, we recall the update rule As+1 = As ηPH(bgs), where PH(bgs) = H bgs. Here H = (hij) where hij = 1 if i supp(A j) and hij = 0 otherwise. Also, note that As is (δs, 2)-near to A for δs = O (1/ log n). We already proved that this holds for the exact expectation gs in Lemma 7. To prove for bgs, we again apply matrix Bernstein s inequality to bound PH(gs) PH(bgs) by O(k/m) because η = Θ(m/k) and A = O(1). Consider a matrix random variable Z PH((y Ax)sgn(x)T ). Our goal is to bound the spectral norm Z and, both E[ZZT ] and E[ZT Z] since Z is asymmetric. To simplify our notations, we denote by x R the vector x by zeroing out the elements not in R. Also, denote Ri = supp(hi) and S = supp(x). 
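For orientation, the masked update whose stability this argument establishes can be sketched as follows; this is a minimal sketch under assumed shapes, not the paper's implementation.

```python
# Sketch of the masked gradient update A^{s+1} = A^s - eta * P_H(g_hat^s), where the
# 0/1 mask H keeps only entries inside the (estimated) column supports. Illustrative only.
import numpy as np

def descent_step(A, Y, H, eta, C=1.0):
    """One update of Algorithm 2 with the hard support mask H (same shape as A)."""
    X = A.T @ Y
    X = np.where(np.abs(X) > C / 2, X, 0.0)           # x = threshold_{C/2}(A^T y)
    g_hat = (A @ X - Y) @ np.sign(X).T / Y.shape[1]   # empirical (A x - y) sgn(x)^T
    return A - eta * (H * g_hat)                      # P_H zeroes entries off the support
```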
Then Z can be written explicitly as Z = [(y Ax)R1sgn(x1), . . . , (y Ax)Rmsgn(xm)], where many columns are zero since x is k-sparse. The following claims follow from the proof of Claim 42 in Arora et al. (2015). Here we state and detail some important steps. Claim 14 Z e O(k) holds with high probability. Proof With high probability i S (y Ax)Risgn(xi) 2 where we use Claim 12 with (y Ax)R e O(δs k) w.h.p., then Z e O(k) holds w.h.p. Claim 15 E[ZZT ] O(k2/n) and E[ZT Z] e O(k2/n) with high probability. Nguyen, Wong, and Hegde Proof The first term is easily handled. Specifically, with high probability E[ZZT ] E[ X i S (y Ax)Risgn(xi)2(y Ax)T Ri] = E[ X i S (y Ax)Ri(y Ax)T Ri] O(k2/n), where the last inequality follows from the proof of Claim 42 in Arora et al. (2015), which is tedious to be repeated. To bound E[ZT Z] , we use bound of the full matrix (y Ax)sgn(x)T . Note that y Ax e O( k) w.h.p. is similar to what derived in Claim 12. Then with high probability, E[ZT Z] E[sgn(x)(y Ax)T (y Ax)sgn(x)T ] e O(k) E[sgn(x)sgn(x)T ] e O(k2/m). where E[sgn(x)sgn(x)T ] = diag(q1, q2, . . . , qm) has norm bounded by O(k/m). We now can apply Bernstein s inequality for the truncated version of Z with R = e O(k) and σ2 = e O(k2/m), then with p = e O(m), PH(gs) PH(bgs) e O(k) holds with high probability. Finally, we invoke the bound η = O(m/k) and complete the proof. Appendix E. A Special Case: Orthonormal A We extend our results for the special case where the dictionary is orthonormal. As such, the dictionary is perfectly incoherent and bounded (i.e., µ = 0 and A = 1). Theorem 7 Suppose that A is orthonormal. When p1 = eΩ(n) and p2 = eΩ(nr), then with high probability Algorithm 1 returns an initial estimate A0 whose columns share the same support as A and with (δ, 2)-nearness to A with δ = O (1/ log n). The sparsity of A can be achieved up to r = O min( n log2 n, n k2 log2 n) . We use the same initialization procedure for this special case and achieve a better order of r. The proof of Theorem 7 follows the analysis for the general case with following two results: Claim 16 (Special case of Claim 3) Suppose that u = A α + εu is a random sample and U = supp(α). Let β = A T u, then w.h.p., we have (a) |βi αi| σε log n for each i and (b) β O( k log n + σε n log n). Proof We have β = A T u = α+A T ϵu, then βi αi = A i, ϵu and β α = ϵu . Using probability bounds of A i, ϵu , ϵu and α in Claim 2, we have the claim proved. We draw from the claim that for any i / U V , |βiβ i| O(σε log2 n) and have the following result: Provably Accurate Double-Sparse Coding Lemma 16 Fix samples u and v and suppose that y = A x + ε is a random sample independent of u, v. The expected value of the score for the lth component of y is given by: el E[ y, u y, v y2 l ] = X i U V qiciβiβ i A 2 li + perturbation terms where qi = P[i S], qij = P[i, j S] and ci = E[x4 i |i S]. Moreover, the perturbation terms have absolute value at most O k/n log2 n max(1/ n, k2/n) . Proof Lemma follows Lemma 1 via Claim 3 except that the second term of E1 is bounded by O(k log2 n/n3/2). Appendix F. Extensions of Arora et al. (2015) F.1. Sample complexity in noisy case In this section, we study the sample complexity of the algorithms in Arora et al. (2015) in the presence of noise. While noise with order σε = O(1/ n) does not change the sample complexity of the initialization algorithm, it affects that of the descent stage. The analysis involves producing a sharp bound for bgs ,i gs i . 
Lemma 17 For a regular dictionary A , suppose As is (δs, 2)-near to A with δs = O (1/ log n), then with high probability bgs ,i gs i O(k/m) (o(δ)+O( p k/n)) when p = eΩ(m+σ2 ε mn2 Proof This follows directly from Lemma 15 where r = n. We tighten the original analysis to obtain the complexity eΩ(m) instead of eΩ(mk) for the noiseless case. Putting together with p = eΩ(mk) required by the initialization, we then have the overall sample complexity e O(mk+σ2 ε mn2 k ) for the algorithms in Arora et al. (2015) in the noise regime. F.2. Extension of Arora et al. (2015) s initialization algorithm for sparse case We study a simple and straightforward extension of the initialization algorithm of Arora et al. (2015) for the sparse case. This extension is produced by adding an extra projection, and is described in Figure 3. The recovery of the support of A is guaranteed by the following Lemma: Lemma 18 Suppose that z Rn is r-sparse whose nonzero entries are at least τ in magnitude. Provided z is δ-close to z and z0 = Hr(z) with δ = O (1/ log n) and r = O (log2 n), then z0 and z has the same support. Proof Since z0 is δ-close to z , then z0 z δ and |zi z i | δ for every i. For i supp(z ), |zi| |z i | |zi z i | τ δ and for i / supp(z ), |zi| δ. Since τ > O(1/ r) δ, then the r-largest entries of z are in the support z , and hence z0 and z has the same support. Nguyen, Wong, and Hegde Algorithm 3 Pairwise Reweighting with Hard-Thresholding Initialize L = Randomly divide p samples into two disjoint sets P1 and P2 of sizes p1 and p2 respectively While |L| < m. Pick u and v from P1 at random Construct the re-weighted covariance matrix c Mu,v: i=1 y(i), u y(i), v y(i)(y(i))T Compute the top singular values δ1, δ2 and top singular vector z of c Mu,v If δ1 Ω(k/m) and δ2 < O (k/m log n) z = Hr(z), where Hr keeps r largest entries of z If z is not within distance 1/ log n of any vector in L even with sign flip L = L {z} Return A0 = (L1, . . . , Lm) Theorem 8 Suppose that Assumptions B1-B4 hold and Assumptions A1-A3 satify with µ = O n k log3 n and r = O (log2 n). When p1 = eΩ(m) and p2 = eΩ(mk), then with high probability Algorithm 3 returns an initial estimate A0 whose columns share the same support as A and with (δ, 2)-nearness to A with δ = O (1/ log n). This algorithm requires r = O (log2 n), which is somewhat better than ours. However, the sample complexity and running time is inferior as compared with our novel algorithm. Appendix G. Neural Implementation of Our Approach We now briefly describe why our algorithm is neurally plausible . Basically, similar to the argument in Arora et al. (2015), we describe at a very high level how our algorithm can be implemented via a neural network architecture. One should note that although both our initialization and descent stages are non-trivial modifications of those in Arora et al. (2015), both still inherit the nice neural plausiblity property. G.1. Neural implementation of Stage 1: Initialization Recall that the initialization stage includes two main steps: (i) estimate the support of each column of the synthesis matrix, and (ii) compute the top principal component(s) of a certain truncated weighted covariance matrix. Both steps involve simple vector and matrix-vector manipulations that can be implemented plausibly using basic neuronal manipulations. For the support estimation step, we compute the product y, u y, u y y, followed by a thresholding. 
The inner products ⟨y, u⟩ and ⟨y, v⟩ can be computed by neurons in an online manner in which the samples arrive in sequence; the thresholding can be implemented via a ReLU-type non-linearity. For the second step, it is well known that the top principal components of a matrix can be computed in a neural (Hebbian) fashion using Oja's rule (Oja, 1992).

[Figure 4: Neural network implementation of Algorithm 2. The network takes the image y as input and produces the sparse representation x = threshold(A^T y) as output; the hidden layer represents the residual r between the image and its reconstruction Ax, and the Hebbian update is ΔA_ij = η r_i x_j. The weights A_ij are stored on synapses, but most of them are zero (indicated by dotted lines).]

G.2. Neural implementation of Stage 2: Descent

Our neural implementation of the descent stage (Algorithm 2), shown in Figure 4, mimics the architecture of Arora et al. (2015), which describes a simple two-layer network for computing a single gradient update of A. The only difference in our case is that most of the values in A are set to zero; in other words, our network is sparse. The network takes values y at the input layer and produces x at the output; there is an intermediate layer in between, connected to the output via synapses whose weights are stored in A. The weights are updated by Hebbian learning. In our case, since A is sparse (with support given by R, as estimated in the first stage), we enforce that the synapses outside the support remain inactive. In the output layer, as in the initialization stage, the neurons can use a ReLU-type non-linear activation function to enforce the sparsity of x.

References

Radosław Adamczak. A note on the sample complexity of the ER-SpUD algorithm by Spielman, Wang and Wright for exact recovery of sparsely used dictionaries. Journal of Machine Learning Research, 17:1–18, 2016.

Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon. Learning sparsely used overcomplete dictionaries. In Conference on Learning Theory, pages 123–137, 2014.

Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.

Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. More algorithms for provable dictionary learning. arXiv preprint arXiv:1401.0579, 2014a.

Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In Conference on Learning Theory, pages 779–806, 2014b.

Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. In Conference on Learning Theory, pages 113–149, 2015.

Jarosław Błasiok and Jelani Nelson. An improved analysis of the ER-SpUD dictionary learning algorithm. arXiv preprint arXiv:1602.05719, 2016.

Y-Lan Boureau, Francis Bach, Yann LeCun, and Jean Ponce. Learning mid-level features for recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2559–2566. IEEE, 2010.

Emmanuel J. Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

Niladri Chatterji and Peter Bartlett. Alternating minimization for dictionary learning with random initialization. arXiv preprint arXiv:1711.03634v1, 2017.

Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.

Kjersti Engan, Sven Ole Aase, and J. Hakon Husoy. Method of optimal directions for frame design. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pages 2443–2446. IEEE, 1999.

Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 399–406, 2010.

Rémi Gribonval, Rodolphe Jenatton, and Francis Bach. Sparse and spurious: dictionary learning with noise and outliers. IEEE Transactions on Information Theory, 61(11):6298–6319, 2015a.

Rémi Gribonval, Rodolphe Jenatton, Francis Bach, Martin Kleinsteuber, and Matthias Seibert. Sample complexity of dictionary learning and other matrix factorizations. IEEE Transactions on Information Theory, 61(6):3469–3486, 2015b.

Hamid Krim, Dewey Tucker, Stephane Mallat, and David Donoho. On denoising and best signal representation. IEEE Transactions on Information Theory, 45(7):2225–2238, 1999.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 689–696, 2009.

Arya Mazumdar and Ankit Singh Rawat. Associative memory using dictionary learning and expander decoding. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 267–273, 2017.

Erkki Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927–935, 1992.

Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.

Ron Rubinstein, Alfred M. Bruckstein, and Michael Elad. Dictionaries for sparse representation modeling. Proceedings of the IEEE, 98(6):1045–1057, 2010a.

Ron Rubinstein, Michael Zibulevsky, and Michael Elad. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58(3):1553–1564, 2010b.

Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Conference on Learning Theory, pages 37.1–37.18, 2012.

Jeremias Sulam, Boaz Ophir, Michael Zibulevsky, and Michael Elad. Trainlets: Dictionary learning in high dimensions. IEEE Transactions on Signal Processing, 64(12):3180–3193, 2016.

Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery using nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 2351–2360, 2015.

Lingxiao Wang, Xiao Zhang, and Quanquan Gu. A unified computational and statistical framework for nonconvex low-rank matrix estimation. arXiv preprint arXiv:1610.05275, 2016.

Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. Journal of Machine Learning Research, 17(1):3537–3580, 2016.