# Testing Sparsity over Known and Unknown Bases

Siddharth Barman¹, Arnab Bhattacharyya¹, Suprovat Ghoshal¹

*Equal contribution. ¹Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India. Correspondence to: Siddharth Barman, Arnab Bhattacharyya, Suprovat Ghoshal.

Sparsity is a basic property of real vectors that is exploited in a wide variety of machine learning applications. In this work, we describe property testing algorithms for sparsity that observe a low-dimensional projection of the input. We consider two settings. In the first setting, we test sparsity with respect to an unknown basis: given input vectors $y_1, \dots, y_p \in \mathbb{R}^d$ whose concatenation as columns forms $Y \in \mathbb{R}^{d \times p}$, does $Y = AX$ for matrices $A \in \mathbb{R}^{d \times m}$ and $X \in \mathbb{R}^{m \times p}$ such that each column of $X$ is $k$-sparse, or is $Y$ far from having such a decomposition? In the second setting, we test sparsity with respect to a known basis: for a fixed design matrix $A \in \mathbb{R}^{d \times m}$ and a given input vector $y \in \mathbb{R}^d$, is $y = Ax$ for some $k$-sparse vector $x$, or is $y$ far from having such a decomposition? We analyze our algorithms using tools from high-dimensional geometry and probability.

1. Introduction

Property testing is the study of algorithms that query their input a small number of times and distinguish between whether their input satisfies a given property or is far from satisfying that property. The quest for efficient testing algorithms was initiated by (Blum et al., 1993) and (Babai et al., 1991) and later explicitly formulated by (Rubinfeld & Sudan, 1996) and (Goldreich et al., 1998). Property testing can be viewed as a relaxation of the traditional notion of a decision problem, where the relaxation is quantified in terms of a distance parameter. There has been extensive work in this area over the last couple of decades; see, for instance, the surveys (Ron, 2008) and (Rubinfeld & Shapira, 2006) for some different perspectives.

As evident from these surveys, research in property testing has largely focused on properties of combinatorial and algebraic structures, such as bipartiteness of graphs, linearity of Boolean functions on the hypercube, membership in error-correcting codes, or representability of functions as concise Boolean formulae. In this work, we study the question of testing properties of continuous structures, specifically properties of vectors and matrices over the reals. Our computational model is a natural extension of the standard property testing framework, obtained by allowing queries to be linear measurements of the input.

Let $P \subseteq \mathbb{R}^d$ be a property of real vectors, and let $\mathrm{dist} : \mathbb{R}^d \to \mathbb{R}_{\ge 0}$ be a distance function such that $\mathrm{dist}(x) = 0$ for all $x \in P$. We say that an algorithm $\mathcal{A}$ is a tester for $P$ with respect to $\mathrm{dist}$ and with parameters $\varepsilon, \delta > 0$ if, for any input $y \in \mathbb{R}^d$, the algorithm $\mathcal{A}$ observes $My$, where $M \in \mathbb{R}^{q \times d}$ is a randomized matrix, and has the following guarantee:

(i) If $y \in P$, then $\Pr_M[\mathcal{A}(My) \text{ accepts}] \ge 1 - \delta$.

(ii) If $\mathrm{dist}(y) > \varepsilon$, then $\Pr_M[\mathcal{A}(My) \text{ accepts}] \le \delta$.

We call each inner product between a row of $M$ and $y$ a (linear) query, and the number of rows $q = q(\varepsilon, \delta)$ is the query complexity of the tester. The running time of the tester $\mathcal{A}$ is its running time on the outcome of its queries. As is typical in property testing, we do not count the time needed to evaluate the queries.
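As a concrete illustration of this query model, the following minimal sketch (Python/NumPy, our own illustrative code rather than anything prescribed by the paper) draws a random measurement matrix and returns the observation $My$ that a tester would see; the Gaussian choice of $M$ is an assumption that matches the oblivious testers analyzed later, and `q` plays the role of the query complexity.

```python
import numpy as np

def linear_queries(y, q, seed=None):
    """Observe q linear measurements of y: the tester sees only M @ y, never y itself."""
    rng = np.random.default_rng(seed)
    d = y.shape[0]
    # Each row of M defines one linear query <m_i, y>.
    M = rng.standard_normal((q, d))
    return M, M @ y

# Example: a 3-sparse vector in R^1000 observed through 50 linear queries.
y = np.zeros(1000)
y[[7, 42, 900]] = [0.5, -1.0, 2.0]
M, sketch = linear_queries(y, q=50, seed=0)
print(sketch.shape)  # (50,)
```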
If $P \subseteq \mathbb{R}^{d \times p}$ is a property of real matrices with an associated distance function $\mathrm{dist} : \mathbb{R}^{d \times p} \to \mathbb{R}_{\ge 0}$, testing is defined similarly: given an input matrix $Y \in \mathbb{R}^{d \times p}$, the algorithm observes $MY$ for a random matrix $M \in \mathbb{R}^{q \times d}$, with analogous completeness and soundness properties. A linear projection of an input vector or matrix to a low-dimensional space is also called a linear sketch or a linear measurement. The technique of obtaining small linear sketches of high-dimensional vectors has been used to great effect in algorithms for streaming (e.g., (Alon et al., 1996; McGregor, 2014)) and numerical linear algebra (see (Woodruff, 2014) for an excellent survey). Because GPUs are specially designed to optimize matrix-vector computation, many modern optimization and learning algorithms work with linear sketches of their input.

We focus on testing whether a vector is sparse with respect to some basis. (With a slight abuse of notation, we use the term basis to denote the set of columns of a design matrix; the columns might not be linearly independent.) A vector $x$ is said to be $k$-sparse if it has at most $k$ nonzero coordinates. Sparsity is a structural characteristic of signals of interest in a diverse range of applications. It is a pervasive concept throughout modern statistics and machine learning, and algorithms for solving inverse problems under sparsity constraints are among the great success stories of the optimization community (see the book (Hastie et al., 2015)). The natural property testing question we consider is whether there exists a solution to a linear inverse problem under a sparsity constraint. There are two settings in which we investigate the sparsity testing problem.

(a) In the first setting, the basis is not known in advance. For input vectors $y_1, y_2, \dots, y_p \in \mathbb{R}^d$, the property to test is whether there exist a matrix $A \in \mathbb{R}^{d \times m}$ and $k$-sparse unit vectors $x_1, x_2, \dots, x_p \in \mathbb{R}^m$ such that $y_i = Ax_i$ for all $i \in [p]$. Note that $m$ is specified as a parameter and could be much larger than $d$ (the overcomplete case). In this setting, we restrict the unknown $A$ to be an $(\varepsilon, k)$-RIP matrix, which means that $(1-\varepsilon)\|x\| \le \|Ax\| \le (1+\varepsilon)\|x\|$ for any $k$-sparse $x$. This is a standard assumption made in many related works (see Section 1.2 for details). In this setting, we design an efficient tester for this property that projects the inputs to $O(\varepsilon^{-2}\log p)$ dimensions and, informally speaking, rejects if, for all $(\varepsilon, k)$-RIP matrices $A$, there is some $y_i$ such that $y_i - Ax_i$ has large norm for all approximately sparse $x_i$.

(b) In the second setting, a design matrix $A \in \mathbb{R}^{d \times m}$ is known explicitly, and the property to test is whether a given input vector $y \in \mathbb{R}^d$ equals $Ax$ for a $k$-sparse vector $x \in \mathbb{R}^m$. For instance, $A$ can be the Fourier basis or an overcomplete dictionary in an image processing application. We approach this problem in full generality, without placing any restriction on the structure of $A$. Informally, our main result in this setting is that for any design matrix $A$, there exists a tester projecting the input $y$ to $O(k\log m)$ dimensions that rejects if $y - Ax$ has large norm for every $O(k)$-sparse $x$. The running time of the tester is polynomial in $m$. As we describe in Section 1.2, previous work in numerical linear algebra yields a tester with the same query complexity and with qualitatively similar soundness guarantees, but which requires running time exponential in $m$ or assumptions about the matrix $A$.
Remark 1.1 (Problem Formulation). Note that the known and unknown design matrix settings are quite different from each other. In particular, in the known design setting, the input is a single vector. However, given a single input vector $y \in \mathbb{R}^d$, the analogous unknown design testing question would be moot, since one can always take the vector $y$ itself to be the design matrix $A$, in which case $y$ trivially admits a $1$-sparse representation. For the same reason, unknown design testing is interesting only when the number of vectors $p$ exceeds $m$.

In both of the above tests, the measurement matrix is a random matrix with i.i.d. gaussian entries, chosen so as to preserve norms and certain other geometric properties upon dimensionality reduction. (If evaluating the queries efficiently were an objective, one could also use sparse dimension reduction matrices (Dasgupta et al., 2010; Kane & Nelson, 2014; Bourgain et al., 2015), but we do not pursue this direction here.) In particular, our testers are oblivious to the input. It is a very interesting open question whether non-oblivious testers can strengthen the above results.

1.1. Our Results

We now present our results more formally. For an integer $m > 0$, let $S^{m-1} = \{x \in \mathbb{R}^m : \|x\| = 1\}$, and let $\mathrm{Sp}^m_k = \{x \in S^{m-1} : \|x\|_0 \le k\}$. Here, $\|x\|_0$ denotes the sparsity of the vector, $\|x\|_0 := |\{i \in [m] : x_i \ne 0\}|$; without any subscript, $\|\cdot\|$ denotes the $\ell_2$-norm, $\|x\| := \sqrt{\sum_i x_i^2}$.

Theorem 1.2 (Unknown Design Matrix). Fix $\varepsilon, \delta \in (0,1)$ and positive integers $d, k, m$ and $p$ such that $(k/m)^{1/8} < \varepsilon < \frac{1}{100}$ and $k \ge 10\log\frac{1}{\varepsilon}$. There exists a tester with query complexity $O(\varepsilon^{-2}\log(p/\delta))$ which, given as input vectors $y_1, y_2, \dots, y_p \in \mathbb{R}^d$, has the following behavior (where $Y$ is the matrix having $y_1, y_2, \dots, y_p$ as columns):

Completeness: If $Y$ admits a decomposition $Y = AX$, where $A \in \mathbb{R}^{d \times m}$ satisfies $(\varepsilon, k)$-RIP and $X \in \mathbb{R}^{m \times p}$ has each column in $\mathrm{Sp}^m_k$, then the tester accepts with probability $1 - \delta$.

Soundness: Suppose $Y$ does not admit a decomposition $Y = A(X + Z) + W$ with
1. the design matrix $A \in \mathbb{R}^{d \times m}$ being $(\varepsilon, k)$-RIP, with $\|a_i\| = 1$ for every $i \in [m]$;
2. the coefficient matrix $X \in \mathbb{R}^{m \times p}$ being column-wise $\ell$-sparse, where $\ell = O(k/\varepsilon^4)$;
3. the error matrices $Z \in \mathbb{R}^{m \times p}$ and $W \in \mathbb{R}^{d \times p}$ satisfying $\|z_i\|_\infty \le \varepsilon^2$ and $\|w_i\|_2 \le O(\varepsilon^{1/4})$ for all $i \in [p]$.

Then the tester rejects with probability $1 - \delta$.

The contrapositive of the soundness guarantee of the above theorem states that if the tester accepts, then the matrix $Y$ admits a factorization of the form $Y = A(X+Z)+W$, with error matrices $Z$ and $W$ obeying $\ell_\infty$ and $\ell_2$ error bounds respectively. The matrix $X+Z$ is a sparse matrix up to $\ell_\infty$-based thresholding, and $W$ is an additive $\ell_2$-error term. (Theorem 1.2 can be restated in terms of incoherent, instead of RIP, design matrices as well; this follows from the fact that the incoherence and RIP constants of a matrix are order-wise equivalent. This observation is formalized in Appendix F.)

Theorem 1.3 (Known Design Matrix). Fix $\varepsilon, \delta \in (0,1)$, positive integers $d, k, m$, and a matrix $A \in \mathbb{R}^{d \times m}$ such that $\|a_i\| = 1$ for every $i \in [m]$. There exists a tester with query complexity $O(k\varepsilon^{-2}\log(m/\delta))$ that behaves as follows on an input vector $y \in \mathbb{R}^d$:

Completeness: If $y = Ax$ for some $x \in \mathrm{Sp}^m_k$, then the tester accepts with probability 1.

Soundness: If $\|Ax - y\|_2 > \varepsilon$ for every $x$ with $\|x\|_0 \le K$, then the tester rejects with probability $1 - \delta$. Here, $K = O(k/\varepsilon^2)$.

The running time of the tester is $\mathrm{poly}(m, k, 1/\varepsilon)$.

A different way of stating the result is that the tester, using $O(k\varepsilon^{-2}\log(m/\delta))$ linear queries, accepts with probability 1 if $y = Ax$ for a $k$-sparse $x \in \mathbb{R}^m$, and rejects with probability $1 - \delta$ if $\|Ax - y\| > \varepsilon\|x\|$ for every $O(k/\varepsilon^2)$-sparse $x$. To complement this result, we show that a better tradeoff between the sparsity and the reconstruction error is likely to be impossible.
Theorem 1.4 (Hardness). Assume SAT does not have $n^{O(\log\log n)}$-time algorithms, and let $\eta$ be any constant less than 1. Then there does not exist a polynomial-time algorithm that, given input $A \in \mathbb{R}^{d \times m}$ (where $\|a_i\| = 1$ for every $i \in [m]$), $y \in \mathbb{R}^d$ and $\varepsilon > 0$, distinguishes with constant probability between the following two cases: (i) $y = Ax$ for a $k$-sparse $x$, and (ii) $\|y - Ax\| > \varepsilon\|x\|^{\eta}$ for every $(k/\varepsilon^2)$-sparse $x$.

Note that the above hardness applies to any polynomial-time algorithm, not just sketching algorithms.

We also give tolerant variants of these testers (Theorems H.1 and H.2) which can handle bounded noise in the completeness case. Moreover, the tester for the known design case can be converted into a new sketching algorithm for sparse recovery (Theorem D.1). Finally, we also give an algorithm for testing dimensionality, which is based on similar techniques.

Theorem 1.5 (Testing Dimensionality). Fix $\varepsilon, \delta \in (0,1)$ and positive integers $d, k$ and $p$, where $k \ge 10\varepsilon^2\log d$. There exists a tester with query complexity $O(p\log\delta^{-1})$ which, given as input vectors $y_1, \dots, y_p \in S^{d-1}$, has the following behavior:

Completeness: If $\mathrm{rank}(Y) \le k$, then the tester accepts with probability $1 - \delta$.

Soundness: If $\mathrm{rank}_\varepsilon(Y) \ge k'$, then the tester rejects with probability $1 - \delta$. Here, $k' = 20k/\varepsilon^2$.

The soundness criterion in the above theorem is stated in terms of the $\varepsilon$-approximate rank of a matrix (see Definition E.1). This is a well-studied relaxation of the standard definition of rank, and it has applications in approximation algorithms, communication complexity and learning theory (see (Alon et al., 2013) and the references therein).

1.2. Related Work

Although, to the best of our knowledge, the testing problems we consider have not been explicitly investigated before, there are several related areas of study that frame our results in their proper context.

Unknown Design setting. In the setting of the unknown design matrix, the question of recovering the design matrix and the sparse representation (as opposed to our problem of testing their existence) is called the dictionary learning or sparse coding problem. The first work to give a dictionary learning algorithm with provable guarantees was (Spielman et al., 2012), where the dictionary was restricted to be square. For the more common overcomplete setting, (Arora et al., 2014) and (Agarwal et al., 2014) independently gave algorithms with provable guarantees for dictionaries satisfying incoherence and RIP respectively. All of these (as well as other more recent) works assume distributions from which the input samples are generated in an i.i.d. fashion. In contrast, our work is in the agnostic setting and is hence incomparable with these results.

It is known that the dictionary learning problem is NP-hard, even for square dictionaries (Razaviyayn et al., 2014; Tillmann, 2015). In fact, (Tillmann, 2015) shows that unless SAT has a quasi-polynomial time algorithm, it is impossible, given $Y \in \mathbb{R}^{d \times p}$, to approximate in polynomial time the minimum $k$ up to a factor $2^{\log^{1-\varepsilon} d}$ (for any $\varepsilon > 0$) such that $Y = AX$ where each column of $X \in \mathbb{R}^{d \times p}$ is $k$-sparse. This motivates our bicriteria relaxation of both the sparsity and the additive error in Theorem 1.2.
Known Design setting. Some results about testing sparsity in the known design setting are implicit in recent work on streaming algorithms and oblivious subspace embeddings. Of particular interest are the following results:

Theorem 1.6 (Implicit in (Kane et al., 2010)). Fix $\varepsilon \in (0,1)$, positive integers $m, k$, and an invertible matrix $A \in \mathbb{R}^{m \times m}$. Then there is a tester with query complexity $O(\varepsilon^{-2}\log m)$ that, for an input $y \in \mathbb{R}^m$, accepts with probability at least $2/3$ if $y = Ax$ for some $k$-sparse $x \in \mathbb{Z}^m$, and rejects with probability at least $2/3$ if $y \ne Ax$ for all $(1+\varepsilon)k$-sparse $x \in \mathbb{Z}^m$. The running time of the algorithm is $\mathrm{poly}(m, 1/\varepsilon)$.

Theorem 1.7 (Implicit in prior work; see (Woodruff, 2014)). Fix $\varepsilon, \delta \in (0,1)$, positive integers $d, k, m$, and a matrix $A \in \mathbb{R}^{d \times m}$. Then there is a tester with query complexity $O(k\varepsilon^{-2}\log(m/\delta))$ that, for an input vector $y \in \mathbb{R}^d$, accepts with probability 1 if $y = Ax$ for some $k$-sparse $x$, and rejects with probability at least $1 - \delta$ if $\|y - Ax\| > \varepsilon$ for all $k$-sparse $x$. The running time of the tester is the time required to solve the following optimization problem:
$$\hat{x} = \arg\min_{x \in K} \|SAx - Sy\| = \arg\min_{x \in K} \|S(Ax - y)\|, \qquad (1)$$
where $S \in \mathbb{R}^{q \times d}$ is a random sketch matrix (with $q \ll d$) and $K = \{x : \|x\|_0 \le k\}$.

Detailed descriptions of the algorithms and proof sketches for the above theorems are given in Section B.4.

The algorithms from the above theorems come with significant limitations. In particular, the guarantee of Theorem 1.6 holds only when the design matrix is invertible. On the other hand, the running time of the algorithm in Theorem 1.7 is the cost of solving the optimization problem in Equation (1), which is known to be NP-hard for general matrices.
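To make the optimization problem (1) concrete, here is a rough sketch of the sketch-and-solve approach behind Theorem 1.7. Since the constrained minimization is NP-hard in general, the code substitutes orthogonal matching pursuit (scikit-learn's `OrthogonalMatchingPursuit`) as a heuristic stand-in for the exact minimizer; the sketch dimension `q` and the acceptance threshold are illustrative choices, not the constants from the theorem.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def sketch_and_test(A, y, k, eps, q, seed=None):
    """Accept if some k-sparse x makes ||S(Ax - y)|| small, with OMP used as a
    heuristic stand-in for the NP-hard minimization in Equation (1)."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    S = rng.standard_normal((q, d)) / np.sqrt(q)    # random sketch matrix
    SA, Sy = S @ A, S @ y
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    omp.fit(SA, Sy)
    residual = np.linalg.norm(SA @ omp.coef_ - Sy)
    return residual <= eps                           # illustrative threshold

# Example: y lies exactly in the span of 5 columns of A.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 1000))
x = np.zeros(1000)
x[rng.choice(1000, 5, replace=False)] = rng.standard_normal(5)
y = A @ x
print(sketch_and_test(A, y, k=5, eps=1e-6, q=100, seed=2))  # should typically print True
```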
The problem of testing sparsity has also been studied in non-sketching settings, where the algorithm is allowed access to the entire input. In particular, (Natarajan, 1995) gave a bicriteria approximation algorithm, where the blowup in the sparsity is proportional to $\|A\|_2^2$ (which can be large if $A$ is ill-conditioned).

Testing Dimensionality. In (Czumaj et al., 2000), some problems in computational geometry were studied from the property testing perspective, but the problems involved only discrete structures. (Krauthgamer & Sasson, 2003) studied the problem of testing dimensionality, but their notion of farness from being low-dimensional is different from ours. (In their setup, a sequence of vectors $y_1, \dots, y_p$ is $\varepsilon$-far from being $d$-dimensional if at least $\varepsilon p$ vectors need to be removed to make it of dimension $d$.) (Chierichetti et al., 2017) gave approximation algorithms for computing the approximate rank of a matrix, in the setting where the algorithms have full access to the input.

1.3. Discussion

A standard approach to designing a testing algorithm for a property $P$ is the following: we identify an alternative property $P'$ which can be tested efficiently and exactly, while satisfying the following:

(i) Completeness: If an instance satisfies $P$, then it satisfies $P'$.

(ii) Soundness: If an instance satisfies $P'$, then it is close to satisfying $P$.

In other words, we reduce the property testing problem to that of finding an efficiently testable property $P'$, which can be interpreted as a surrogate for the property $P$. The inherently geometric nature of the problems studied in this paper motivates us to look for surrogates $P'$ based on convex geometry and high-dimensional probability.

For the unknown design setting, we are intuitively looking for a $P'$ based on a quantity $\omega$ that robustly captures sparsity and is easily computable using linear queries, in the sense that $\omega$ is small when the input vectors have a sparse coding and large when they are far from any sparse coding. Moreover, $\omega$ needs to be invariant with respect to isometries and nearly invariant with respect to near-isometries. A natural and widely used measure of structure that satisfies these properties is the gaussian width.

Definition 1.8. The gaussian width of a set $S \subseteq \mathbb{R}^d$ is
$$\omega(S) = \mathbb{E}_g\Big[\sup_{v \in S}\,\langle g, v\rangle\Big],$$
where $g \in \mathbb{R}^d$ is a random vector drawn from $N(0,1)^d$, i.e., a vector of independent standard normal variables.

The gaussian width of $S$ measures how well, on average, the vectors in $S$ correlate with a randomly chosen direction. It is invariant under orthogonal transformations of $S$, as the distribution of $g$ is spherically symmetric. It is a well-studied quantity in high-dimensional geometry ((Vershynin, 2015; Mendelson & Vershynin, 2002)), optimization ((Chandrasekaran et al., 2012; Amelunxen et al., 2013)) and statistical learning theory ((Bartlett & Mendelson, 2002)). The following bounds are well known.

Lemma 1.9 (See, for example, (Rudelson & Vershynin, 2008; Vershynin, 2015)).
(i) If $S$ is a finite subset of $S^{d-1}$, then $\omega(S) \le \sqrt{2\log|S|}$.
(ii) $\omega(S^{d-1}) \le \sqrt{d}$.
(iii) If $S \subseteq S^{d-1}$ is of dimension $k$, then $\omega(S) \le \sqrt{k}$.
(iv) $\omega(\mathrm{Sp}^d_k) \le 2\sqrt{3k\log(d/k)}$ when $d/k > 2$ and $k \ge 4$.

In the context of Theorems 1.2 and 1.5, one can observe that whenever a given set satisfies sparsity or dimensionality constraints, its gaussian width is small (points (iii) and (iv) of the above lemma). Therefore, one can hope to test dimensionality or sparsity by computing an empirical estimate of the gaussian width and comparing the estimate to the bounds in Lemma 1.9. While completeness of such testers would follow directly from concentration of measure, establishing soundness requires us to show that approximate converses of points (iii) and (iv) hold as well, i.e., that whenever the gaussian width of the set $S$ is small, $S$ can be approximated by sets which are approximately sparse in some design matrix (or have low rank). For the soundness direction of Theorem 1.2, these arguments are made precise using Lemma 3.3 and Theorem 3.2, which show that sets of small gaussian width can be approximated by random projections of sparse vectors and by vectors with small $\ell_\infty$-norm. For Theorem 1.5, we use Lemma E.2, which shows that sets with small gaussian width have small approximate rank.
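The following snippet illustrates Definition 1.8 and point (iv) of Lemma 1.9 numerically. For $S = \mathrm{Sp}^d_k$, the supremum $\sup_{v \in S}\langle g, v\rangle$ equals the $\ell_2$-norm of the $k$ largest-magnitude coordinates of $g$, so the width can be estimated by a simple Monte Carlo average; the sample size is an arbitrary illustrative choice.

```python
import numpy as np

def width_sparse_sphere(d, k, n_samples=2000, seed=0):
    """Monte Carlo estimate of the gaussian width of Sp^d_k (k-sparse unit vectors).
    For a fixed Gaussian g, the supremum of <g, v> over k-sparse unit v equals the
    l2-norm of the k largest-magnitude coordinates of g."""
    rng = np.random.default_rng(seed)
    sups = []
    for _ in range(n_samples):
        g = rng.standard_normal(d)
        top_k = np.sort(np.abs(g))[-k:]
        sups.append(np.linalg.norm(top_k))
    return float(np.mean(sups))

d, k = 1000, 10
est = width_sparse_sphere(d, k)
bound = 2 * np.sqrt(3 * k * np.log(d / k))  # the bound from Lemma 1.9(iv)
print(f"estimate ~ {est:.2f}, upper bound ~ {bound:.2f}")
```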
For the known design setting, we are looking for a $P'$ which would ensure that if a given point $y \in \mathbb{R}^d$ satisfies $P'$, then it is close to having a sparse representation in the matrix $A$. Towards this end, the approximate Carathéodory theorem states that if a point $y \in \mathbb{R}^d$ belongs to the convex hull of (the columns of) $A$, then it is close to another point which admits a sparse representation. On the other hand, if a unit vector $x \in S^{m-1} \cap \mathbb{R}^m_+$ were $k$-sparse to begin with, then the corresponding $y = Ax$ would belong to the convex hull of $\sqrt{k}\,A$. Taken together, these observations suggest that one can take $P'$ to be membership in the convex hull of $\sqrt{k}\,A$. This intuition is made precise in the analysis of the tester in Section 4.

1.4. Organization

Section 2 introduces notation and preliminaries used in the rest of the paper. In Sections 3 and 4, we design and analyze the testers for the unknown and known basis settings respectively. Section 5 contains empirical results which supplement Section 3. In Section B we prove additional lemmas used in the proof of Theorem 3.2, and in Section A we prove Theorem 3.2. In Section C, we prove Theorem C.1, a stronger version of Theorem 1.4. In Section D, we show that Theorem 1.3 yields a sketching algorithm for sparse recovery. In Section E, we design and analyze the dimensionality tester. In Section G, we describe the results for testing sparsity in the known case that are implicit in previous work. Finally, in Section H, we give noise-tolerant testers for the known and unknown basis settings.

2. Preliminaries

Given $S \subseteq \mathbb{R}^d$, we use $\mathrm{conv}(S)$ to denote the convex hull of $S$. For a vector $x \in \mathbb{R}^d$, we use $\|x\|_p$ to denote its $\ell_p$-norm, and we drop the subscript when $p = 2$. We denote the $\ell_2$-distance of a point $x$ to a set $S$ by $\mathrm{dist}(x, S)$. We recall the definition of $\varepsilon$-isometry:

Definition 2.1. Given sets $S \subseteq \mathbb{R}^m$ and $S' \subseteq \mathbb{R}^n$ (for some $m, n \in \mathbb{N}$), we say that $S'$ is an $\varepsilon$-isometry of $S$ if there exists a mapping $\psi : S \to S'$ which satisfies
$$\forall x, y \in S: \quad (1-\varepsilon)\|x - y\| \le \|\psi(x) - \psi(y)\| \le (1+\varepsilon)\|x - y\|.$$

For the unknown design setting, we require the notion of the Restricted Isometry Property, defined as follows:

Definition 2.2 ($(\varepsilon, k)$-RIP). A matrix $A \in \mathbb{R}^{d \times m}$ satisfies $(\varepsilon, k)$-RIP if for every $x \in \mathrm{Sp}^m_k$ the following holds:
$$(1-\varepsilon)\|x\| \le \|Ax\| \le (1+\varepsilon)\|x\|. \qquad (2)$$

We use the following version of Gordon's Theorem repeatedly in this work.

Theorem 2.3 (Gordon's Theorem (Gordon, 1985)). Given $S \subseteq S^{D-1}$ and a random gaussian matrix $G \sim \frac{1}{\sqrt{d}} N(0,1)^{d \times D}$, we have
$$\mathbb{E}\Big[\max_{x \in S}\|Gx\|_2\Big] \le 1 + \frac{\omega(S)}{\sqrt{d}}.$$

It directly implies the following generalization of the Johnson-Lindenstrauss lemma.

Theorem 2.4 (Generalized Johnson-Lindenstrauss lemma). Let $S \subseteq S^{n-1}$. Then there exists a linear transformation $\Phi : \mathbb{R}^n \to \mathbb{R}^d$, for $d = O\big(\frac{\omega(S)^2}{\varepsilon^2}\big)$, such that $\Phi$ is an $\varepsilon$-isometry on $S$. Moreover, $\Phi \sim \frac{1}{\sqrt{d}} N(0,1)^{d \times n}$ is an $\varepsilon$-isometry on $S$ with high probability.

It can easily be verified that the quantity $\max_{x \in S}\|Gx\|_2$ is $1$-Lipschitz with respect to $G$. Therefore, using gaussian concentration for Lipschitz functions, we get the following corollary:

Corollary 2.5. Let $S$ and $G$ be as in Theorem 2.3. Then for all $\varepsilon > 0$, we have
$$\Pr\Big[\max_{x \in S}\|Gx\|_2 \ge 1 + (1+\varepsilon)\,\frac{\omega(S)}{\sqrt{d}}\Big] \le \exp\big(-O(\varepsilon\,\omega(S))^2\big).$$

The following lemma gives concentration for the gaussian width:

Lemma 2.6 (Concentration of the gaussian width (Boucheron et al., 2013)). Let $S \subseteq \mathbb{R}^d$ and let $W = \sup_{v \in S}\langle g, v\rangle$, where $g$ is drawn from $N(0,1)^d$. Then
$$\Pr[|W - \mathbb{E}\,W| > u] < 2e^{-u^2/(2\sigma^2)},$$
where $\sigma^2 = \sup_{v \in S}\|v\|_2^2$. Notice that the bound is dimension independent.

Lastly, we use the $\ell_2$-variant of the approximate Carathéodory theorem:

Theorem 2.7 (Theorem 0.1.2 of (Vershynin, 2016)). Let $X = \{w_1, \dots, w_p\}$, where $\|w_i\| \le 1$ for every $i \in [p]$. Then for every choice of $z \in \mathrm{conv}(X)$ and every $k \in \mathbb{N}$, there exist $w_{i_1}, w_{i_2}, \dots, w_{i_k}$ such that
$$\Big\|\frac{1}{k}\sum_{j \in [k]} w_{i_j} - z\Big\|_2 \le \frac{1}{\sqrt{k}}.$$

2.1. Algorithmic Estimation of the Gaussian Width and the Norm of a Vector

We record here simple lemmas bounding the number of linear queries needed to estimate the gaussian width of a set and the length of a vector.

Lemma 2.8 (Estimating the gaussian width using linear queries). For any $u > 4$, $\varepsilon \in (0, 1/2)$ and $\delta > 0$, there is a randomized algorithm that, given a set $S \subseteq \mathbb{R}^d$ with $\|v\| \in [1 \pm \varepsilon]$ for all $v \in S$, computes $\hat{\omega}$ such that $\omega(S) - u \le \hat{\omega} \le \omega(S) + u$ with probability at least $1 - \delta$. The algorithm makes $O(\log(1/\delta) \cdot |S|)$ linear queries to $S$.

Proof. By Lemma 2.6, for a random $g \sim N(0,1)^d$, the quantity $\sup_{v \in S}\langle g, v\rangle$ deviates from $\omega(S)$ by more than $u$ with probability at most $2e^{-16/4.5} < 0.1$. By the Chernoff bound, the median of $O(\log\delta^{-1})$ such trials satisfies the conditions required of $\hat{\omega}$ with probability at least $1 - \delta$.

Lemma 2.9 (Estimating the norm using linear queries). Given $\varepsilon \in (0, 1/2)$ and $\delta > 0$, for any vector $x \in \mathbb{R}^d$, only $O(\varepsilon^{-2}\log\delta^{-1})$ linear queries to $x$ suffice to decide whether $\|x\| \in [1-\varepsilon, 1+\varepsilon]$ with success probability $1 - \delta$.

Proof. It is easy to verify that $\mathbb{E}_{g \sim N(0,1)^d}[\langle g, x\rangle^2] = \|x\|^2$. Therefore, $\|x\|^2$ can be estimated to within a multiplicative error of $(1 \pm \varepsilon/2)$ by taking the average of the squares of $O\big(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\big)$ linear measurements. In the case $\|x\|^2 \le 2$, a multiplicative error of $(1 \pm \varepsilon/2)$ implies an additive error of $\varepsilon$. Furthermore, when $\|x\|^2 \ge 2$, a multiplicative error of $(1 \pm \varepsilon/2)$ implies that the estimate is at least $2(1 - \varepsilon/2) > 1 + \varepsilon$ for $\varepsilon < 1/2$.
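A minimal sketch of the estimators behind Lemmas 2.8 and 2.9: every quantity below is computed only from inner products $\langle g, v\rangle$ with random Gaussian vectors, i.e., from linear queries. The numbers of repetitions are illustrative rather than the exact constants from the lemmas.

```python
import numpy as np

def estimate_width(S, n_trials=9, seed=0):
    """Estimate the gaussian width of a finite set S (rows of a matrix) as the median
    over trials of sup_v <g, v>; each trial uses |S| linear queries (cf. Lemma 2.8)."""
    rng = np.random.default_rng(seed)
    trials = [np.max(S @ rng.standard_normal(S.shape[1])) for _ in range(n_trials)]
    return float(np.median(trials))

def estimate_sq_norm(x, n_queries=200, seed=0):
    """Estimate ||x||^2 from linear queries, using E[<g, x>^2] = ||x||^2 (cf. Lemma 2.9)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n_queries, x.shape[0]))
    return float(np.mean((G @ x) ** 2))

# Example: p = 50 unit vectors in R^200, stored as rows.
rng = np.random.default_rng(1)
Y = rng.standard_normal((50, 200))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
print(estimate_width(Y), estimate_sq_norm(Y[0]))
```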
3. Analysis for the Unknown Design Setting

In this section, we prove Theorem 1.2. Let $S$ denote the set $\{y_1, \dots, y_p\}$. Our testing algorithm is shown in Algorithm 1.

Algorithm 1: Sparse Test Unknown
1: Use Lemma 2.9 to decide, with probability at least $1 - \delta/2$, whether there exists some $y_i$ with $\|y_i\| \notin [1 - 2\varepsilon, 1 + 2\varepsilon]$. Reject if so.
2: Use Lemma 2.8 to obtain $\hat{\omega}$, an estimate of $\omega(S)$ within additive error $\sqrt{3k\log(m/k)}$, with probability at least $1 - \delta/2$.
3: Accept if $\hat{\omega} \le 4\sqrt{3k\log(m/k)}$, else reject.

The number of linear queries made by the tester is $O(p\varepsilon^{-2}\log(p/\delta))$ in Line 1 and $O(p\log\delta^{-1})$ in Line 2.
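A compact sketch of Algorithm 1, assuming the `estimate_width` and `estimate_sq_norm` helpers sketched after Section 2.1 are in scope; the tolerance handling and repetition counts are simplified relative to Lines 1 and 2 of the algorithm.

```python
import numpy as np

def sparse_test_unknown(Y, k, m, eps):
    """Sketch of Algorithm 1 (Sparse Test Unknown).
    Y: p x d array whose rows are the input vectors y_1, ..., y_p.
    Uses the query-based estimators estimate_width and estimate_sq_norm sketched above."""
    # Line 1: reject if some ||y_i|| falls outside [1 - 2*eps, 1 + 2*eps].
    for i in range(Y.shape[0]):
        norm_i = np.sqrt(estimate_sq_norm(Y[i]))
        if not (1 - 2 * eps <= norm_i <= 1 + 2 * eps):
            return False
    # Line 2: estimate the gaussian width of {y_1, ..., y_p} from linear queries.
    omega_hat = estimate_width(Y)
    # Line 3: accept iff the estimate is below the sparse-width threshold.
    return omega_hat <= 4 * np.sqrt(3 * k * np.log(m / k))
```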
3.1. Completeness

Assume that for each $i \in [p]$, $y_i = Ax_i$ for a matrix $A \in \mathbb{R}^{d \times m}$ satisfying $(\varepsilon, k)$-RIP and $x_i \in \mathrm{Sp}^m_k$. By the definition of RIP, we know that $1 - \varepsilon \le \|y_i\| \le 1 + \varepsilon$, so Line 1 of the algorithm passes with probability at least $1 - \delta/2$. From Lemma 1.9, we know that $\omega(\{x_1, \dots, x_p\}) \le 2\sqrt{3k\log(m/k)}$. Lemma 3.1 shows that the gaussian width of $S$ is approximately the same; its proof, deferred to the appendix (Section B.4), uses Slepian's Lemma (Lemma B.3).

Lemma 3.1. Let $X \subseteq S^{m-1}$ be a finite set, and let $S \subseteq \mathbb{R}^d$ be an $\varepsilon$-isometric embedding of $X$. Then
$$(1 - \varepsilon)\,\omega(X) \le \omega(S) \le (1 + \varepsilon)\,\omega(X). \qquad (4)$$

Hence, the gaussian width of $\{y_1, \dots, y_p\}$ is at most $2(1 + \varepsilon)\sqrt{3k\log(m/k)}$. Taking into account the additive error in Line 2, we see that with probability at least $1 - \delta/2$,
$$\hat{\omega} \le (3 + 2\varepsilon)\sqrt{3k\log(m/k)} \le 4\sqrt{3k\log(m/k)}.$$
Hence, the tester accepts with probability at least $1 - \delta$.

3.2. Soundness

As mentioned before, in order to prove soundness we need to show that whenever the gaussian width of the set $S$ is small, $S$ is close to some sparse point-set. Let $\omega^* = 4\sqrt{3k\log\frac{m}{k}}$. We break the analysis into two cases.

Case (i): $\omega^* \ge (\varepsilon/C)^2\sqrt{d}$. In this case, we use the fact that random projections of discretized sparse point-sets (Definition A.1) form an appropriate cover of $S$. This is formalized in the following theorem, which in a sense is an approximate inverse of Gordon's Theorem for sparse vectors:

Theorem 3.2. Given $\varepsilon > 0$ and integers $C, d, k$ and $m$, let $n = O\big(\frac{k}{\varepsilon^2}\log(m/k)\big)$. Suppose $m \ge k/\varepsilon^8$. Let $\Phi : \mathbb{R}^m \to \mathbb{R}^n$ be drawn from $\frac{1}{\sqrt{n}} N(0,1)^{n \times m}$. Then, for $\ell = O(k\varepsilon^{-4})$, with high probability, the set $\Phi_{\mathrm{norm}}(c \cdot \mathrm{Sp}^m_\ell)$ is an $O(\varepsilon^{1/4})$-cover of $S^{n-1}$, where $\Phi_{\mathrm{norm}}(x) = \Phi(x)/\|\Phi(x)\|_2$.

The proof of the above theorem is deferred to Section A. From the choice of parameters in this case, $d$ is small enough that the above theorem applies with $n = d$. Therefore, using the above theorem, we know that there exists an $(\varepsilon, k)$-RIP matrix $\Phi \in \mathbb{R}^{d \times m}$ such that $\Phi_{\mathrm{norm}}(\mathrm{Sp}^m_\ell)$ is an $O(\varepsilon^{1/4})$-cover of $S^{d-1}$ (and therefore an $O(\varepsilon^{1/4})$-cover of $S$). Consequently, there exists $X \in \mathbb{R}^{m \times p}$ such that $Y = \Phi X + E$, where the columns of $X$ and $E$ satisfy the respective $\ell_0$ and $\ell_2$ upper bounds.

Case (ii): $\omega^* \le (\varepsilon/C)^2\sqrt{d}$. In this case, we use the following result on the concentration of the $\ell_\infty$-norm:

Lemma 3.3. Given $S \subseteq S^{d-1}$, we have
$$\mathbb{E}_{R \sim O_d}\Big[\max_{y \in R(S)} \|y\|_\infty\Big] \le C\,\frac{\omega(S)}{\sqrt{d}},$$
where $O_d$ is the orthogonal group in $\mathbb{R}^d$, i.e., $R$ is a uniformly random rotation.

Although this concentration bound is known, for completeness we give a proof in the appendix (Section B.7). From the above lemma, it follows that there exists $R \in O_d$ such that for every $z \in Z := R(S)$ we have $\|z\|_\infty \le \varepsilon^2$, and therefore $Y = R^{-1}Z$. Furthermore, since $R$ is orthogonal, the matrix $R^{-1}$ is also orthogonal, and therefore it satisfies $(\varepsilon, k)$-RIP. To complete the proof, we observe that even though the given factorization has inner dimension $d$, we can trivially extend it to one with inner dimension $m$. This can be done by constructing $\Phi = [R^{-1} \mid G]$ with $G \sim \frac{1}{\sqrt{d}} N(0,1)^{d \times (m-d)}$. Since $\omega^* \ll \sqrt{d}$ in this case, it follows from Theorem 2.4 that with high probability $G$ (and consequently $\Phi$) satisfies $(\varepsilon, k)$-RIP. Finally, we construct $\hat{Z} \in \mathbb{R}^{m \times p}$ by padding $Z$ with $m - d$ rows of zeros. By construction, $Y = \Phi\hat{Z}$, where for every $i \in [p]$ we have $\|\hat{z}_i\|_\infty \le \varepsilon^2$. Hence the claim follows.

4. Analysis for the Known Design Setting

In this section, we describe and analyze the tester for the known design matrix case. The algorithm itself is a simple convex-hull membership test, which can be solved using a linear program.

Algorithm 2: Sparse Test, Known Design
1: Set $n = 100\,k\log\frac{m}{\delta}$, and sample a projection matrix $\Phi \sim \frac{1}{\sqrt{n}} N(0,1)^{n \times d}$.
2: Observe the linear sketch $\tilde{y} = \Phi(y)$.
3: Let $A' = A \cup -A$.
4: Accept iff $\tilde{y} \in \sqrt{k}\,\mathrm{conv}(\Phi(A'))$.

We now prove the completeness and soundness guarantees of the above tester. The running time bound follows because convex-hull membership reduces to linear programming.

Completeness. Let $y = Ax$, where $A \in \mathbb{R}^{d \times m}$ is an arbitrary matrix with $\|a_i\| = 1$ for every $i \in [m]$, and where $\|x\|_2 = 1$ and $\|x\|_0 \le k$. By Cauchy-Schwarz, $\|x\|_1 \le \sqrt{k}$. Hence, $y \in \sqrt{k}\,\mathrm{conv}(A')$. Since $\Phi : \mathbb{R}^d \to \mathbb{R}^n$ is a linear transformation, we have $\Phi(y) \in \sqrt{k}\,\mathrm{conv}(\Phi(A'))$. Therefore, the tester accepts with probability 1.

Soundness. Consider the set $\mathcal{A}_{\varepsilon/\sqrt{k}}$, defined as the set of all $(2k/\varepsilon^2)$-uniform convex combinations of $\sqrt{k}\,(A')$, i.e.,
$$\mathcal{A}_{\varepsilon/\sqrt{k}} = \Big\{\tfrac{\varepsilon^2}{2k}\sum_{v_i \in \Omega} v_i \;:\; \Omega \text{ is a multiset of } \sqrt{k}\,(A') \text{ of size } 2k/\varepsilon^2\Big\}.$$
Then, from the approximate Carathéodory theorem, it follows that $\mathcal{A}_{\varepsilon/\sqrt{k}}$ is an $\varepsilon$-cover of $\sqrt{k}\,\mathrm{conv}(A')$. Furthermore, $|\mathcal{A}_{\varepsilon/\sqrt{k}}| \le (2m)^{2k/\varepsilon^2}$. By our choice of $n$, with probability at least $1 - \delta/2$, the set $\Phi(\{y\} \cup \mathcal{A}_{\varepsilon/\sqrt{k}})$ is $\varepsilon$-isometric to $\{y\} \cup \mathcal{A}_{\varepsilon/\sqrt{k}}$. Again, by the approximate Carathéodory theorem, the set $\Phi(\mathcal{A}_{\varepsilon/\sqrt{k}})$ is an $\varepsilon$-cover of $\sqrt{k}\,\mathrm{conv}(\Phi(A'))$.

Now suppose the test accepts $y$ with probability at least $\delta$. Then, with probability at least $\delta/2$, the test accepts and the above $\varepsilon$-isometry condition holds simultaneously. In that case, $\tilde{y} \in \sqrt{k}\,\mathrm{conv}(\Phi(A'))$, and hence
$$\mathrm{dist}\big(\tilde{y}, \Phi(\mathcal{A}_{\varepsilon/\sqrt{k}})\big) \overset{(1)}{\le} \varepsilon \quad\text{and}\quad \mathrm{dist}\big(y, \mathcal{A}_{\varepsilon/\sqrt{k}}\big) \overset{(2)}{\le} \varepsilon(1-\varepsilon)^{-1} \le 2\varepsilon,$$
so that $\mathrm{dist}\big(y, \sqrt{k}\,\mathrm{conv}(A')\big) \le 2\varepsilon$, where step (1) follows from the $\varepsilon$-cover guarantee of $\Phi(\mathcal{A}_{\varepsilon/\sqrt{k}})$ and step (2) follows from the $\varepsilon$-isometry guarantee. Invoking the approximate Carathéodory theorem, we get that there exists $\hat{y} = A\hat{x} \in \sqrt{k}\,\mathrm{conv}(A')$ such that $\|\hat{x}\|_0 \le O(k/\varepsilon^2)$ and $\|\hat{y} - y\| \le O(\varepsilon)$. This completes the soundness direction.
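The convex-hull membership test in Line 4 reduces to a linear-programming feasibility problem. The sketch below is one way to set it up with `scipy.optimize.linprog`; the inequality $\sum_j \lambda_j \le 1$ is equivalent to equality here because $\mathrm{conv}(A')$ is symmetric and contains the origin, the choice of `n` in the example is illustrative rather than the $100\,k\log(m/\delta)$ of Line 1, and in finite precision one may need to relax the equality constraints slightly.

```python
import numpy as np
from scipy.optimize import linprog

def sparse_test_known(A, y, k, n, seed=None):
    """Sketch of Algorithm 2: accept iff the sketched input lies in
    sqrt(k) * conv(Phi(A')), with A' = A u -A, via an LP feasibility check."""
    rng = np.random.default_rng(seed)
    d, m = A.shape
    Phi = rng.standard_normal((n, d)) / np.sqrt(n)   # projection matrix (Line 1)
    y_sk = Phi @ y                                   # observed linear sketch (Line 2)
    Aprime = np.hstack([A, -A])                      # A' = A u -A (Line 3)
    # Feasibility LP: lambda >= 0, sum(lambda) <= 1, sqrt(k) * (Phi A') lambda = Phi y.
    PA = np.sqrt(k) * (Phi @ Aprime)
    res = linprog(
        c=np.zeros(2 * m),
        A_ub=np.ones((1, 2 * m)), b_ub=[1.0],
        A_eq=PA, b_eq=y_sk,
        bounds=(0, None), method="highs",
    )
    return res.status == 0   # status 0: feasible (accept); status 2: infeasible (reject)

# Example: y generated from a 4-sparse, unit-norm coefficient vector.
rng = np.random.default_rng(3)
A = rng.standard_normal((100, 400))
A /= np.linalg.norm(A, axis=0)
x = np.zeros(400)
idx = rng.choice(400, 4, replace=False)
x[idx] = rng.standard_normal(4)
x /= np.linalg.norm(x)
print(sparse_test_known(A, A @ x, k=4, n=60))  # should typically accept (True)
```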
5. Experimental Results

Our algorithm for the unknown design setting is based on the principle that the property of sparse representability in some basis admits an approximate characterization in terms of gaussian width. This section provides experimental evidence which supplements our theoretical results.

For the empirical study, we use the classic Barbara image (which is of size 512 x 512 pixels). Specifically, we consider 9 sub-images of size 100 x 100 pixels each (see Figure 1). For each such sub-image, we compute a matrix representation (by the standard technique of subdividing the images into patches; see, e.g., (Elad & Aharon, 2006)). In particular, each sub-image is represented as a matrix $Y$ of dimension $64 \times 8649$. Then, for each matrix $Y$ corresponding to a sub-image, we estimate the gaussian width of the $\ell_2$-column-normalized matrix. In addition, setting the number of atoms $m = 100$ and the sparsity $k = 10$, we run the k-SVD algorithm for 50 iterations and record the reconstruction error. (For a matrix $Y \in \mathbb{R}^{d \times n}$ approximated by an overcomplete basis $A$ and coefficient matrix $X$, the reconstruction error equals $\|Y - AX\|_F^2/(n \cdot d)$.)

[Figure 1: Sub-images used as data points.]
[Figure 2: Correlation of gaussian width and reconstruction error.]

Figure 2 shows the comparison between gaussian width and reconstruction error, in which we observe an approximate correlation between the two quantities. In particular, for sub-images 2, 7 and 8, which mostly consist of background, both the gaussian width and the reconstruction error are small. On the other hand, sub-images 3, 6 and 9, which contain intricate patterns and objects, have large gaussian width as well as large reconstruction error. Consequently, we can deduce that for sub-images with large gaussian width, achieving a low reconstruction error would require a larger number of atoms $m$ or a larger sparsity $k$.
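A rough sketch of the pipeline behind Figure 2, under stated assumptions: the gaussian width of the normalized columns is estimated by Monte Carlo, and scikit-learn's `DictionaryLearning` is used as a stand-in for the k-SVD algorithm used in the paper. The synthetic `Y` below stands in for the $64 \times 8649$ patch matrices extracted from the Barbara sub-images, and all parameter choices are illustrative.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def gaussian_width_estimate(Y, n_trials=200, seed=0):
    """Monte Carlo estimate of the gaussian width of the set of columns of Y."""
    rng = np.random.default_rng(seed)
    return float(np.mean([np.max(rng.standard_normal(Y.shape[0]) @ Y)
                          for _ in range(n_trials)]))

def reconstruction_error(Y, m=100, k=10, seed=0):
    """Reconstruction error ||Y - A X||_F^2 / (n*d) with a learned m-atom, k-sparse code.
    DictionaryLearning serves here as a stand-in for k-SVD."""
    d, n = Y.shape
    dl = DictionaryLearning(n_components=m, transform_algorithm="omp",
                            transform_n_nonzero_coefs=k, max_iter=50, random_state=seed)
    codes = dl.fit_transform(Y.T)        # samples are the columns of Y
    recon = codes @ dl.components_       # shape (n, d)
    return float(np.linalg.norm(Y.T - recon) ** 2 / (n * d))

# Y: a d x n patch matrix with unit-norm columns (e.g., d = 64 per sub-image).
Y = np.random.default_rng(0).standard_normal((64, 500))
Y /= np.linalg.norm(Y, axis=0)
print(gaussian_width_estimate(Y), reconstruction_error(Y, m=100, k=10))
```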
6. Conclusion and Open Questions

In this paper, we studied the problem of testing sparsity with respect to unknown and known bases. While the optimization variants of these problems (namely dictionary learning and sparse recovery) are known to be NP-hard in the worst case, our results show that, under appropriate relaxations, these problems admit efficient property testing algorithms. Future work includes designing testing algorithms for sparsity over an unknown basis with stronger guarantees, as well as developing impossibility results. We also hope that this paper leads to the study of property testing for other widely studied hypotheses in machine learning, such as nonnegative rank and VC-dimension.

Acknowledgements

We would like to thank David Woodruff for showing us the sketching-based tester described in Section 1.2.

References

Agarwal, A., Anandkumar, A., Jain, P., Netrapalli, P., and Tandon, R. Learning sparsely used overcomplete dictionaries. In Proc. 27th Annual ACM Workshop on Computational Learning Theory, pp. 123-137, 2014.

Alon, N., Matias, Y., and Szegedy, M. The space complexity of approximating the frequency moments. In Proc. 28th Annual ACM Symposium on the Theory of Computing, pp. 20-29. ACM, 1996.

Alon, N., Lee, T., Shraibman, A., and Vempala, S. The approximate rank of a matrix and its algorithmic applications. In Proc. 45th Annual ACM Symposium on the Theory of Computing, pp. 675-684. ACM, 2013.

Amelunxen, D., Lotz, M., McCoy, M. B., and Tropp, J. A. Living on the edge: A geometric theory of phase transitions in convex optimization. CoRR, abs/1303.6672, 2013.

Arora, S., Ge, R., and Moitra, A. New algorithms for learning incoherent and overcomplete dictionaries. In Proc. 27th Annual ACM Workshop on Computational Learning Theory, pp. 779-806, 2014.

Babai, L., Fortnow, L., and Lund, C. Non-deterministic exponential time has two-prover interactive protocols. Computational Complexity, 1(1):3-40, 1991.

Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

Blum, A., Har-Peled, S., and Raichel, B. Sparse approximation via generating point sets. In Proc. 27th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 548-557. SIAM, 2016.

Blum, M., Luby, M., and Rubinfeld, R. Self-testing/correcting with applications to numerical problems. J. Comp. Sys. Sci., 47:549-595, 1993. Earlier version in STOC '90.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013. ISBN 9780199535255.

Bourgain, J., Dirksen, S., and Nelson, J. Toward a unified theory of sparse dimensionality reduction in euclidean space. Geometric and Functional Analysis, 25(4):1009-1088, 2015.

Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805-849, 2012.

Chierichetti, F., Gollapudi, S., Kumar, R., Lattanzi, S., Panigrahy, R., and Woodruff, D. P. Algorithms for ℓp low-rank approximation. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, pp. 806-814, 2017.

Czumaj, A., Sohler, C., and Ziegler, M. Property testing in computational geometry. In Proc. 8th European Symposium on Algorithms, pp. 155-166. Springer, 2000.

Dasgupta, A., Kumar, R., and Sarlós, T. A sparse Johnson-Lindenstrauss transform. In Proc. 42nd Annual ACM Symposium on the Theory of Computing, pp. 341-350. ACM, 2010.

Elad, M. and Aharon, M. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736-3745, 2006.

Foster, D., Karloff, H., and Thaler, J. Variable selection is hard. In Conference on Learning Theory, pp. 696-709, 2015.

Goldreich, O., Goldwasser, S., and Ron, D. Property testing and its connection to learning and approximation. J. ACM, 45:653-750, 1998.

Gordon, Y. Some inequalities for gaussian processes and applications. Israel Journal of Mathematics, 50(4):265-289, 1985.

Hastie, T., Tibshirani, R., and Wainwright, M. Statistical learning with sparsity: the lasso and generalizations. CRC Press, 2015.

Kane, D. M. and Nelson, J. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM (JACM), 61(1):4, 2014.

Kane, D. M., Nelson, J., and Woodruff, D. P. An optimal algorithm for the distinct elements problem. In Proc. 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 41-52. ACM, 2010.

Krauthgamer, R. and Sasson, O. Property testing of data dimensionality. In Proc. 14th ACM-SIAM Symposium on Discrete Algorithms, pp. 18-27. Society for Industrial and Applied Mathematics, 2003.

McGregor, A. Graph stream algorithms: a survey. ACM SIGMOD Record, 43(1):9-20, 2014.

Mendelson, S. and Vershynin, R. Entropy, combinatorial dimensions and random averages. In Computational Learning Theory, 15th Annual Conference on Computational Learning Theory, COLT 2002, Sydney, Australia, July 8-10, 2002, Proceedings, pp. 14-28, 2002.
Natarajan, B. K. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227-234, 1995.

Razaviyayn, M., Tseng, H.-W., and Luo, Z.-Q. Dictionary learning for sparse representation: Complexity and algorithms. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5247-5251. IEEE, 2014.

Ron, D. Property testing: A learning theory perspective. Foundations and Trends in Machine Learning, 1(3):307-402, 2008.

Rubinfeld, R. and Shapira, A. Sublinear time algorithms. In Proc. International Congress of Mathematicians 2006, volume 3, pp. 1095-1110, 2006.

Rubinfeld, R. and Sudan, M. Robust characterizations of polynomials with applications to program testing. SIAM J. on Comput., 25:252-271, 1996.

Rudelson, M. and Vershynin, R. On sparse reconstruction from fourier and gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025-1045, 2008.

Slepian, D. The one-sided barrier problem for gaussian noise. The Bell System Technical Journal, 41(2):463-501, 1962.

Spielman, D. A., Wang, H., and Wright, J. Exact recovery of sparsely-used dictionaries. In Proc. 25th Annual ACM Workshop on Computational Learning Theory, pp. 37.1-37.18, 2012.

Tillmann, A. M. On the computational intractability of exact and approximate dictionary learning. IEEE Signal Processing Letters, 22(1):45-49, 2015.

Vershynin, R. Lectures in geometric functional analysis. Preprint, University of Michigan, 2011.

Vershynin, R. Estimation in high dimensions: a geometric perspective. In Sampling Theory, a Renaissance, pp. 3-66. Springer, 2015.

Vershynin, R. High dimensional probability, 2016.

Woodruff, D. P. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.