# Multiscale Dictionary Learning: Non-Asymptotic Bounds and Robustness

Journal of Machine Learning Research 17 (2016) 1-51. Submitted 5/14; Published 2/16.

Mauro Maggioni (mauro@math.duke.edu), Departments of Mathematics, Electrical and Computer Engineering, and Computer Science, Duke University, Durham, NC 27708, USA

Stanislav Minsker (minsker@usc.edu), Department of Mathematics, University of Southern California, Los Angeles, CA 90089, USA

Nate Strawn (nate.strawn@georgetown.edu), Department of Mathematics and Statistics, Georgetown University, Washington D.C., 20057, USA

Editor: Benjamin Recht

Abstract

High-dimensional datasets are well-approximated by low-dimensional structures. Over the past decade, this empirical observation motivated the investigation of detection, measurement, and modeling techniques to exploit these low-dimensional intrinsic structures, yielding numerous implications for high-dimensional statistics, machine learning, and signal processing. Manifold learning (where the low-dimensional structure is a manifold) and dictionary learning (where the low-dimensional structure is the set of sparse linear combinations of vectors from a finite dictionary) are two prominent theoretical and computational frameworks in this area. Despite their ostensible distinction, the recently introduced Geometric Multi-Resolution Analysis (GMRA) provides a robust, computationally efficient, multiscale procedure for simultaneously learning manifolds and dictionaries. In this work, we prove non-asymptotic probabilistic bounds on the approximation error of GMRA for a rich class of data-generating statistical models that includes noisy manifolds, thereby establishing the theoretical robustness of the procedure and confirming empirical observations. In particular, if a dataset aggregates near a low-dimensional manifold, our results show that the approximation error of GMRA is completely independent of the ambient dimension. Our work therefore establishes GMRA as a provably fast algorithm for dictionary learning with approximation and sparsity guarantees. We include several numerical experiments confirming these theoretical results, and our theoretical framework provides new tools for assessing the behavior of manifold learning and dictionary learning procedures on a large class of interesting models.

Keywords: dictionary learning, multi-resolution analysis, manifold learning, robustness, sparsity

1. Introduction

In many high-dimensional data analysis problems, the existence of efficient data representations can dramatically boost the statistical performance and the computational efficiency of learning algorithms. Conversely, in the absence of efficient representations, the curse of dimensionality implies that required sample sizes must grow exponentially with the ambient dimension, which ostensibly renders many statistical learning tasks completely untenable. Parametric statistical modeling seeks to resolve this difficulty by restricting the family of candidate distributions for the data to a collection of probability measures indexed by a finite-dimensional parameter. By contrast, nonparametric statistical models are more flexible and oftentimes more precise, but usually require data samples of large sizes unless the data exhibits some simple latent structure (e.g., some form of sparsity).
Such structural considerations are essential for establishing convergence rates, and oftentimes these structural considerations are geometric in nature. One classical geometric assumption asserts that the data, modeled as a set of points in RD, in fact lies on (or perhaps very close to) a single d-dimensional affine subspace V RD where d D. Tools such as PCA (see Hotelling, 1933, 1936; Pearson, 1901) estimate V in a stable fashion under suitable assumptions. Generalizing this model, one may assert that the data lies on a union of several low-dimensional affine subspaces instead of just one, and in this case the estimation of the multiple affine subspaces from data samples already inspired intensive research due to its subtle complexity (e.g., see Chen and Lerman, 2009; Chen and Maggioni, 2011; Elhamifar and Vidal, 2009; Fischler and Bolles, 1981; Ho et al., 2003; Liu et al., 2010; Ma et al., 2007, 2008; Sugaya and Kanatani, 2004; Tipping and Bishop, 1999; Vidal et al., 2005; Yan and Pollefeys, 2006; Zhang et al., 2010). A widely used form of this model is that of k-sparse data, where there exists a dictionary (i.e., a collection of vectors) Φ = {ϕi}m i=1 RD such that each observed data point x Rd may be expressed as a linear combination of at most k D elements of Φ. These sparse representations offer great convenience and expressivity for signal processing tasks (such as in Peyr e, 2009; Protter and Elad, 2007), compressive sensing, statistical estimation, and learning (e.g., see Aharon et al., 2006; Candes and Tao, 2007; Chen et al., 1998; Donoho, 2006; Kreutz-Delgado et al., 2003; Lewicki et al., 1998; Maurer and Pontil, 2010, among others), and even exhibits connections with representations in the visual cortex (see Olshausen and Field, 1997). In geometric terminology, such sparse representations are generally attainable when the local intrinsic dimension of the observations is small. For these applications, the dictionary is usually assumed to be known a priori, instead of being learned from the data, but it has been recognized in the past decade that data-dependent dictionaries may perform significantly better than generic dictionaries even in classical signal processing tasks. The k-sparse data model motivates a large amount of research in dictionary learning, where Φ is learned from data rather than being fixed in advance: given n samples X1, . . . , Xn from a probability distribution µ in RD representing the training data, an algorithm learns a dictionary bΦ which provides sparse representations for the observations sampled from µ. This problem and its optimal algorithmic solutions are far from being well-understood, at least compared to the understanding that we have for classical dictionaries such as Fourier, wavelets, curvelets, and shearlets. These dictionaries arise in computational harmonic analysis approaches to image processing, and Donoho (1999) (for example) provides rigorous, Multiscale Dictionary Learning optimal approximation results for simple classes of images. The work of Gribonval et al. (2013) present general bounds for the complexity of learning the dictionaries (see also Maurer and Pontil, 2010; Vainsencher et al., 2011, and references therein). The algorithms used in dictionary learning are often computationally demanding, and many of them are based on high-dimensional non-convex optimization (Mairal et al., 2010). 
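To make the k-sparse data model above concrete, the following is a minimal, self-contained sketch; it is not taken from the paper, and the function names, dictionary, and parameter values are purely illustrative. Data are generated as x = Φα with at most k nonzero coefficients and then re-encoded against the same dictionary by a simple greedy pursuit; in the dictionary learning problem discussed above, Φ would of course be unknown and estimated from the samples.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): data from the k-sparse model
# x = Phi @ alpha with ||alpha||_0 <= k, re-encoded by a greedy pursuit against
# the (here, known) dictionary Phi.
rng = np.random.default_rng(0)
D, m, k, n = 50, 200, 3, 100                 # ambient dim, atoms, sparsity, samples
Phi = rng.standard_normal((D, m))
Phi /= np.linalg.norm(Phi, axis=0)           # unit-norm atoms

def sample_k_sparse():
    alpha = np.zeros(m)
    support = rng.choice(m, size=k, replace=False)
    alpha[support] = rng.standard_normal(k)
    return Phi @ alpha

def omp(x, Phi, k):
    """Orthogonal matching pursuit: greedily pick k atoms, refit by least squares."""
    residual, support = x.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], x, rcond=None)
        residual = x - Phi[:, support] @ coef
    return support, coef

errors = []
for _ in range(n):
    x = sample_k_sparse()
    support, coef = omp(x, Phi, k)
    errors.append(np.linalg.norm(x - Phi[:, support] @ coef))
print("mean reconstruction error:", np.mean(errors))   # ~0 when the atoms are recovered
```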
The emphasis of existing work is often made on the generality of the approach, where minimal assumptions are made on geometry of the distribution from which the sample is generated. These pessimistic techniques incur bounds dependent upon the ambient dimension D in general (even in the standard case of data lying on one hyperplane). A different type of geometric assumption on the data gives rise to manifold learning, where the observations aggregate on a suitably regular manifold M of dimension d isometrically embedded in RD (notable works include Belkin and Niyogi, 2003; Coifman et al., 2005a,b; Coifman and Maggioni, 2006; Donoho and Grimes, 2002, 2003; Fefferman et al.; Genovese et al., 2012b; Jones et al., 2008, 2010; Little et al., 2012, 2009; Roweis and Saul, 2000; Tenenbaum et al., 2000; Zhang and Zha, 2002, among others). This setting has been recognized as useful in a variety of applications (e.g. Causevic et al., 2006; Coifman et al., 2006; Rahman et al., 2005), influencing work in the applied mathematics and machine learning communities during the past several years. It has also been recognized that in many cases the data does not naturally aggregate on a smooth manifold (as in Little et al., 2012, 2009; Wakin et al., 2005), with examples arising in imaging that contradict the smoothness conditions. While this phenomenon is not as widely recognized as it probably could be, we believe that it is crucial to develop methods (both for dictionary and manifold learning) that are robust not only to noise, but also to modeling error. Such concerns motivated the work on intrinsic dimension estimation of noisy data sets (see Little et al., 2012, 2009), where smoothness of the underlying distribution of the data is not assumed, but only certain natural conditions (possibly varying with the scale of the data) are imposed. The central idea of the aforementioned works is to perform the multiscale singular value decomposition (SVD) of the data, an approach inspired by the works of David and Semmes (1993) and Jones (1990) in classical geometric measure theory. These techniques were further extended in several directions in the papers by Chen and Maggioni (2011); Chen et al. (2011a,b), while Allard et al. (2012); Chen and Maggioni (2010) built upon this work to construct multiscale dictionaries for the data based on the idea of Geometric Multi-Resolution Analysis (GMRA). Until these recent works introduced the GMRA construction, connections between dictionary learning and manifold learning had not garnered much attention in the literature. These papers showed that, for intrinsically low-dimensional data, one may perform dictionary learning very efficiently by exploiting the underlying geometry, thereby illuminating the relationship between manifold learning and dictionary learning. In these papers, it was demonstrated that, in the infinite sample limit and under a manifold model assumption for the distribution of the data (with mild regularity conditions for the manifold), the GMRA algorithm efficiently learns a dictionary in which the data admits sparse representations. More interestingly, the examples in that paper show that the GMRA construction succeeds on real-world data sets which do not admit a structure consistent with the smooth manifold modeling assumption, suggesting that the GMRA construction exhibits robustness to modeling error. 
This desirable behavior follows naturally from design decisions; GMRA Maggioni, Minsker, and Strawn combines two elements that add stability: a multiscale decomposition and localized SVD. Similar ideas appeared in work applying dictionary learning to computer vision problems, for example in the paper by Yu et al. (2009), where local linear approximations are used to create dictionaries. These techniques appeared at roughly the same time as GMRA (Chen and Maggioni (2010)), but were not multiscale in nature, and the selection of the local scale is crucial in applications. These techniques also lacked any finite or infinite sample guarantees, nor considered the effect of noise. They were however successfully applied in computer vision problems, most notably in the Pascal 2007 challenge. In this paper, we analyze the finite sample behavior of (a slightly modified version of) that construction, and prove strong finite-sample guarantees for its behavior under general conditions on the geometry of a probability distribution generating the data. In particular, we show that these conditions are satisfied when the probability distribution is concentrated near a manifold, which robustly accounts for noise and modeling errors. In contrast to the pessimistic bounds mentioned above, the bounds that we prove only depend on the intrinsic dimension of the data. It should be noted that our method of proof produces non-asymptotic bounds, and requires several explicit geometric arguments not previously available in the literature (at least to the best of our knowledge). Some of our geometric bounds could be of independent interest to the manifold learning community. The GMRA construction is therefore proven to simultaneously learn manifolds (in sense that it outputs a suitably close approximation to points on a manifold) and dictionaries in which data are represented sparsely. Moreover, the construction is guaranteed to be robust with respect to noise and to the perturbations of the manifold model. The GMRA construction is fast, linear in the size of the data matrix, inherently online, does not require nonlinear optimization, and is not iterative. Finally, our results may be combined with recent GMRA compressed sensing techniques and algorithms presented in Iwen and Maggioni (2013), yielding both a method to learn a dictionary in a stable way on a finite set of training data, and a way of performing compressive sensing and reconstruction (with guarantees) from a small number of (suitable) linear projections (again without the need for expensive convex optimization). This paper is organized as follows: Section 2 introduces the main definitions and notation employed throughout the paper. Section 3 explains the main contributions, formally states the results and provides comparison with existing literature. Finally, Sections 4 and 5 are devoted to the proofs of our main results, Theorem 2 and Theorem 7. 2. Geometric Multi-Resolution Analysis (GMRA) This section describes the main results of the paper, starting in a somewhat informal form. The statements will be made precise in the course of the exposition. In the statements below, and denote inequalities up to multiplicative constants and logarithmic factors. Statement of results. Let σ 0 be a fixed small constant, and let ε σ be given. Suppose that n ε (1+d/2), and let Xn = {X1, . . . , Xn} be an i.i.d. 
sample from $\Pi$, a probability distribution with density supported in a tube of radius $\sigma$ around a smooth closed $d$-dimensional manifold $\mathcal{M} \hookrightarrow \mathbb{R}^D$, with $d > 1$. There exists an algorithm that, given $\mathcal{X}_n$, outputs the following objects: a dictionary $\hat\Phi_\varepsilon = \{\hat\varphi_i\}_{i \in J_\varepsilon} \subset \mathbb{R}^D$; a nonlinear encoding operator $\hat{\mathcal{D}}_\varepsilon : \mathbb{R}^D \to \mathbb{R}^{J_\varepsilon}$ which takes $x \in \mathbb{R}^D$ and returns the coefficients of its approximation by the elements of $\hat\Phi_\varepsilon$; a decoding operator $\hat{\mathcal{D}}_\varepsilon^{-1} : \mathbb{R}^{J_\varepsilon} \to \mathbb{R}^D$ which maps a sequence of coefficients to an element of $\mathbb{R}^D$. Moreover, the following properties hold with high probability:

i. $\mathrm{Card}(J_\varepsilon) \lesssim \varepsilon^{-d/2}$;

ii. the image of $\hat{\mathcal{D}}_\varepsilon$ is contained in the set $S_{d+1} \subset \mathbb{R}^{J_\varepsilon}$ of all $(d+1)$-sparse vectors (i.e., vectors with at most $d+1$ nonzero coordinates);

iii. the reconstruction error satisfies $\sup_{x \in \mathrm{support}(\Pi)} \|x - \hat{\mathcal{D}}_\varepsilon^{-1} \circ \hat{\mathcal{D}}_\varepsilon(x)\| \lesssim \varepsilon$;

iv. the time complexity for computing $\hat\Phi_\varepsilon$ is $O(C^d (D + d^2)\,\varepsilon^{-(1+d/2)} \log(1/\varepsilon))$, where $C$ is a universal constant; the time complexity of $\hat{\mathcal{D}}_\varepsilon(x)$ is $O(d(D + d\log(1/\varepsilon)))$, and that of $\hat{\mathcal{D}}_\varepsilon^{-1}(x)$ is $O(d(D + \log(1/\varepsilon)))$. If a new observation $X_{n+1}$ from $\Pi$ becomes available, $\hat\Phi_\varepsilon$ may be updated in time $O(C^d(D + d^2)\log(1/\varepsilon))$.

In other words, we can construct a data-dependent dictionary $\hat\Phi_\varepsilon$ of cardinality $O(\varepsilon^{-d/2})$ by looking at $O(\varepsilon^{-(1+d/2)})$ data points drawn from $\Pi$, such that $\hat\Phi_\varepsilon$ provides both $(d+1)$-sparse approximations to data and has expected reconstruction error of order $\varepsilon$ (with high probability). Note that encoding the $(d+1)$ non-zero coefficients requires $O((d+1)\log(\mathrm{Card}(J_\varepsilon))) = O(d^2 \log(1/\varepsilon))$ bits. Moreover, the algorithm producing this dictionary is fast and can be quickly updated if new points become available. We want to emphasize that the complexity of our construction only depends on the desired accuracy $\varepsilon$, and is independent of the total number of samples (for example, it is enough to use only the first $\varepsilon^{-(1+d/2)}$ data points). Many existing techniques in dictionary learning cannot guarantee a requested accuracy, or a given sparsity, and a certain computational cost as a function of the two. Our results above completely characterize the tradeoffs between desired precision, dictionary size, sparsity, and computational complexity for our dictionary learning procedure. We also remark that a suitable version of compressed sensing applies to the dictionary representations used in the theorem: we refer the reader to the work by Iwen and Maggioni (2013), and its applications to hyperspectral imaging by Chen et al. (2012).

2.1 Notation

For $v \in \mathbb{R}^D$, $\|v\|$ denotes the standard Euclidean norm in $\mathbb{R}^D$. $B_d(0, r)$ is the Euclidean ball in $\mathbb{R}^d$ of radius $r$ centered at the origin, and we let $B(0, r) := B_D(0, r)$. $\mathrm{Proj}_V$ stands for the orthogonal projection onto a linear subspace $V \subseteq \mathbb{R}^D$, $\dim(V)$ for its dimension, and $V^\perp$ for its orthogonal complement. For $x \in \mathbb{R}^D$, let $\mathrm{Proj}_{x+V}$ be the affine projection onto the affine subspace $x + V$ defined by $\mathrm{Proj}_{x+V}(y) = x + \mathrm{Proj}_V(y - x)$, for $y \in \mathbb{R}^D$. Given a matrix $A \in \mathbb{R}^{k \times l}$, we write $A = [a_1| \cdots |a_l]$, where $a_i$ stands for the $i$th column of $A$. The operator norm is denoted by $\|A\|$, the Frobenius norm by $\|A\|_F$, and the matrix transpose by $A^T$. If $k = l$, $\mathrm{tr}(A)$ denotes the trace. For $v \in \mathbb{R}^k$, let $\mathrm{diag}(v)$ be the $k \times k$ diagonal matrix with $(\mathrm{diag}(v))_{ii} = v_i$, $i = 1, \ldots, k$. Finally, we use $\mathrm{span}\{a_i\}_{i=1}^l$ to denote the linear span of the columns of $A$. Given a $C^2$ function $f : \mathbb{R}^l \to \mathbb{R}^k$, let $f_i$ denote the $i$th coordinate of $f$ for $i = 1, \ldots, k$, $Df(v)$ the Jacobian of $f$ at $v \in \mathbb{R}^l$, and $D^2 f_i(v)$ the Hessian of the $i$th coordinate at $v$.
We shall use $d\mathrm{Vol}$ to denote the Lebesgue measure on $\mathbb{R}^D$, and if $U \subseteq \mathbb{R}^D$ is Lebesgue measurable, $\mathrm{Vol}(U)$ stands for the Lebesgue measure of $U$. We will use $\mathrm{Vol}_{\mathcal{M}}$ to denote the volume measure on a $d$-dimensional manifold $\mathcal{M}$ in $\mathbb{R}^D$ (note that this coincides with the $d$-dimensional Hausdorff measure for the subset $\mathcal{M}$ of $\mathbb{R}^D$), $U_{\mathcal{M}}$ the uniform distribution over $\mathcal{M}$, and $d_{\mathcal{M}}(x, y)$ the geodesic distance between two points $x, y \in \mathcal{M}$. For a probability measure $\Pi$ on $\mathbb{R}^D$, $\mathrm{supp}(\Pi) := \bigcap_{C \text{ closed},\, \Pi(C)=1} C$ stands for its support. Finally, for $x, y \in \mathbb{R}$, $x \vee y := \max(x, y)$.

2.2 Definition of the Geometric Multi-Resolution Analysis (GMRA)

We assume that the data are independent, identically distributed samples from a Borel probability measure $\Pi$ on $\mathbb{R}^D$. Let $1 \leq d \leq D$ be an integer. A GMRA with respect to the probability measure $\Pi$ consists of a collection of (nonlinear) operators $\{P_j : \mathbb{R}^D \to \mathbb{R}^D\}_{j \geq 0}$. For each resolution level $j \geq 0$, $P_j$ is uniquely defined by a collection of pairs of subsets and affine projections, $\{(C_{j,k}, P_{j,k})\}_{k=1}^{N(j)}$, where the subsets $\{C_{j,k}\}_{k=1}^{N(j)}$ form a measurable partition of $\mathbb{R}^D$ (that is, members of $\{C_{j,k}\}_{k=1}^{N(j)}$ are pairwise disjoint and the union of all members is $\mathbb{R}^D$). $P_j$ is constructed by piecing together local affine projections. Namely, let $P_{j,k}(x) := c_{j,k} + \mathrm{Proj}_{V_{j,k}}(x - c_{j,k})$, where $c_{j,k} \in \mathbb{R}^D$ and $V_{j,k}$ are defined as follows. Let $\mathbb{E}_{j,k}$ stand for the expectation with respect to the conditional distribution $d\Pi_{j,k}(x) = d\Pi(x \mid x \in C_{j,k})$. Then

$$c_{j,k} = \mathbb{E}_{j,k}\, x, \qquad (1)$$
$$V_{j,k} = \mathop{\mathrm{argmin}}_{\dim(V)=d} \mathbb{E}_{j,k} \left\| x - c_{j,k} - \mathrm{Proj}_V(x - c_{j,k}) \right\|^2, \qquad (2)$$

where the minimum is taken over all linear spaces $V$ of dimension $d$. In other words, $c_{j,k}$ is the conditional mean and $V_{j,k}$ is the subspace spanned by the eigenvectors corresponding to the $d$ largest eigenvalues of the conditional covariance matrix

$$\Sigma_{j,k} = \mathbb{E}_{j,k}\left[(x - c_{j,k})(x - c_{j,k})^T\right]. \qquad (3)$$

Note that we have implicitly assumed that such a subspace $V_{j,k}$ is unique, which will always be the case throughout this paper. Given such a $\{(C_{j,k}, P_{j,k})\}_{k=1}^{N(j)}$, we define

$$P_j(x) = \sum_{k=1}^{N(j)} \mathbb{I}\{x \in C_{j,k}\}\, P_{j,k}(x),$$

where $\mathbb{I}\{x \in C_{j,k}\}$ is the indicator function of the set $C_{j,k}$. It was shown in the paper by Allard et al. (2012) that if $\Pi$ is supported on a smooth, closed $d$-dimensional submanifold $\mathcal{M} \hookrightarrow \mathbb{R}^D$, and if the partitions $\{C_{j,k}\}_{k=1}^{N(j)}$ satisfy some regularity conditions for each $j$, then, for any $x \in \mathcal{M}$, $\|x - P_j(x)\| \leq C(\mathcal{M})\, 2^{-2j}$ for all $j \geq j_0(\mathcal{M})$. This means that the operators $P_j$ provide an efficient compression scheme $x \mapsto P_j(x)$ for $x \in \mathcal{M}$, in the sense that every $x$ can be well-approximated by a linear combination of at most $d+1$ vectors from the dictionary $\Phi_{2^{-2j}}$ formed by $\{c_{j,k}\}_{k=1}^{N(j)}$ and the union of the bases of $V_{j,k}$, $k = 1, \ldots, N(j)$. Furthermore, operators efficiently encoding the difference between $P_j$ and $P_{j+1}$ were constructed, leading to a multiscale compressible representation of $\mathcal{M}$.

In practice, $\Pi$ is unknown and we only have access to the training data $\mathcal{X}_n = \{X_1, \ldots, X_n\}$, which are assumed to be i.i.d. with distribution $\Pi$. In this case, the operators $P_j$ are replaced by their estimators

$$\hat P_j(x) = \sum_{k=1}^{N(j)} \mathbb{I}\{x \in C_{j,k}\}\, \hat P_{j,k}(x),$$

where $\{C_{j,k}\}_{k=1}^{N(j)}$ is a suitable partition of $\mathbb{R}^D$ obtained from the data,

$$\hat P_{j,k}(x) := \hat c_{j,k} + \mathrm{Proj}_{\hat V_{j,k}}(x - \hat c_{j,k}), \qquad (4)$$
$$\hat c_{j,k} := \frac{1}{|\mathcal{X}_{j,k}|} \sum_{x \in \mathcal{X}_{j,k}} x, \qquad \hat V_{j,k} := \mathop{\mathrm{argmin}}_{\dim(V)=d} \sum_{x \in \mathcal{X}_{j,k}} \left\| x - \hat c_{j,k} - \mathrm{Proj}_V(x - \hat c_{j,k}) \right\|^2,$$

$\mathcal{X}_{j,k} = C_{j,k} \cap \mathcal{X}_n$, and $|\mathcal{X}_{j,k}|$ denotes the number of elements in $\mathcal{X}_{j,k}$. We shall call these $\hat P_j$ the empirical GMRA. Moreover, the dictionary $\hat\Phi_{2^{-2j}}$ is formed by $\{\hat c_{j,k}\}_{k=1}^{N(j)}$ and the union of bases of $\hat V_{j,k}$, $k = 1, \ldots, N(j)$.
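The following is a minimal NumPy sketch of the empirical quantities in (4) at a single scale $j$: given an assignment of the samples to cells (here produced by a crude nearest-center rule standing in for the cover-tree partition introduced later), each cell contributes its empirical mean $\hat c_{j,k}$ and the top-$d$ principal subspace $\hat V_{j,k}$ of its empirical covariance, and $\hat P_j$ acts as the corresponding piecewise affine projection. Function names and the toy data are ours, not the paper's implementation.

```python
import numpy as np

def fit_empirical_gmra_level(X, labels, d):
    """Sketch of (4): for each cell C_{j,k}, the local mean c_hat and the
    top-d principal subspace V_hat of the local empirical covariance."""
    pieces = {}
    for k in np.unique(labels):
        Xk = X[labels == k]                          # samples landing in C_{j,k}
        c = Xk.mean(axis=0)                          # \hat c_{j,k}
        cov = (Xk - c).T @ (Xk - c) / len(Xk)        # \hat Sigma_{j,k}
        eigvals, eigvecs = np.linalg.eigh(cov)
        pieces[int(k)] = (c, eigvecs[:, -d:])        # top-d eigenvectors as columns
    return pieces

def apply_P_hat(x, k, pieces):
    """\hat P_{j,k}(x) = \hat c_{j,k} + Proj_{\hat V_{j,k}}(x - \hat c_{j,k})."""
    c, V = pieces[k]
    return c + V @ (V.T @ (x - c))

# toy usage: a thin tube around a circle in R^10, partitioned by nearest of 16 centers
rng = np.random.default_rng(1)
t = rng.uniform(0, 2 * np.pi, 2000)
X = np.zeros((2000, 10))
X[:, 0], X[:, 1] = np.cos(t), np.sin(t)
X += 0.01 * rng.standard_normal(X.shape)             # sigma-tube around the circle
centers = X[rng.choice(len(X), 16, replace=False)]    # stand-in for cover-tree nodes
labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
pieces = fit_empirical_gmra_level(X, labels, d=1)
err = np.mean([np.linalg.norm(x - apply_P_hat(x, k, pieces)) ** 2
               for x, k in zip(X, labels)])
print("in-sample squared error:", err)
```

On this toy example the squared error is roughly of order $\sigma^2$ plus the fourth power of the cell radius, which is the shape of the bias term appearing in Assumption (A3) below.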
The encoding and decoding operators $\hat{\mathcal{D}}_{2^{-2j}}$ and $\hat{\mathcal{D}}^{-1}_{2^{-2j}}$ mentioned above are now defined in the obvious way, so that $\hat{\mathcal{D}}^{-1}_{2^{-2j}} \circ \hat{\mathcal{D}}_{2^{-2j}}(x) = \hat P_{j,k}(x)$ for any $x \in C_{j,k}$. We remark that the intrinsic dimension $d$ is assumed to be known throughout this paper. In practice, it can be estimated within the GMRA construction using the multiscale SVD ideas of Little et al. (2012, 2009). The estimation technique is based on inspecting (for a given point $x \in C_{j,k}$) the behavior of the singular values of the covariance matrix $\Sigma_{j,k}$ as $j$ varies. For alternative methods, see Camastra and Vinciarelli (2001); Levina and Bickel (2004) and references therein and in the review section of Little et al. (2012).

3. Main Results

Our main goal is to obtain probabilistic, non-asymptotic bounds on the performance of the empirical GMRA under certain structural assumptions on the underlying distribution of the data. In practice, the data rarely belongs precisely to a smooth low-dimensional submanifold. One way to relax this condition is to assume that it is sufficiently close to a reasonably regular set. Here we assume that the underlying distribution is supported in a thin tube around a manifold. We may interpret the displacement from the manifold as noise, in which case we make no assumption on the distribution of the noise besides boundedness. Another way to model this situation is to allow additive noise, whence the observations are assumed to be of the form $X = Y + \xi$, where $Y$ belongs to a submanifold of $\mathbb{R}^D$, $\xi$ is independent of $Y$, and the distribution of $\xi$ is known. This leads to a singular deconvolution problem (see Genovese et al., 2012b; Koltchinskii, 2000). Our assumptions, however, may also be interpreted as relaxing the manifold assumption: even in the absence of noise we allow the data to be not exactly supported on a manifold. Our results will elucidate how the error of sparse approximation via GMRA depends on the thickness of the tube, which quantifies the stability and robustness properties of our algorithm.

As we mentioned before, our GMRA construction is entirely data-dependent: it takes the point cloud of cardinality $n$ as an input and for every $j \in \mathbb{Z}_+$ returns the partition $\{C_{j,k}\}_{k=1}^{N(j)}$ and the associated affine projectors $\hat P_{j,k}$. We will measure the performance of the empirical GMRA by the $L_2(\Pi)$-error

$$\mathbb{E}\|X - \hat P_j(X)\|^2 := \int \|x - \hat P_j(x)\|^2\, d\Pi(x) \qquad (5)$$

or by the $(\infty, \Pi)$-error defined as

$$\|\mathrm{Id} - \hat P_j\|_{\infty,\Pi} := \sup_{x \in \mathrm{supp}(\Pi)} \|x - \hat P_j(x)\|, \qquad (6)$$

where $\hat P_j$ is defined by (4). Note, in particular, that these errors are "out-of-sample", i.e., they measure the accuracy of the GMRA representations on all possible samples, not just those used to train the GMRA, which would not correspond to a learning problem.

The presentation is structured as follows: we start from the natural decomposition

$$\|x - \hat P_j(x)\| \leq \underbrace{\|x - P_j(x)\|}_{\text{approximation error}} + \underbrace{\|P_j(x) - \hat P_j(x)\|}_{\text{random error}}$$

and state the general conditions on the underlying distribution and partition scheme that suffice to guarantee that

1. the distribution-dependent operators $P_j$ yield good approximation, as measured by $\mathbb{E}\|x - P_j(x)\|^2$: this is the (squared) bias term, which is non-random;

2. the empirical version $\hat P_j$ is with high probability close to $P_j$, so that $\mathbb{E}\|\hat P_j(x) - P_j(x)\|^2$ is small (with high probability): this is the variance term, which is random.

This leads to our first result, Theorem 2, where the error $\mathbb{E}\|x - \hat P_j(x)\|^2$ of the empirical GMRA is bounded with high probability.
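The two error measures (5) and (6) can be estimated on held-out samples by simple Monte Carlo. The sketch below (our own illustration, not the paper's code) does this for a stand-in projector given by a single global $d$-dimensional PCA plane, so that the snippet is self-contained; replacing `P_hat` by the empirical GMRA projector of the previous sketch gives the corresponding out-of-sample estimates for $\hat P_j$.

```python
import numpy as np

# Held-out estimates of the L2(Pi) error (5) and the sup-norm error (6) for a
# candidate projector; here the projector is one global affine PCA plane.
rng = np.random.default_rng(2)
t = rng.uniform(0, 2 * np.pi, 4000)
X = np.c_[np.cos(t), np.sin(t), 0.02 * rng.standard_normal((4000, 8))]  # tube in R^10
X_train, X_test = X[:2000], X[2000:]             # train / held-out split

d = 2
c = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - c, full_matrices=False)
V = Vt[:d].T                                     # top-d principal directions

def P_hat(x):                                    # affine projection onto c + span(V)
    return c + V @ (V.T @ (x - c))

residuals = np.linalg.norm(X_test - np.array([P_hat(x) for x in X_test]), axis=1)
print("L2(Pi) error estimate, eq. (5):", np.mean(residuals ** 2))
print("sup-norm error estimate, eq. (6):", residuals.max())
```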
We will state this first result in a rather general setting (assumptions A1-A4) below), and after developing this general result, we consider the special but important case where the distribution Π generating the data is supported in thin tube around a smooth submanifold, and for a (concrete, efficiently computable, online) partition scheme we show that the conditions of Theorem 2 are satisfied. This is summarized in the statement of Theorem 7, that may be interpreted as proving finite-sample bounds for our GMRA-based dictionary learning scheme for high-dimensional data that suitably concentrates around a manifold. It is important to note that most of the constants in our results are explicit. The only geometric parameters involved in the bounds are the dimension d of the manifold (but not the ambient dimension D), its reach (see τ in (9)) and the tube thickness σ. Among the existing literature, the papers Allard et al. (2012); Chen et al. (2012) introduced the idea of using multiscale geometric decomposition of data to estimate the distribution of points sampled in high-dimensions. However in the first paper no finite sample analysis was performed, and in the second the connection with geometric properties of the distribution of the data is not made explicit, with the conditions are expressed in terms of certain approximation spaces within the space of probability distributions in RD, with Wasserstein metrics used to measure distances and approximation errors. The recent paper by Canas et al. (2012) is close in scope to our work; its authors present probabilistic guarantees for approximating a manifold with a global solution of the so-called k-flats (Bradley and Mangasarian, 2000) problem in the case of distributions supported on manifolds. It is important to note, however, that our estimator is explicitly and efficiently computable, while exact global solution of k-flats is usually unavailable and certain approximations have to be used in practice, with convergence to a global minimum is conditioned on suitable unknown initializations. In practice it is often reported that there exist many local minima with very different values, and good initializations are not trivial to find. In this work we obtain better convergence rates, with fast algorithms, and we also seamlessly tackle the case of noise and model error, which is beyond what was studied previously. We consider this development extremely relevant in applications, both because real data is corrupted by noise and the assumption that data lies exactly on a smooth manifold is often unrealistic. A more detailed comparison of theoretical guarantees for k-flats and for our approach is given after we state the main results in Subsection 3.2 below. Another body of literature connected to this work studies the complexity of dictionary learning. For example, Gribonval et al. (2013) present general bounds for the convergence of global minimums of empirical risks in dictionary learning optimization problems (those results build on and generalize the works of Maurer and Pontil (2010); Vainsencher et al. (2011), among several others). While the rates obtained in those works seem to be competitive with our rates in certain regimes, the fact that their bounds must hold over entire families of dictionaries implies that those error rates generally involve a scaling constant of the order Dk, where D is the ambient dimension and k is the number of atoms in the dictionary. 
Our bounds are independent of the ambient dimension D but implicitly include terms which depend upon the number of our atoms. It should be noted that the number Maggioni, Minsker, and Strawn of atoms in the dictionary learned by GMRA increase so as to approximate the fine structure of a dataset with more precision. As such, our attainment of the minimax lower bounds for manifold estimation in the Hausdorffmetric (obtained in (Genovese et al., 2012a)) should be expected. While dictionaries produced from dictionary learning should reveal the fine structure of a dataset through careful examination of the representations they induce, these representations are often ambiguous unless additional structure is imposed on both the dictionaries and the datasets. On the other hand, the GMRA construction induces completely unambiguous sparse representations that can be used in regression and classification tasks with confidence. In the course of the proof, we obtain several results that should be of independent interest. In particular, Lemma 18 gives upper and lower bounds for the volume of the tube around a manifold in terms of the reach (7) and tube thickness. While the exact tubular volumes are given by Weyl s tube formula (see Gray, 2004), our bound are exceedingly easy to state in terms of simple global geometric parameters. For the details on numerical implementation of GMRA and its modifications, see the works by Allard et al. (2012); Chen and Maggioni (2010). 3.1 Finite Sample Bounds for Empirical GMRA In this section, we shall present the finite sample bounds for the empirical GMRA described above. For a fixed resolution level j, we first state sufficient conditions on the distribution Π and the partition {Cj,k}N(j) k=1 for which these L2(Π)-error bounds hold (see Theorem 2 below). Suppose that for all integers jmin j jmax the following is true: (A1) There exists an integer 1 d D and a positive constant θ1 = θ1(Π) such that for all k = 1, . . . , N(j), Π(Cj,k) θ12 jd . (A2) There is a positive constant θ2 = θ2(Π) such that for all k = 1, . . . , N(j), if X is drawn from Πj,k then, Π - almost surely, X cj,k θ22 j . (A3) Let λj,k 1 . . . λj,k D 0 denote the eigenvalues of the covariance matrix Σj,k (defined in (3)). Then there exist σ = σ(Π) 0, θ3 = θ3(Π), θ4 = θ4(Π) > 0, and some α > 0 such that for all k = 1 . . . N(j), λj,k d θ3 2 2j l=d+1 λj,k l θ4(σ2 + 2 2(1+α)j) 1 If in addition (A4) There exists θ5 = θ5(Π) such that Id Pj ,Π θ5 σ + 2 (1+α)j , Multiscale Dictionary Learning then the bounds are also guaranteed to hold for the ,Π-error (6). i. Assumption (A1) entails that the distribution assigns a reasonable amount of probability to each partition element, assumption (A2) ensures that samples from partition elements are always within a ball around the centroid, and assumption (A3) controls the effective dimensionality of the samples within each partition element. Assumption (A4) just assumes a bound on the error for the theoretical GMRA reconstruction. ii. Note that the constants θi, i = 1 . . . 4, are independent of the resolution level j. iii. It is easy to see that Assumption (A3) implies a bound on the local approximation error : since Pj acts on Cj,k as an affine projection on the first d principal components , we have Ej,k x Pj(x) 2 = tr h Ej,k x cj,k Proj Vj,k(x))(x cj,k Proj Vj,k(x) T i l=d+1 λj,k l θ4(σ2 + 2 2(1+α)j). iv. The parameter σ is introduced to cover noisy models, including the situations when Π is supported in a thin tube of width σ around a low-dimensional manifold M. 
Whenever Π is supported on a smooth d-dimensional manifold, σ can be taken to be 0. v. The stipulation θ4(σ2 + 2 2(1+α)j) 1 guarantees that the spectral gap λj,k d λj,k d+1 is sufficiently large. We are in position to state our main result. Theorem 2 Suppose that (A1)-(A3) are satisfied, let X, X1, . . . , Xn be an i.i.d. sample from Π, and set d := 4d2θ4 2/θ2 3. Then for any jmin j jmax and any t 1 such that t + log( d 8) 1 E X b Pj(X) 2 2θ4 σ2 + 2 2j(1+α) + c12 2j (t + log( d 8))d2 and if in addition (A4) is satisfied, Id b Pj ,Π θ5 σ + 2 (1+α)j + 2 2 2j (t + log( d 8))d2 with probability 1 2jd+1 16 n2 jd , where c1 = 2 12 2 θ3 2 θ3 θ1 + 4 Maggioni, Minsker, and Strawn 3.2 Distributions Concentrated near Smooth Manifolds Of course, the statement of Theorem 2 has little value unless assumptions (A1)-(A4) can be verified for a rich class of underlying distributions. We now introduce an important class of models and an algorithm to construct suitable partitions {Cj,k} which together satisfy these assumptions. Let M be a smooth (or at least C2, so changes of coordinate charts admit continuous second-order derivatives), closed d-dimensional submanifold of RD. We recall the definition of the reach (see Federer, 1959), an important global characteristic of M. Let D(M) = {y RD : !x M s.t. x y = inf z M z y }, (7) Mr = {y RD : inf x M x y < r}. (8) reach(M) := sup{r 0 : Mr D(M)}, (9) and we shall always use τ to denote the reach of the manifold M. Definition 3 Assume that 0 σ < τ. We shall say that the distribution Π satisfies the (τ, σ)-model assumption if there exists a smooth (or at least C2), compact submanifold M , RD with reach τ such that supp(Π) = Mσ, Π and UMσ (the uniform distribution on Mσ) are absolutely continuous with respect to each other, and so Radon-Nikodym derivative dΠ d UMσ satisfies 0 < φ1 dΠ d UMσ φ2 < UMσalmost surely . (10) Example 1 Consider the unit sphere of radius R in RD, SR. Then τ = R for this manifold, and for any σ < R, the uniform distribution on the set B(0, R + σ) \ B(0, R σ) satisfies the (σ, τ)-model assumption. On the other hand, taking the uniform distribution on a σ-thickening of the union of two line segments emanating from the origin produces a distribution which does not satisfy the (σ, τ) model assumption. In particular, τ = 0 for the underlying manifold. Remark 4 We will implicitly assume that constants φ1 and φ2 do not depend on the ambient dimension D (or depend on a slowly growing function of D, such as log D) - the bound of Theorem 7 shows that this is the interesting case . On the other hand, we often do not need the full power of (τ, σ) - model assumption, see the Remark 9 after Theorem 7. Our partitioning scheme is based on the data structure known as the cover tree introduced by Beygelzimer et al. (2006) (see also Ciaccia et al., 1997; Karger and Ruhl, 2002; Yianilos, 1993). We briefly recall its definition and basic properties. Given a set of n distinct points Sn = {x1, . . . , xn} in some metric space (S, ρ), the cover tree T on Sn satisfies the following: let Tj Sn, j = 0, 1, 2, . . . be the set of nodes of T at level j. Then Multiscale Dictionary Learning 1. Tj Tj+1; 2. for all y Tj+1, there exists z Tj such that ρ(y, z) < 2 j; 3. for all y, z Tj, ρ(y, z) > 2 j. Remark 5 Note that these properties imply the following: for any y Sn, there exists z Tj such that ρ(y, z) < 2 j+1. Theorem 3 in (Beygelzimer et al., 2006) shows that the cover tree always exists; for more details, see the aforementioned paper. 
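The sketch below is our own, purely illustrative construction (the actual cover-tree algorithm of Beygelzimer et al. (2006) maintains these invariants far more efficiently via parent-child links): it builds nested $2^{-j}$-nets satisfying properties 1-3 above, up to non-strict inequalities, together with the nearest-node assignment that is used in the next paragraph to define the cells $C_{j,k}$.

```python
import numpy as np

def greedy_cover_tree_levels(S, j_min, j_max):
    """Naive nested nets: T_j is 2^{-j}-separated, every point of S (hence every
    node of T_{j+1}) lies within 2^{-j} of T_j, and T_j is a subset of T_{j+1}.
    Purely illustrative; a real cover tree achieves this with far less work."""
    levels, prev = {}, []
    for j in range(j_min, j_max + 1):
        r = 2.0 ** (-j)
        chosen = list(prev)                          # nesting: reuse coarser nodes
        for i in range(len(S)):
            if all(np.linalg.norm(S[i] - S[c]) > r for c in chosen):
                chosen.append(i)                     # keep only 2^{-j}-separated points
        levels[j], prev = chosen, chosen
    return levels

def nearest_node_labels(X, nodes):
    """Voronoi-style assignment of each point to its nearest net node."""
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=-1)
    return np.argmin(d2, axis=1)                     # ties broken by smallest index

rng = np.random.default_rng(3)
t = rng.uniform(0, 2 * np.pi, 1000)
X = np.c_[np.cos(t), np.sin(t)]                      # 1000 points on the unit circle
levels = greedy_cover_tree_levels(X, j_min=0, j_max=4)
labels = nearest_node_labels(X, X[levels[3]])        # partition at scale 2^{-3}
print({j: len(idx) for j, idx in levels.items()})
```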
We will construct a cover tree for the collection X1, . . . , Xn of i.i.d. samples from the distribution Π with respect to the Euclidean distance ρ(x, y) := x y . Assume that Tj := Tj(X1, . . . , Xn) = {aj,k}N(j) k=1 . Define the indexing map k(x) := argmin 1 k N(j) x aj,k (ties are broken by choosing the smallest value of k), and partition RD into the Voronoi regions Cj,k = {x RD : kj(x) = k}. (11) Let ε(n, t) be the smallest ε > 0 which satisfies d β1 (log β2 + t) , (12) where β1 = Vol M(M) cosd(δ1)Vol(Bd(0,ε/4)), β2 = Vol M(M) cosd(δ2)Vol(Bd(0,ε/8)), δ1 = arcsin(ε/8τ), and δ2 = arcsin(ε/16τ). Remark 6 For large enough n, this requirement translates into n C(M, d, φ1) 1 for some constant C(M, d, φ1). We are ready to state the main result of this section. Theorem 7 Suppose that Π satisfies the (τ, σ)-model assumption. Let X1, . . . , Xn be an i.i.d. sample from Π, construct a cover tree T from {Xi}n i=1, and define Cj,k as in (11). Assume that ε(n, t) < σ. Then, for all j Z+ such that 2 j > 8σ and 3 2 j +σ < τ/8, the partition {Cj,k}N(j) k=1 and Π satisfy (A1), (A2), (A3), and (A4) with probability 1 e t Maggioni, Minsker, and Strawn θ1 = φ1Vol(Bd(0, 1)) 24d Vol M(M) θ3 = φ1/φ2 24d+8 1 + σ θ4 = 2 2334 θ5 = 2 2232 One may combine the results of Theorem 7 and Theorem 2 as follows: given an i.i.d. sample X1, . . . , Xn from Π, use the first n 2 points {X1, . . . , X n 2 } to obtain the partition {Cj,k}N(j) k=1 , while the remaining {X n 2 +1, . . . , Xn} are used to construct the operator ˆPj (see (4)). This makes our GMRA construction entirely (cover tree, partitions, affine linear projections) data-dependent. We observe that since our approximations are piecewise linear, they are insensitive to regularity of the manifold beyond first order, so the estimates saturate at α = 1. When σ is very small or equal to 0, the bounds resulting from Theorem 2 can be optimized over j to get the following statement (we present only the bounds for the L2(Π) error, but the results ,Π are similar). Corollary 8 Assume that conditions of Theorem 7 hold, and that n is sufficiently large. Then for all A 1 such that A log n c4n, the following holds: (a) if d {1, 2}, inf j Z:2 j<τ/24 E x b Pj(x) 2 C1 (b) if d 3, inf j Z:2 j<τ/24 E x b Pj(x) 2 C2 with probability 1 c3n A, where C1 and C2 depend only on A, τ, d, φ1/φ2, Vol M(M) and c3, c4 depend only on τ, d, φ1/φ2, Vol M(M). Proof In case (a), it is enough to set t := (A + 1) log n, 2 j := 16t θ1n 1/d , and apply Theorem 2. For case (b), set t := (A + 1) log n and 2 j := A log n Finally, we note that the claims ii. and iii. stated in the beginning of Section 2 easily follow Multiscale Dictionary Learning from our general results (it is enough to choose n such that ε n 2 d+2 and 2 j = ε). Claim i. follows from assumption (A1) and Theorem 7. Computational complexity bounds iv. follow from the associated computational cost estimates for the cover trees algorithm and the randomized singular value decomposition, and are discussed in detail in Sections 3 and 8 of (Allard et al., 2012). Remark 9 It follows from our proof that it is sufficient to assume a weaker (but somewhat more technical) form of (τ, σ)-model condition for the conclusion of Theorem 7 to hold. Namely, let Π be the pushforward of Π under the projection Proj M : Mσ M, and assume that there exists φ1 > 0 such that for any measurable A M Π(A) := Π Proj 1 M(A) φ1UM(A). 
Moreover, suppose that there exists φ2 > 0 such that for any y M, any set A Mσ and any τ > r 2σ such that B(y, r) Mσ A B(y, 12r), we have Π(A) φ2UMσ(A). In some circumstances, checking these two conditions is not hard (e.g., when M is a sphere, Y is uniformly distributed on M, η is spherically symmetric noise independent of Y and such that η σ, and Π is the distribution of Y + η), but (τ, σ) - assumption does not need to hold with constants φ1 and φ2 independent of D. 3.3 Connections to Previous Work and Further Remarks It is useful to compare our rates with results of Theorem 4 in (Canas et al., 2012). In particular, this theorem implies that, given a sample of size n from the Borel probability measure Π on the smooth d-dimensional manifold M, the L2(Π)-error of approximation of M by kn = C1(M, Π)nd/(2(d+4)) affine subspaces is bounded by C2(M, Π)n 2/(d+4). Here, the dependence of kn on n is optimal in a sense that it minimizes the upper bound for the risk obtained in (Canas et al., 2012). If we set σ = 0 in our results, then it easily follows from Theorems 7 and 2 that the L2(Π)-error achieved by our GMRA construction for 1 2(d+4) (so that N(j) kn to make the results comparable) is of the same order n 2 d+4 . However, this choice of j is not optimal in this case - in particular, setting 2jn n 1 d+2 , we obtain as in (13) a L2(Π)-error of order n 2 d+2 , which is a faster rate. Moreover, we also obtain results in the sup norm, and not only for mean square error. We should note that technically our results require the stronger condition (10) on the underlying measure Π, while theoretical guarantees in (Canas et al., 2012) are obtained assuming only the upper bound dΠ d UM φ2 < , where UM := d Vol M Vol M(M) is the uniform distribution over M. The rate (13) is the same (up to log-factors) as the minimax rate obtained for the problem considered in (Genovese et al., 2012a) of estimating a manifold from the samples corrupted with the additive noise that is normal to the manifold . Our theorems are stated under more general conditions, however, we only prove robustness-type results and do not address the problem of denoising. At the same time, the estimator proposed in (Genovese et al., 2012a) is (unlike our method) not suitable for applications. The paper (Genovese et al., 2012b) considers (among other problems) the noiseless case of manifold estimation Maggioni, Minsker, and Strawn under Hausdorffloss, and obtains the minimax rate of order n 2 d . Performed numerical simulation (see Section 6) suggest that our construction also appears to achieve this rate in the noiseless case. However, our main focus is on the case σ > 0. The work of Fefferman et al. establishes the sampling complexity of testing the hypothesis if an unknown distribution is close to being on a manifold (with known reach, volume, dimension) in the Mean Squared sense, is also related to the work discussed in this section, and to the present one. While our results do imply that if we have enough points, as prescribed by our main theorems, and the MSE does not decay as prescribed, then the data with high probability does not satisfy the geometric assumptions in the corresponding theorem, this is still different from the hypothesis testing problem. There may distributions not satisfying our assumptions, such that GMRA still yields good approximations: in fact we welcome and do not rule out these situations. Fefferman et al. 
also present an algorithm for constructing an approximation to the manifold; however such an algorithm does not seem easy to implement in practice. The emphasis in this work is on moving to a more general setting than the manifold setting, focusing on multiscale approaches that are robust (locally, because of SVD, as well as across scales), and fast, easily implementable algorithms. We remark that we analyze the case of one manifold M, and its perturbation in the sense of having a measure supported in a tube around M. Our construction however is multiscale and in particular local. Many extensions are immediate, for example to the case of multiple manifolds (possibly of different dimensions) with non-intersecting tubes around them. The case of unbounded noise is also of interest: if the noise has sub-Gaussian tails then very few points are outside a tube of radius dependent on the sub-Gaussian moment, and these outliers are easily disregarded as there are few and far away, so they do not affect the construction and the analysis at fine scales. Another situation is when there are many gross outliers, for example points uniformly distributed in high-dimension in, say, a cube containing M. But then the volume of such cube is so large that unless the number of points is huge (at least exponential in the ambient dimension D), almost all of these points are in fact far from each other and from M with very high probability, so that again they do affect the analysis and the algorithms. These are some of the advantages of the multiscale approach, which would otherwise have the potential of corrupting the results (or complicating the analysis of) other global algorithms, such as k-flats. 4. Preliminaries This section contains the remaining definitions and preliminary technical facts that will be used in the proofs of our main results. Given a point y on the manifold M, let Ty M be the associated tangent space, and let T y M be the orthogonal complement of Ty M in RD. We define the projection from the tube Mσ (see (8)) onto the manifold Proj M : Mσ M by Proj M(x) = argmin y M x y and note that σ < τ, together with (7), implies that Proj M is well-defined on Mσ, and Proj M(y + ξ) = y Multiscale Dictionary Learning whenever y M and ξ T y M B(0, σ). Next, we recall some facts about the volumes of parallelotopes that will prove useful in Section 5. For a matrix A Rk l with l k, we shall abuse our previous notation and let Vol(A) also denote the volume of the parallelotope formed by the columns of A. Let A and B be k l1 and k l2 matrices respectively with l1 + l2 k, and note that Vol([A | B]) Vol(A)Vol(B) where ([A | B]) denotes the concatenation of A and B into a k (l1 + l2) matrix. Moreover, if the columns of A and B are all mutually orthogonal, we clearly have that Vol([A | B]) = Vol(A)Vol(B). Assuming that I is the l1 l1 identity matrix, we have the bound Vol A I 1. The following proposition gives volume bounds for specific types of perturbations that we shall encounter. Proposition 10 Suppose Y = [y1| |yd] is symmetric d by d matrix such that Y q < 1. Then Vol I + Y X (1 + q)d Vol I X Vol I + Y XT (1 q)d Vol I XT This proof (as well as the proofs of our other supporting technical contributions) is given in the Appendix. Finally, let us recall several important geometric consequences involving the reach: Proposition 11 The following holds: i. For all x, y M such that x y τ/2, we have d M(x, y) τ τ ii. Let γ(t) : [0, 1] 7 M be the arclength-parameterized geodesic. Then γ (t) 1 τ for all t. iii. 
Let φ be the angle between Tx M and Ty M, in other words, cos(φ) := min u Tx M, u =1 max v Ty M, v =1 | u, v | . 2, then cos(φ) q iv. If x is such that x y < τ/2, then x is a regular point of Proj y+Ty M : B(y, τ/2) M y + Ty M (in other words, the Jacobian of Proj y+Ty M at x is nonsingular). v. Let y M, r < τ and A = M B(y, r). Then Bd(y, r cos(θ)) Proj y+Ty M(A), where θ = arcsin r Maggioni, Minsker, and Strawn Proof Part i. is the statement of Proposition 6.3 and part ii. - of Proposition 6.1 in (Niyogi et al., 2008). Part iii. is demonstrated in Lemma 5.4 of the same paper, and this lemma coincides with iv. Part v. is proven in Lemma 5.3 of (Niyogi et al., 2008). 5. Proofs of the Main Results The rest of the paper is devoted to the proofs of our main results. 5.1 Overview of the Proofs We begin by providing an overview of the main steps of the proofs to aid comprehension. The proof of Theorem 2 begins by invoking the bias-variance decomposition: x b Pj(x) 2 2 x Pj(x) 2 + 2 Pj(x) b Pj(x) 2. Remark 1, part iii. and the decomposition E X Pj(X) 2 = k=1 Π(Cj,k)Ej,k X Pj(X) 2 gives us the first term in the bound of Theorem 2. Note that this contribution is deterministic. The next step in the proof is to bound the stochastic error E Pj(x) b Pj(x) 2 with high probability. We start with the bound Pj(x) b Pj(x) = cj,k bcj,k + Proj Vj,k(x cj,k) Proj b Vj,k(x cj,k + cj,k bcj,k) (14) 2 cj,k bcj,k + Proj Vj,k Proj b Vj,k x cj,k . (15) for x Cj,k. We then use concentration of measure results (matrix Bernstein-type inequality) to bound the terms cj,k bcj,k and bΣj,k Σj,k with high probability. The latter bound and Assumption (A3) allows us to invoke Theorem 14 to obtain a bound of the form Proj Vj,k Proj b Vj,k C bΣj,k Σj,k . Finally, the term x cj,k is controlled by Assumption (A2). The proof of Theorem 7 is primarily supported by a volume comparison theorem that allows for the cancellation of the noisy terms that would imply dependency on D. That is, supposing that Proj M : Mσ M is the projection from the σ-tubular neighborhood onto the underlying manifold with reach τ, if U M is Vol M-measurable with Vol M(U) > 0, we have that 1 σ d Vol(Proj 1 M(U)) Vol M(U)Vol(BD d(0, σ)) 1 + σ Multiscale Dictionary Learning This is encapsulated in Lemma 18. This allows us to relate probabilities on the tubular neighborhood with probabilities on the manifold itself, which only involve d-dimensional volumes. The first thing that this allows us to do is to ensure that a sufficiently large sample from UMσ, {Xi}N i=1, has that {Proj M(Xi)}N i=1 is an ε-net for M. Running the cover tree algorithm at the appropriate scale and invoking the cover tree properties at this scale yields the constant for Assumption (A2). Cover tree properties also ensure that each partition element contains a large enough portion of the tubular neighborhood, which we then relate to a portions of the manifold whose volume is comparable to d-dimensional Euclidean volumes. This approach provides the constant for Assumption (A1). Finally, the constants from Assumption (A3) and (A4) are obtained from local moment estimates based upon these volume bounds. Now, the volume comparison bounds themselves are proven by considering coordinate systems that locally invert orthogonal projections onto tangent spaces. The fact that the manifold has reach τ imposes bounds on the Jacobians and second-order terms for these local inversions. These bounds are ultimately used to bound volume distortions, and lead to the volume comparison result above. 
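As a quick closed-form sanity check of the volume comparison displayed above (Lemma 18, part i.), consider the one case where every quantity is explicit: the sphere of radius $\tau$ in $\mathbb{R}^3$ (so $d = 2$, $D = 3$, reach $\tau$, and $U = \mathcal{M}$), whose $\sigma$-tube is a spherical shell. The snippet below is our own illustration and is not part of the proof.

```python
import numpy as np

# Check (1 - sigma/tau)^d * Vol_M(M) * Vol(B_{D-d}(0, sigma)) <= Vol(M_sigma)
#    <= (1 + sigma/tau)^d * Vol_M(M) * Vol(B_{D-d}(0, sigma))
# for M = sphere of radius tau in R^3; the tube volume is available in closed form.
tau, d, D = 1.0, 2, 3
for sigma in [0.01, 0.05, 0.1, 0.2]:
    vol_M = 4 * np.pi * tau ** 2                               # Vol_M(M)
    vol_ball = 2 * sigma                                       # Vol(B_{D-d}(0, sigma)), D - d = 1
    tube = 4.0 / 3.0 * np.pi * ((tau + sigma) ** 3 - (tau - sigma) ** 3)
    lower = (1 - sigma / tau) ** d * vol_M * vol_ball
    upper = (1 + sigma / tau) ** d * vol_M * vol_ball
    assert lower <= tube <= upper
    print(f"sigma={sigma:.2f}: {lower:.4f} <= {tube:.4f} <= {upper:.4f}")
```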
5.2 Proof of Theorem 2 Assumption (A3) above controls the L2(Π) approximation error of x M by Pj(x) (see Remark 1, part iii.), hence we will concentrate on the stochastic error b Pj(x) Pj(x) . To this end, we will need to estimate cj,k bcj,k and Proj Vj,k Proj b Vj,k , k = 1 . . . N(j). One of the main tools required to obtain this bound is the noncommutative Bernstein s inequality. Theorem 12 (Minsker, 2013, Theorem 2.1) Let Z1, . . . , Zn RD D be a sequence of independent symmetric random matrices such that EZi = 0 and Zi U a.s., 1 i n. Let Then for any t 1 t + log( D), U(t + log( D)) (16) with probability 1 e t, where D := 4 tr n P Note that we always have D 4D. We use this inequality to estimate bΣj,k Σj,k : let Π(dx|A) be the conditional distribution of X given that X A, and set Πj,k(dx) := Π(dx|Cj,k). Let mj,k := n P i=1 I{Xi Cj,k} to be the number of samples in Cj,k, k = 1 . . . N(j). Let I {1, . . . , n} be such that |I| = m. Conditionally on the event AI := Maggioni, Minsker, and Strawn {Xi Cj,k for i I , and Xi / Cj,k for i / I}, the random variables {Xi, i I} are independent with distribution Πj,k. Then Pr bΣj,k Σj,k s | mj,k = m = X I {1,...,n},|I|=m Pr bΣj,k Σj,k s | AI 1 n m (17) = Pr bΣj,k Σj,k s | A{1,...,m} . To estimate Pr bΣj,k Σj,k s | A{1,...,m} , we use the following inequality. Recall that d = 4d2 θ4 2 θ2 3 , where θ2, θ3 are the constants in Assumptions (A2) and (A3). Lemma 13 Let X, X1, . . . , Xm be an i.i.d. sample from Πj,k. Set i=1 Xi and bΣj,k := 1 i=1 (Xi bcj,k)(Xi bcj,k)T . Assume that m t + log( d 8). Then with probability 1 2e t, bΣj,k Σj,k 6r2 r t + log( d 8) Proof We want to estimate bΣj,k Σj,k = i=1 (Xi cj,k)(Xi cj,k)T Σj,k + (cj,k bcj,k)(cj,k bcj,k)T i=1 (Xi cj,k)(Xi cj,k)T Σj,k + (cj,k bcj,k)(cj,k bcj,k)T . (18) Set r := θ2 2 j. Recall that x cj,k r for all x, y Ci,j by assumption (A2). It implies that 1. for all 1 i m, (Xi cj,k)(Xi cj,k)T r2 almost surely, 2. E h (Xi cj,k)(Xi cj,k)T i2 = E Xi cj,k 2(Xi cj,k)(Xi cj,k)T r2 Σj,k . Therefore, by Theorem 12 applied to Zi := 1 m(Xi cj,k)(Xi cj,k)T , i = 1 . . . m, i=1 (Xi cj,k)(Xi cj,k)T Σj,k (t + log( d)) Σj,k m r2 t + log( d) (t + log( d)) t + log( d) Multiscale Dictionary Learning with probability 1 e t. Note that Σj,k tr (Σj,k) r2. Moreover, D = 4tr (EZ2 1) EZ2 1 4E(tr Z1)2 λj,k d 2 4d2 r4 θ2 32 4j = 4d2 θ4 2 θ2 3 = d by assumption (A3) and the definition of r. Since t+log( d) m 1 by assumption, 1 m i=1 (Xi cj,k)(Xi cj,k) Σj,k t + log( d) For the second term in (18), note that (cj,k bcj,k)(cj,k bcj,k)T = cj,k bcj,k 2. We apply Theorem 12 to the symmetric matrices Gi := 0 (Xi cj,k)T Noting that Gi = Xi cj,k r almost surely, EG2 i = E Xi cj,k 2 = tr (Σj,k) r2, and tr (EG2 i ) EG2 i = 2, we get that for all t such that t + log 8 m, with probability 1 e t bcj,k cj,k 2 (t + log 8) m rt + log 8 hence with the same probability (cj,k bcj,k)(cj,k bcj,k)T 4r2 t + log 8 and the claim follows. Given the previous result, we can estimate the angle between the eigenspaces of bΣj,k and Σj,k: Theorem 14 (Davis and Kahan, 1970), or (Zwald and Blanchard, 2006, Theorem 3). Let δd = δd(Σj,k) := 1 2(λj,k d λj,k d+1). If bΣj,k Σj,k < δd/2, then Proj Vj,k Proj b Vj,k Since δd θ3 2θ2 2 r2 d by assumption (A3), the previous result implies that, conditionally on the event {mj,k = m}, with probability 1 2e t, Proj Vj,k Proj b Vj,k t + log( d 8) Maggioni, Minsker, and Strawn It remains to obtain the unconditional bound. Set nj,k := nΠ(Cj,k) and note that nj,k θ1n2 jd by assumption (A1). 
To this end, we have max k=1...N(j) Proj Vj,k Proj b Vj,k (t + log( d 8))d2 max k=1...N(j) Proj Vj,k Proj b Vj,k (t + log( d 8))d2 mj,k nj,k/2, k = 1 . . . N(j) k=1 {mj,k < nj,k/2} k=1 Pr (mj,k < nj,k/2) . Recall that mj,k = n P i=1 I{Xi Cj,k}, hence Emj,k = nj,k and Var(mj,k) nj,k. Bernstein s inequality (see Lemma 2.2.9 in van der Vaart and Wellner, 1996) implies that |mj,k nj,k| 2 snj,k 4 with probability 1 e s. Choosing s = nj,k 16 , we deduce that Pr (mj,k < nj,k/2) 16 n2 jd, and, since N(j) 1 θ1 2jd by assumption (A1), k=1 Pr (mj,k < nj,k/2) 1 max k=1...N(j) Proj Vj,k Proj b Vj,k (t + log( d 8))d2 A similar argument implies that max k=1...N(j) cj,k bcj,k 2r t + log( d 8) 16 n2 jd . (21) We are in position to conclude the proof of Theorem 2. With assumption (A2), (20), and (21), the initial bound (14) implies that, with high probability, Pj(x) b Pj(x) 4 2 θ2 θ1 2 j r t + log( d 8) 2 θ3 2 θ3 θ1 2 j r (t + log( d 8))d2 Combined with assumption (A3) (see Remark 1, part iii.), this yields the result. 5.3 Proof of Theorem 7 Recall that M , RD is a smooth (or at least C2) compact manifold without boundary, with reach τ, and equipped with the volume measure d Vol M. Our proof is divided into several steps, and each of them is presented in a separate subsection to improve readability. Multiscale Dictionary Learning 5.3.1 Local Inversions of the Projection In this section, we introduce lemmas which ensure that (for r < τ/8) the projection map Proj y+Ty M is injective on B(y, r) M, and hence invertible by part iv. of Proposition 11. We also demonstrate that the derivatives of this inverse are bounded in a suitable sense. These estimates shall allow us to develop bounds on volumes in Mσ. We begin by proving a bound on the local deviation of the manifold from a tangent plane. Lemma 15 Suppose η T y M with η = 1 and z B(y, r) M, where r τ/2. Then | η, z y | 2r2 Our next lemma quantitatively establishes the local injectivity of the affine projections onto tangent spaces. 1 Lemma 16 Suppose y M and r < τ/8. Then Proj y+Ty M : B(y, r) M y + Ty M is injective. There are two important conclusions that Lemma 16 provides. First of all, it indicates that, under a certain radius bound, the manifold does not curve back into particular regions. This is helpful when we begin to examine upper bounds on local volumes. More importantly, if we let Jy,r = Proj y+Ty M(B(y, r) M), then there is a well-defined inverse map f of Proj y+Ty M, f : Jy,r B(y, r) M, when r < τ/8. Part iv of Proposition 11 implies that f is at least a C2 function, and part v of Proposition 11 implies that there is a d-dimensional ball inside of Jy,r of radius cos(θ)r, where θ = arcsin(r/2τ). Whenever we refer to such an f, we think of Jy,r as a subset in the span of the first d canonical directions, and we identify f with the value f takes in the span of the remaining D d directions. Thus, we identify f with the function whose graph is a small part of the manifold. Such an identification is obtained via an affine transformation, so we may do this without any loss of generality. Using these assumptions, we may prove the following bounds. Proposition 17 Let ε < τ/8, and assume f is defined above so that v 7 v f(v) inverse of Proj y+Ty M in B(y, ε) for some y M. Then sup v Bd(0,ε) Df(v) 2ε τ 2ε (22) sup v Bd(0,ε) sup u SD d 1 i=1 ui D2fi(v) (τ 2ε)3 . (23) 1. In an independent work, Eftekhari and Wakin (2013) prove a slightly stronger result that holds for r < τ/4. 
Maggioni, Minsker, and Strawn 5.3.2 Volume Bounds The main result of this section is Lemma 18, which allows us to compare volumes in Mσ with volumes in M. It also establishes an upper bound on volumes, which is an essential ingredient when we control the conditional distribution of Π subject to being in a particular Cj,k. The form of the bounds also allows us to cancel out noisy terms that would make the estimates depend upon the ambient dimension D. Lemma 18 Suppose σ < τ, suppose U M is measurable, and define P : Mσ M so that x 7 Proj M(x) under P. Then d Vol M(U)Vol(BD d(0, σ)) Vol(P 1(U)) 1 + σ d Vol M(U)Vol(BD d(0, σ)) ii. If r + σ τ/8, then Vol(Mσ B(y, r)) 1 + σ 1 + 2(r + σ) τ 2(r + σ) Vol(Bd(0, r +σ))Vol(BD d(0, σ)). 5.3.3 Absolute Continuity of the Pushforward of UMσ, and Local Moments Recall that UMσ is the uniform distribution over Mσ, and let UM := d Vol M Vol M(M) be the uniform distribution over M. In this section, we exploit the volume bounds of the previous subsection to obtain control over probabilities and local moments of UMσ. Our first result allows us to get the lower bounds for UMσ that are independent of the ambient dimension D. Lemma 19 Suppose σ < τ, and let e UMσ denote the pushforward of UMσ under Proj M. Then e UMσ and UM are mutually absolutely continuous with respect to each other, and Proof This is a straightforward consequence of part i. of Lemma 18. The next lemma quantitatively establishes the the decay of the local eigenvalues required in the second part of Assumption (A3). Lemma 20 Suppose Π is a distribution supported on Mσ, and let r < τ/2. Further assume that Z is the random variable drawn from Π conditioned on the event Z Q where Mσ Q B(y, r) for some y M. If Σ is the covariance matrix of Z, then i=d+1 λi(Σ) 2σ2 + 8r4 where λi(Σ) are the eigenvalues of Σ arranged in the decreasing order. Multiscale Dictionary Learning Finally, we derive a lower bound on the upper eigenvalues of the local covariance for the uniform distribution (needed to satisfy the first part of assumption (A3)). This is done in the following lemma. Lemma 21 Suppose that Q RD is such that B(y, r1) Q and Mσ Q B(y, r2) for some y M and σ < r1 < r2 < τ/8 σ. Let Z be drawn from UMσ conditioned on the event Z Q, and suppose Σ is the covariance matrix of Z. Then 1 + 2(r2+σ) τ 2(r2+σ) 2 d/2 (r1 σ)2 The following statement is key to establishing the error bounds for GMRA measured in sup-norm. Lemma 22 Assume that conditions of Lemma 21 hold, and let Vd := Vd(Σ) be the subspace corresponding to the first d principal components of Z. Then sup x Q x EZ Proj Vd(x EZ) 2σ + 4r2 2 τ + r2 r1 σ 4σ2 + 16r4 2 τ 2 γ(σ, τ, d, r1, r2), where γ(σ, τ, d, r1, r2) = 4 τ d/2 r2+σ r1 σ d/2 1+ 2(r2+σ) τ 2(r2+σ) 2 Notice that the term containing γ(σ, τ, d, r1, r2) is often of smaller order, so that the approximation is essentially controlled by the maximum of σ and r2 2 τ . 5.3.4 Putting all the Bounds Together In this final subsection, we prove Theorem 7. We begin by translating Proposition 3.2 in (Niyogi et al., 2008) into our setting. As before, let Xn = {X1, . . . , Xn} be an i.i.d. sample from Π, and the φ1 be the constant defined by (10). Proposition 23 (Niyogi et al., 2008, Proposition 3.2) Suppose 0 < ε < τ 2, and also that n and t satisfy d β1 log(ε dβ2) + t , (24) where β1 = Vol M(M) cosd(δ1)Vol(Bd(0,1/4)), β2 = Vol M(M) cosd(δ2)Vol(Bd(0,1/8)), δ1 = arcsin(ε/8τ), and δ2 = arcsin(ε/16τ). Let Eε/2,n be the event that Y = {Yj = Proj M(Xj)}n j=1 is ε/2-dense in M (that is, M n S i=1 B(Yi, ε/2)). 
Then, Πn(Eε,n) 1 e t, where Πn is the n-fold product measure of Π. Maggioni, Minsker, and Strawn Proof The proof closely follows the one given in (Niyogi et al., 2008). The only additional observation to make is that, if eΠ is the pushforward measure of Π under Proj M : Mσ M, then eΠ (M B(y, ε/8)) = Π(Proj 1 M(M B(y, ε/8))) φ1UMσ(Proj 1 M(M B(y, ε/8))) = φ1 e UMσ(M B(y, ε/8)) d UM(M B(y, ε/8)). by Lemma 18. If ε τ, previous proposition implies that we roughly need n Const(M, d) 1 ε points to get an ε-net for M. For the remainder of this section, we identify ε := ε(n, t) with the smallest ε > 0 satisfying (24) in the statement of Proposition 23, and we also assume that ε < σ. Take j Z+ such that σ < 2 j 2 < τ. (25) Let Cj,k be the partition of RD into Voronoi cells defined by (11). Recall that Tj = {aj,k}N(j) k=1 Xn is the set of nodes of the cover tree at level j, and set zj,k = Proj M(aj,k). Lemma 24 With probability 1 e t, for all j satisfying (25) and k = 1, . . . , N(j), B zj,k, 2 j 2 Cj,k and Cj,k Mσ B(aj,k, 3 2 j 2 + 2 j+1) B(zj,k, 3 2 j). (26) We now use Lemma 24 to obtain bounds on the constants θi for i = 1, . . . , 4 and α. We prove a lemma for each of the assumptions (A1), (A2), and (A3) and then collect them as the proof of Theorem 7. Proof [Proof of Theorem 7] Since the hypotheses of Lemma 24 are satisfied with high probability, we first obtain Π(Cj,k) Π(B(zj,k, 2 j 2)) φ1UMσ(B(zj,k, 2 j 2)) = φ1 Vol(Mσ B(zj,k, 2 j 2)) φ1 Vol(Proj 1 M(M B(zj,k, 2 j 2 σ))) d cos(δ)d Vol(Bd(0, 2 j 2 σ)) φ1Vol(Bd(0, 1)) 24d Vol M(M) Multiscale Dictionary Learning where δ = arcsin((2 j 2 σ)/2τ)). Thus, θ1 φ1Vol(Bd(0, 1)) 24d Vol M(M) Since the support is contained in a ball of radius 3 2 j, we easily obtain that θ2 12. Finally, it is not difficult to deduce from Lemmas 20 and 21 that θ3 φ1/φ2 24d+8 1 + σ τ d , θ4 2 2334 , and α = 1. Lemma 22 together with Lemma 24 imply that 6. Numerical Experiments In this section, we present some numerical experiments consistent with our results. 6.1 Spheres of Varying Dimension in RD We consider n points X1, . . . , Xn sampled i.i.d. from the uniform distribution on the unit sphere in Rd+1 M = Sd := {x Rd+1 : x = 1} . We then embed Sd into RD for D {10, 100} by applying a random orthogonal transformation Rd+1 RD. Of course, the actual realization of this projection is irrelevant since our construction is invariant under orthogonal transformations. After performing this embedding, we add two types of noise. In the first case, we add Gaussian noise ξ with distribution N(0, σ2 D ID): the scaling factor 1 D is chosen so that E ξ 2 = σ2. Since the norm of a Gaussian vector is tightly concentrated around its mean, this model is well-approximated by the truncated Gaussian model where the distribution of the additive noise is the same as the conditional distribution of ξ given ξ Cσ, where C is such that Cσ < 1. In this case, the constants in (1, Cσ)-model assumption would be prohibitively large, so instead we can verify the conditions given in Remark 9 directly: due to symmetry, we have that for any A Sd, Π Proj 1 M(A) = UM(A) = UMσ Proj 1 M(A) . (27) On the other hand, it is a simple geometric exercise to show that, for any B such that B(y, r) Mσ B B(y, 12r) and τ/2 = 1/2 > r 2Cσ, Proj 1 M (M B(y, r1)) B Proj 1 M (M B(y, r2)) , Maggioni, Minsker, and Strawn where r1 = r r 1 r2 2 and r2 = r q 3 4(1+Cσ). 
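The sampling model of this subsection is simple to reproduce. Below is a minimal sketch in Python (the function name and parameter values are ours, not taken from the paper's code): sample uniformly from S^d, embed isometrically into R^D with a random orthonormal frame, and add Gaussian noise scaled so that E‖ξ‖² = σ².

```python
import numpy as np

def sample_noisy_sphere(n, d, D, sigma, rng):
    """n points uniform on S^d, embedded in R^D by a random orthonormal frame,
    plus Gaussian noise with E||xi||^2 = sigma^2 (the model of Section 6.1)."""
    X = rng.standard_normal((n, d + 1))
    X /= np.linalg.norm(X, axis=1, keepdims=True)          # uniform on S^d
    Q, _ = np.linalg.qr(rng.standard_normal((D, d + 1)))   # D x (d+1), orthonormal columns
    embedded = X @ Q.T                                      # isometric embedding into R^D
    noise = rng.standard_normal((n, D)) * (sigma / np.sqrt(D))
    return embedded + noise

rng = np.random.default_rng(0)
data = sample_noisy_sphere(n=8000, d=4, D=100, sigma=0.1, rng=rng)
print(data.shape)
```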
Lemma 18 and (27) imply that Π(B) UMσ Proj 1 M (M B(y, r1)) (1 + Cσ)d Vol M(M B(y, r1))Vol(BD d(0, Cσ)) UMσ(B) UMσ Proj 1 M (M B(y, r2)) (1 Cσ)d Vol M(M B(y, r2))Vol(BD d(0, Cσ)) hence Π(B) φ2UMσ(B) for some φ2 independent of the ambient dimension D. We present the behavior of the L2(Π) error in this case in Figure 1, and the rate of approximation at the optimal scale as the number of samples varies in Figure 3, where it is compared to the rates obtained in Corollary 8. From Figure 1, we see that the approximations obtained satisfy our bound, and are typically better even for a modest number of samples in dimensions non-trivially low (e.g. 8000 samples on S8). In fact, the robustness with respect to sampling is such that the plots barely change from row to row. The second type of noise is uniform in the radial direction, i.e. we let η Unif[1 σ, 1+σ] and each noisy point is generated by Xi = Xi + ηi Xi ||Xi||. This is an example where the noise is not independent of X. Once again, it is easy to check directly that conditions of Remark 9 hold (the argument mimics the approach we used for the truncated Gaussian noise). Simulation results for this scenario are summarized in Figure 2, with the rate of approximation at the optimal scale again in Figure 3. We considered various settings of the parameters, namely all combinations of: d {1, 2, 4, 6, 8}, n {8000, 16000, 32000, 64000, 128000}, D {100, 1000}, σ {0, 0.05, 0.1}. We only display some of the results for reasons of space constraints. 2 6.2 Meyer s Staircase We consider the (d-dimensional generalization of) Y. Meyer s staircase. Consider the cube Q = [0, 1]d and the set of Gaussians N(x; µ, δ2Id) where the mean µ is allowed to vary over Q, and the function is truncated to only accept arguments x Q. Varying µ in Q in this manner induces a smooth embedding of a d-dimensional manifold into the infinite dimensional Hilbert space L2(Q). That is, the Gaussian density centered at µ Q and truncated to x Q is a point in L2(Q). By discretizing Q, we may sample this manifold and project it into a finite dimensional space. In particular, a grid ΓD Q of D points (obtained by subdividing in D 1 d equal parts along each dimension) may be generated and considering the evaluations of the set of translated Gaussians on this grid produces an embedding of this manifold into RD. Sampling n points from this manifold by randomly drawing µ1, . . . , µn uniformly from Q, we obtain a set {N(x; µi, δ2Id)|ΓD}i=1,...,n of n samples from the discretized Meyer s staircase in RD. This is what we call a sample from Meyer s 2. The code provided at www.math.duke.edu/~mauro/code.html can generate all the figures, re-create the data sets, and is easily modified to do more experiments. 
[Figure 1 panels: log_{1/θ}(Relative L² error) versus log_{1/θ}(1/r) with θ = 0.5, for d = 1, 2, 4, 6, 8; top row σ = 0, bottom row σ = 0.1; columns n = 8000, 64000, 512000.]

Figure 1: Experiment with S^d, without (top row) and with Gaussian noise (bottom row). The columns correspond to different values of n ∈ {8000, 64000, 512000}. In the plots, the dots represent the squared L²(Π) error (or MSE) of the GMRA approximations (see (5)) as a function of the radius r at scale j; more precisely, the abscissa is log₂(1/r_j), where r_j is the mean radius of the C_{j,k} for a fixed j, and the ordinate is log₂ MSE_j, where MSE_j is the mean squared error of the GMRA approximation at scale j. Different colors correspond to different intrinsic dimensions d (see legend). The two cases D = 10, 100 use the same colors for both the dots and the lines, all of which are essentially superimposed since our results are independent of the ambient dimension D. For each dimension we fit a line to measure the decay, which is O(r⁴) independently of d, consistent with our analysis. The horizontal dotted line, with the corresponding tick mark on the Y axis, represents the noise level σ²: the approximation error flattens out at roughly that level, as expected.

[Figure 2 panels: same axes as Figure 1, radially uniform noise with σ = 0.1, for n = 8000, 64000, 512000 and d = 1, 2, 4, 6, 8.]

Figure 2: This figure is as the second row of Figure 1, but the noise is radially uniform with width parameter σ. Note that the variance of the noise is σ²/3, which is indicated by a horizontal line and an extra tick mark on the Y-axis in the figures. The MSE converges quickly to that level as a function of scale.

staircase, which is illustrated in Figure 4. This example is not artificial: for instance, translating a white shape on a black background produces a set of 2-D images with a structure similar to the d-dimensional Meyer's staircase for d = 2. The manifold associated with Meyer's staircase is poorly approximated by subspaces of dimension smaller than O(D 1/δD), and besides spanning many dimensions in R^D, it has a small reach, depending on d, D, and δ. In our examples we considered n = 8000, 16000, 32000, 64000, 128000, d = 1, 2, 4, D = 2000, and δ = 5/100. We consider the noiseless case, as well as the case where Gaussian noise N(0, (1/D) I_D) is added to the data.
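A minimal sketch of this construction in Python (our own re-implementation for illustration only, not the code referenced in the footnote; names and parameter values are ours): evaluate Gaussian bumps with random centers on a regular grid of D points in [0, 1]^d.

```python
import numpy as np

def meyer_staircase_sample(n, d, grid_pts_per_dim, delta, rng):
    """Evaluate Gaussian densities N(x; mu, delta^2 I_d), truncated to x in [0,1]^d
    via a grid of D = grid_pts_per_dim**d points, for n random centers
    mu ~ Unif([0,1]^d); each row is one sample of the discretized staircase."""
    axes = [np.linspace(0.0, 1.0, grid_pts_per_dim)] * d
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)  # (D, d)
    mus = rng.uniform(0.0, 1.0, size=(n, d))
    sq_dists = ((grid[None, :, :] - mus[:, None, :]) ** 2).sum(axis=2)          # (n, D)
    return np.exp(-sq_dists / (2.0 * delta**2)) / ((2 * np.pi * delta**2) ** (d / 2))

rng = np.random.default_rng(0)
X = meyer_staircase_sample(n=1000, d=2, grid_pts_per_dim=16, delta=0.05, rng=rng)
print(X.shape)   # (1000, 256)
```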
Since Gaussian noise does not abide by the (σ, τ)-model assumption and τ is very small for Meyer's staircase, Figure 5 illustrates the behavior of the GMRA approximation outside of the regime where our theory is applicable.

6.3 The MNIST Dataset of Handwritten Digits

We consider the MNIST data set of images of handwritten digits³, each of size 28 × 28, grayscale. There are 60,000 images in total, from ten classes consisting of the digits 0, 1, . . . , 9. The intrinsic dimension of this data set is variable across the data (perhaps because different digits have a different number of degrees of freedom) and across scales, as observed in Little et al. (2012). We run GMRA with the cover tree scaling parameter θ equal to 0.9 (meaning that we replace 1/2 with 0.9 in the definition of cover trees in Section 3.2) in order to zoom slowly into the data at multiple scales. As the intrinsic dimension is not well-defined, we set GMRA to pick the dimension of the planes V_{j,k} adaptively, as the smallest dimension needed to capture half of the energy of the data in C_{j,k}. The distribution of the dimensions of the subspaces V_{j,k} has median 3 (consistent with the estimates in Little et al. (2012)) and is shown in Figure 6. We then compute the relative L² approximation error and various quantiles thereof; these are reported in the same figure. The running time on a desktop was a few minutes.

3. Available at http://yann.lecun.com/exdb/mnist/.

[Figure 3 panels: log₁₀ MSE(j_n) versus log₁₀ n. Left (d-sphere, σ = 0.00): empirical rates −1.99, −2.00, −0.85, −0.71, −0.44 versus predicted rates −2, −1, −0.67, −0.5, −0.4 for d = 1, 2, 4, 6, 8. Middle (d-sphere, σ = 0.05): empirical rates −0.003, −0.013, −0.47, −0.50, −0.44. Right (d-sphere, uniform noise, σ = 0.05): empirical rates −0.013, −0.026, −0.66, −0.62, −0.43.]

Figure 3: For the example of S^d considered in this section, we plot the MSE, i.e., the squared L²(Π) error (as defined in (5)), at the optimal scale j_n (as in the proof of Corollary 8) as a function of the number of points n ∈ {8000, 16000, 32000, 64000, 128000, 256000, 512000}, and compare our empirical rates (solid lines, with the rate reported in the legend under "emp. rate") with the rates predicted by Corollary 8 (dotted lines, with the rate reported in the legend under "pred. rate"), for various choices of the intrinsic dimension d ∈ {1, 2, 4, 6, 8} and fixed ambient dimension D = 10 (the results are independent of D, so we do not report the very similar results obtained for D = 100). Left: noiseless case; middle: Gaussian noise; right: radial uniform noise (see text). The rates match our results quite well, except in the case d = 2, where we seem to obtain the same convergence rate as in the d = 1 case. Here we choose the optimal scale to be the finest scale such that every cell contains at least 10d² points. For the noisy cases, the approximation rates for d = 1, 2 are not meaningful simply because we have enough points to reach the finest scale above the noise level.
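The empirical rates reported in the legend of Figure 3 are slopes of least-squares fits on log-log axes. A minimal sketch of such a fit follows; the mse values below are synthetic placeholders chosen by us, not the paper's data.

```python
import numpy as np

# Placeholder values standing in for the MSE at the optimal scale j_n for each n;
# the slope of the log-log fit is the reported empirical rate.
n = np.array([8000, 16000, 32000, 64000, 128000, 256000, 512000])
mse = 0.5 * n ** (-2 / 3)          # a d = 4-like decay, for illustration only

slope, intercept = np.polyfit(np.log10(n), np.log10(mse), 1)
print("empirical rate:", slope)    # close to -2/3 here by construction
```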
Figure 4: An illustration of Meyer's staircase for d = 2. The square is mapped into a subset of L²([−1, 1]²) consisting of truncated Gaussians. These are then sampled at the points of a uniform 16-by-16 grid to obtain an embedding of [0, 1]² into R^(16×16). For small δ, this embedding has a point very close to each coordinate axis in R^(16×16). Thus, it comes as no surprise that this embedding of [−1, 1]² into R^256 has a high degree of curvature.

[Figure 5 panels: left and middle, log_{1/θ}(Relative L² error) versus log_{1/θ}(1/r) with θ = 0.5 for the Meyer staircase with 128000 points, σ = 0 (d = 1, 2, 4) and σ = 0.1 (d = 2, 4); right, log₁₀ MSE(j_n) versus log₁₀ n with empirical rates −2.01, −1.63, −1.47 versus predicted rates −2, −1, −0.67 for d = 1, 2, 4.]

Figure 5: Left and middle: MSE as a function of the scale r for the d-dimensional Meyer's staircase, for different values of n, d, and σ, the standard deviation parameter of the Gaussian noise N(0, (σ²/D) I_D). The small reach of Meyer's staircase makes it harder to approximate and makes the approximation much more susceptible to noise. Moreover, Gaussian noise is unbounded, so this distribution violates the (σ, τ)-model assumption (albeit only at a small number of points, with high probability). Right: MSE at the optimal scale, chosen so that every cell contains at least 10d² points.

[Figure 6 panels: left, histogram of the dimensions of the scaling subspaces (range roughly 0 to 30); right, log_{1/θ}(Relative L² error) versus log_{1/θ}(1/r) with fitted line y = −1.24x − 100.11.]

Figure 6: Left: histogram of the dimension of the scaling function subspaces for the MNIST data set. Many of the subspaces are very low-dimensional, with dimensions mostly between 2 and 5. Right: relative squared L² approximation error as a function of scale. We do not plot the quantiles since many of them are many orders of magnitude smaller (which is a good thing in terms of approximation error), creating artifacts in the plots; they do indicate, though, that the structure of the data is highly complex and not homogeneous. Note that the axes of this plot are in log_{1/θ} scale, where θ = 0.9 is the cover tree scaling factor used in this example. Note how the approximation error decreases slowly at the beginning: there are many classes, rather far from each other, so it takes a few scales before GMRA starts focusing on each class, at which point the approximation error decreases more rapidly. This phenomenon does not happen uniformly over the data (figure not shown).

6.4 Sonata Kreutzer

We consider a recording of the first movement of the Sonata Kreutzer by L. V. Beethoven, played by Y. Pearlman (violin) and V. Ashkenazy (piano) (EMI recordings). The recording is stereo, sampled at 44.1 kHz. We map it to mono by simply summing the two audio channels, and then we generate a high-dimensional dataset as follows. We consider windows of width w seconds, overlapping by δw seconds, and regard the samples in each such time window [iδw, iδw + w) as a high-dimensional vector X̃_i of dimension equal to the sampling rate times w. In our experiment we choose w = 0.1 seconds and δw = 0.05 seconds, and the resulting vectors X̃_i are D = 551-dimensional.
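A minimal sketch of this windowing step in Python (the array names and the synthetic stand-in signal are ours; the actual recording is of course not reproduced here):

```python
import numpy as np

def windowed_vectors(mono, sample_rate, w=0.1, dw=0.05):
    """Cut a mono signal into overlapping windows of w seconds, hopped by dw
    seconds; each window becomes one high-dimensional vector (Section 6.4)."""
    win = int(round(w * sample_rate))
    hop = int(round(dw * sample_rate))
    starts = range(0, len(mono) - win + 1, hop)
    return np.stack([mono[s:s + win] for s in starts])

rng = np.random.default_rng(0)
stereo = rng.standard_normal((44100 * 5, 2))   # stand-in for 5 s of a 44.1 kHz recording
mono = stereo.sum(axis=1)                      # the paper sums the two channels
X = windowed_vectors(mono, 44100)
print(X.shape)                                 # (number of windows, 4410)
```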
Since Euclidean distances between the X i are far from being perceptually relevant, we transform each X i to its cepstrum (see Oppenheim and Schafer (1975)), remove the central low-pass frequency, and discard the symmetric part of the spectrum (the signal is real), obtaining Xi, a vector with D = 275 dimensions, and i ranges from 0 to about 130, 000. The running time on a desktop was few minutes. 0 5 10 15 20 25 30 35 40 45 50 0 Dimensions of the Scaling Subspaces -2 0 2 4 6 8 log1/θ(1/r) log1/θ(Relative L2 error) Relative squared error in L2 y= -3.1013 x - 10.0683 Figure 7: Left: histogram of the dimension of the scaling function subspaces for the Kreutzer sonata dataset. We see that the dimension of the scaling function subspaces is mostly between 4 and 25. Right: mean L2 relative approximation error squared as a function of scale. We do not plot the quantiles since many of them are many orders of magnitude smaller (which is a good thing in terms of approximation error), creating artifacts in the plots; they do indicate that the structure of the data is highly complex and non-heterogeneous. Note that the axes of this plot are in log1/θ scale, where θ = 0.9 is the scaling factor used in this example. Acknowledgments The authors gratefully acknowledge support from NSF DMS-0847388, NSF DMS-1045153, ATD-1222567, CCF-0808847, AFOSR FA9550-14-1-0033, DARPA N66001-11-1-4002. We would also like to thank Mark Iwen for his insightful comments. Multiscale Dictionary Learning Appendix: Proofs of Geometric Propositions and Lemmas Proof [Proof of Proposition 10] For the first inequality, let and B = Y 0 and for every T [d], we let VT denote the volume of {ai}i T c {bi}i T , where ai and bi denote the ith columns of A and B respectively. By submultilinearity of the volume we have Vol(A + B) X T 2[d] VT , where 2[d] = {S : S {1, . . . , d}}. We now show that VT q|T|Vol(A) for every T 2[d]. The bound Y q implies yi q for all i = 1, . . . , d, and so the fact that the volume is a submultiplicative function implies that VT q|T|Vol(AT c). On the other hand, letting a 1 be the orthogonal projection of a1 onto span {ai}d i=2, we note that a 1 1, and thus Vol(A{1}c) a 1 Vol(A{1}c) = Vol(A). By induction and invariance of the volume under permutations, we see that Vol(AT c) Vol(A) for all T 2[d]. Thus, Vol(A + B) X T 2[d] q|T|Vol(A) = (1 + q)d Vol(A). For the second inequality, since Y is symmetric, we can represent it as Y = F G where F and G are symmetric positive semidefinite, FG = GF = 0, and F , G Y . Indeed, if Y = QΛQT is the eigenvalue decomposition of Y with Λ = diag(λ), set λ+ := (max(0, λ1), . . . , max(0, λd))T , λ := λ+ λ, and define F := Q diag(λ+)QT , G = Q diag(λ )QT . Recall the matrix determinant lemma: let T Rk k be invertible, and let U, V Rk l. Then Vol(T + UV T ) = Vol(I + V T T 1U)Vol(T). Applying it in our case with U = , and T = I XT we have that Vol I + Y XT By orthogonality of the columns in I X with the columns in XT , we have that Maggioni, Minsker, and Strawn Therefore, we conclude that and combining this with the expression from the matrix determinant lemma completes the proof. Proof [Proof of Lemma 15] Let γ : [0, d M(z, y)] M denote the arclength-parameterized geodesic connecting y to z in M. Since γ is a geodesic, there is a v Ty M with v = 1 such that the Taylor expansion z = y + d M(z, y)v + Z d M(z,y) 0 γ (t) (d M(z, y) t) dt. 
By Proposition 11, γ (t) 2 1/τ for all t and d M(z, y) 2r, so we have that | η, z y | = η, Z d M(z,y) 0 γ (t) (d M(z, y) t) dt 0 | η, γ (t) | (d M(z, y) t) dt 0 (d M(z, y) t) dt Proof [Proof of Lemma 16] Suppose a and b are distinct in B(y, r) M. Now, b a = v+w where v Ta M and w T a M, and note that w 2 b a 2 τ by Lemma 15. This also implies that a b 2 4 a b 4 Multiscale Dictionary Learning By part iii. of Proposition 11, there is a u Ty M such that u, v v cos(φ) where φ is the angle between Ty M and Ta M. Then | u, b a | | u, v | | u, w | It then follows from r < τ/8 that Proj Ty M(b a) = 0, and hence Proj y+Ty M(a) = Proj y+Ty M(b) and injectivity holds. Proof [Proof of Proposition 17] For ε < τ/8, we may define the embedding v β where we have assumed (without loss of generality) that y = 0 and Ty M coincides with the span of the first d canonical orthonormal basis members. The domain of this map is the set Ω= {(v, β) Rd RD d : v Ty M B(0, ε), β 2 + Df(v)T β 2 < τ 2} and the Jacobian of this map is I + PD d i=1 βi D2fi(v) Df(v)T It is clear that the inverse of the above map is given by x 7 (Proj y+Ty M(Proj M(x)), Proj T y M(x Proj M(x))), which is at least a C1 map. Thus, a necessary condition for the τ-radius normal bundle to embed is that the Jacobian exhibited above is invertible, which in turn implies that I + PD d i=1 βi D2fi(v) Df(v)T for all ζ = 0 when (v, β) Ω. This reduces to (I +P βi D2fi(v)+Df(v)T Df(v))ζ = 0, and so a necessary condition for embedding is then that the norm of PD d i=1 βi D2fi(v) does not exceed 1 + Df(v) 2 whenever 2 = β 2 + Df(v)T β 2 < τ 2. Maggioni, Minsker, and Strawn In particular, this must be true if β < τ/ p 1 + Df(v) 2. This reduces to the condition that the operator norm sup u SD d 1 i=1 ui D2fi(v) < (1 + Df(v) 2)3/2 τ (1 + Df(v) )3 . (28) By the fundamental theorem of calculus, we have that Df(v)x = Df(0)x + Z v 0 [u T v D2fi(tuv)x]dt = Z v 0 [u T v D2fi(tuv)x]dt, where uv = v/ v and [u T v D2fi(tuv)x] indicates a vector with ith component u T v D2fi(tuv)x. Consequently, for any x Rd, we have that [u T v D2fi(tuv)x] dt v sup t [0, v ] [u T v D2fi(tuv)x] ε sup t [0,ε] [u T v D2fi(tuv)x] . (29) [u T v D2fi(tuv)x] = sup u SD d 1 u, [u T v D2fi(tuv)x] = sup u SD d 1 i=1 ui(u T v D2fi(tuv)x) = sup u SD d 1 u T v i=1 ui D2fi(tuv) sup u SD d 1 uv i=1 ui D2fi(tuv) = x sup u SD d 1 i=1 ui D2fi(tuv) which together with (29) and (28) yields the bound 1 + sup t [0,ε] Df(tuv) Since this inequality also holds for any v with v ε , taking a supremum yields sup ε [0,ε] Df(tuv) sup ε [0,ε] 1 + sup t [0,ε ] Df(tuv) 1 + sup ε [0,ε] Df(tuv) sup v Bd(0,ε) Df(v) ε 1 + sup v Bd(0,ε) Df(v) Multiscale Dictionary Learning Setting a(ε ) = supv Bd(0,ε ) Df(v) , we have that a(0) = 0, τ 1 + a(ε ) 3 , for all ε 0, and a is continuous by continuity of Df(v) . Setting b(ε ) = a(ε )/(1+a(ε )), we get b(ε )(1 b(ε ))2 ε Examining the polynomial x(1 x)2, we see that the sublevel set x(1 x)2 ω consists of two components when ω < 4/27. Also note that if ω < 1/8, then 2(1 2ω)2 = 2 8ω + 8ω2 > 2 1 = 1, and hence 2ω(1 2ω)2 > ω. Consequently, if x is such that x(1 x)2 ω and is in the interval containing zero in the sublevel set x(1 x)2 ω < 1/8, then x 2ω. By these observations, continuity of b(ε ), and the fact that b(0) = 0, we have that τ , and thus sup v Bd(0,ε) Df(v) 2ε τ 2ε. From the bound in (28) we now acquire the bound sup v Bd(0,ε) sup u SD d 1 i=1 ui D2fi(v) Proof [Proof of Lemma 18] We first prove part i. Let ε > 0 satisfy ε < τ/8. 
Because of (23) and the fact that β σ, we have that i=1 βi D2fi(v) Since this is also a bound for the columns of P βi D2fi(v), Proposition 10 implies that Vol I + P βi D2fi(v) Df(v)T d Vol I Df(v)T in T (M B(y, ε)) Mσ. On the other hand, we have that Maggioni, Minsker, and Strawn Vol Df T (v) I 1 + fi(v) 2 1 + 4ε2 since (22) implies the bounds f(v) vi 2ε τ 2ε for each i = 1, . . . , d, and the above is the largest this quantity may be subject to these bounds. When these estimates are joined together, we have an inequality Vol I + PD d i=1 βi D2fi(v) Df(v)T d Vol I Df(v)T (D d)/2 Vol I Df(v) For an arbitrarily small ε > 0, let {Uγ}γ Γ denote a finite partition of U into measurable sets such that there for each γ Γ, there is a yγ satisfying Uγ M B(yγ, ε). Let fγ denote the inverse of Pγ = Proj yγ+Tyγ M in Uγ, and set Eγ,v = {β RD d : β 2 + Dfγ(v)β 2 σ2} for all v Pγ(Uγ). Thus, P 1 γ (Uγ) d Vol(x) = Z Eγ,v Vol I + PD d i=1 βi D2fi(v) Df(x)T (D d)/2 Vol I Df(v) (D d)/2 Vol M(Uγ)Vol(BD d(0, σ)) since Eγ,v BD d(0, σ). Consequently, we have that Vol(P 1(U)) = X γ Γ Vol(P 1 γ (Uγ)) (D d)/2 Vol M(Uγ)Vol(BD d(0, σ)) (D d)/2 Vol M(U)Vol(BD d(0, σ)). Since ε > 0 was arbitrary, we obtain Vol(P 1(U)) Mσ) 1 + σ d Vol M(U)Vol(BD d(0, σ)). Multiscale Dictionary Learning This completes the proof of upper bound in part i. Using a similar partition strategy, we have that Z P 1 γ (Uγ) d Vol(x) = Z Eγ,v Vol I + PD d i=1 βi D2fi(v) Df(x)T d Vol I Df(v)T d Vol I Df(v) d Vol I Df(v) 0, σ 1+ ε τ ε d Vol I Df(v) d Vol M(Uγ)Vol BD d 0, 1 ε In the inequalities above, we have used the fact that there is a ball of radius 1 ε τ σ inside of Eγ,v for each γ and each v. Aggregating all of the sums and letting ε 0 yields the lower bound in part i. We now prove part ii. Note that Vol(Mσ B(y, r)) Vol(P 1(M B(y, r + σ))) since Proj M(x) y x y + Proj M(x) x r + σ. Part ii. now follows from part i. and the fact that Vol M(M B(y, r + σ)) Z P(M B(y,r+σ)) Vol I Df(v) 1 + 2(r + σ) τ 2(r + σ) 2!d/2 Vol(Bd(0, r + σ)). Proof [Proof of Lemma 20] By the variational characterization of eigenvalues, we have that i=d+1 λi(Σ) = argmin dim(V )=D d tr (Proj T V Σ Proj V ) = argmin dim(V )=D d E Proj V (Z EZ) 2 = argmin dim(V )=d E Z EZ Proj V (Z EZ) 2. Maggioni, Minsker, and Strawn Thus, we have that D P i=d+1 λi(Σ) E Z EZ Proj Ty M(Z EZ) 2. Observe that E Z EZ Proj Ty M(Z EZ) 2 =E Z y + (y EZ) Proj Ty M((Z y) + (y EZ)) 2 =E Z y Proj Ty M(Z y) 2 (y EZ) Proj Ty M(y EZ) 2 E Z y Proj Ty M(Z y) 2. Now for any z Mσ B(y, r), we have that z = β + x where x M, and β T x M satisfies β σ. Moreover, there is a unique decomposition x = η +v +y where η T y M and v Ty M. Thus, z y Proj Ty M(z y) = β + η Proj Ty Mβ β Proj Ty M(β) + η σ + 2r2 by Lemma 15, and we obtain the bound E Z EZ Proj Ty M(Z EZ) 2 2σ2 + 8r4 This establishes the required estimate. Proof [Proof of Lemma 21] For any unit vector u Ty M we have E u, Z EZ 2 = 1 Vol(Q Mσ) Q Mσ u, Z EZ 2d Vol(Z) 1 Vol(B(y, r2) Mσ) B(y,r1) Mσ u, (Z y) E(Z y) 2d Vol(Z) using the inclusion assumptions, and by adding and subtracting the constant vector y. We now seek to reduce the domain of integration and perform a change of variables. Since r1 τ/8, the inverse of the affine projection onto y + Ty M is injective. Without loss of generality, we assume y = 0 and Ty M is the span of the first d standard orthonormal vectors. Letting f denote the inverse of the affine projection onto y +Ty M, we see that the map v β 7 v f(v) + β is well-defined and injective on Proj Ty M(M B(y, r1 σ)) (T y M B(0, σ)). 
Let g denote this map, note that x + β x + β (r σ) + σ = r, for x M B(y, r1 σ), and hence the image of g is contained in Mσ B(y, r1). Since the absolute value of the determinant of the Jacobian of g is always 1 (it is lower triangular Multiscale Dictionary Learning with ones on the diagonal), employing the change of coordinates in the reduced domain of integration yields E u, Z EZ 2 1 Vol(B(y, r2) Mσ) , v f(v) + β E(Z y) 2 dβdv, where A = Proj Ty M(B(y, r1 σ) M), B = T y M B(0, σ). Note that B(y, cos(θ)(r1 σ)) (y + Ty M) A. Setting Q = Proj Ty M, this immediately reduces to E u, Z EZ 2 1 Vol(B(y, r2) Mσ) B u, v EQ(Z y) 2dβdv = Vol(BD d(0, σ)) Vol(B(y, r2) Mσ) A u, v EQ(Z y) 2dv Vol(BD d(0, σ)) Vol(B(y, r2) Mσ) Bd(0,q) u, v EQ(Z y) 2dv, where q = cos(δ)(r1 σ) and δ = arcsin((r1 σ)/2τ). Noting that R Bd(0,q) u, v dv = 0 by symmetry, we now use linearity of the inner product to further reduce the integrand: E u, Z EZ 2 Vol(BD d(0, σ)) Vol(B(y, r2) Mσ) u, v 2 2 u, v u, EQ(Z y) + u, EQ(Z y) 2 dv = Vol(BD d(0, σ)) Vol(B(y, r2) Mσ) u, v 2 + u, EQ(Z y) 2 dv Vol(BD d(0, σ)) Vol(B(y, r2) Mσ) Bd(0,q) u, v 2dv = Vol(BD d(0, σ))Vol(Bd(0, q)) Vol(B(y, r2) Mσ) q2 By Lemma 18, we then obtain E u, Z EZ 2 1 + 2(r2 + σ) τ 2(r2 + σ) d Vol(Bd(0, q)) Vol(Bd(0, r2 + σ)) q2 1 + 2(r2+σ) τ 2(r2+σ) 2 d/2 (r1 σ)2 Let Vd 1(Σ) be a subspace corresponding to the first d 1 principal components of Z: Vd 1 = argmin dim(V )=d 1 E Z EZ Proj V (Z EZ) , and note that λd(Σ) = max0 =u V d 1 E D u u , Z EZ E2 . Since dim(V d 1) = D d + 1 and dim(Ty M) = d, it is easy to see that V d 1 Ty M = . For any u V d 1 Ty M such that Maggioni, Minsker, and Strawn u = 1 it follows from Courant-Fischer characterization of λd(Σ) that λd(Σ) E u , Z EZ 2 , and (32) implies the desired bound. Proof [Proof of Lemma 22] Let Q RD be such that B(y, r1) Q and Mσ Q B(y, r2) for some y M and σ < r1 < r2 < τ/8 σ. Assume that Z is drawn from UMσ Q, let Σ be the covariance matrix of Z and Vd := Vd(Σ) - the subspace corresponding to the first d principal components of Z. Let α [0, 1] be such that cos(φ) := minu Vd, u =1 maxv Ty M, v =1 | u, v | = 1 α2 is the cosine of the angle between Ty M and Vd. Then there exists a unit vector u (Vd) such that max v Ty M, v =1 | u , v | α. Indeed, let u Vd, v Ty M be unit vectors such that cos(φ) = u , v , Note that is equal to the smallest absolute value among the nonzero singular values of the operator Proj Ty M Proj Vd. Since the spectra of the operators Proj Ty M Proj Vd and Proj Vd Proj Ty M coincide by well-known facts from linear algebra, we have that min u Vd, u =1 max v Ty M, v =1 | u, v | = min v Ty M, v =1 max u Vd, u =1 | u, v | . In other words, Proj Ty M(u ) = u , v v and Proj Vd(v ) = u , v u . This implies that there exists a unit vector u (Vd) such that v = v , u u + v , u u , hence u , v 2 = 1 v , u 2 = α2, so u satisfies the requirement. To simplify the expressions, let ζ = 1 Vol(Q Mσ). We shall now construct upper and lower bounds for Q Mσ u , x EZ Proj Vd(x EZ) 2 d Vol(x) = ζ Z Q Mσ u , x EZ 2 d Vol(x) which together yield an estimate for α. Write u = u|| + u , where u|| Ty M and u T y M. By our choice of u , we clearly have that u|| = maxv Ty M, v =1 u , v α. Using the elementary inequality (a + b)2 a2 2 b2, we further deduce that Q Mσ u , x EZ 2 d Vol(x) ζ Z D u|| , x EZ E2 d Vol(x) (33) D u , x EZ E2 d Vol(x). 
It follows from the proof of Lemma 21 that D u|| , x EZ E2 d Vol(x) α2 1 + 2(r2+σ) τ 2(r2+σ) 2 d/2 (r1 σ)2 Multiscale Dictionary Learning For the last term in (33), Lemma 20 (see equation (31)) gives D u , x EZ E2 d Vol(x) ζ Z Q Mσ x EZ Proj Ty M(x EZ) 2d Vol(x) 2σ2 + 8r4 2 τ 2 , hence (33) yields Q Mσ u , x EZ 2 d Vol(x) α2 1 + 2(r2+σ) τ 2(r2+σ) 2 d/2 (r1 σ)2 2σ2 8r4 2 τ 2 . (34) On the other hand, invoking (31) once again, we have Q Mσ u , x EZ 2 d Vol(x) 2σ2 + 8r4 2 τ 2 . Combined with (34), this gives 1 + 2(r2+σ) τ 2(r2+σ) 2 d/2 (r1 σ)2 d 4σ2 + 16r4 2 τ 2 , (35) and the upper bound for α follows. Notice that for any x Q Mσ, x EZ Proj Vd(x EZ) = x y Proj Ty M(x y) + y EZ Proj Ty M(y EZ) | {z } Proj (Ty M) (y EZ) + (Proj Ty M Proj Vd)(x EZ). It follows from (30) that x y Proj Ty M(x y) = Proj T y M(x y) σ + 2r2 2 τ . Proj (Ty M) (y EZ) = 1 Vol(Q Mσ) Q Mσ Proj T y M(y z)d Vol(z) 1 Vol(Q Mσ) Proj T y M(z y) d Vol(z) σ + 2r2 2 τ . Maggioni, Minsker, and Strawn Finally, it is easy to see that (Proj Ty M Proj Vd)(x EZ) Proj Ty M(x EZ) Proj Vd Proj Ty M(x EZ) + Proj T y M(x y) + Proj T y M(EZ y) . Let ux := Proj Ty M(x EZ) Proj Ty M(x EZ) and note that for any x Q Mσ, Proj Ty M(x EZ) 2r2, hence Proj Ty M(x EZ) Proj Vd Proj Ty M(x EZ) 2 (2r2)2 1 Proj Vdux 2 1 min u Ty M, u =1 max v Vd, v =1 u, v 2 Combining the previous bounds with (35) and (36), we obtain the result. Proof [Proof of Lemma 24] Assume the event Eε/2,n = {{Y1, . . . , Yn} is an ε/2 - net in M} occurs. By Proposition 23, Pr(Eε/2,n) 1 e t. Since the elements of Tj are 2 j-separated, for any 1 k N(j), B(aj,k, 2 j 1) Cj,k. Moreover, since σ 2 j 2 and aj,k zj,k σ, B(zj,k, 2 j 1 2 j 2) B(zj,k, 2 j 1 σ) B(aj,k, 2 j 1), hence the inclusion B zj,k, 2 j 2 Cj,k follows. To show that Cj,k Mσ B(aj,k, 3 2 j 2 + 2 j+1), pick an arbitrary z Mσ. Note that on the event Eε/2,n, there exists y {Y1, . . . , Yn} satisfying z y ε/2 + σ. Let x(y) Xn be such that y = Proj M(x(y)). By properties of the cover trees (see Remark 5), there exists x Tj such that x(y) x 2 j+1. Then z x z y + y x(y) + x(y) x ε/2 + 2σ + 2 j+1 3 2 j 2 + 2 j+1. Since z was arbitrary, the result follows. Finally, B(aj,k, 3 2 j 2 + 2 j+1) B(zj,k, 3 2 j) holds since aj,k zj,k 2 j 2. M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11): 4311 4322, 2006. W.K. Allard, G. Chen, and M. Maggioni. Multi-scale geometric methods for data sets II: Geometric multi-resolution analysis. Applied and Computational Harmonic Analysis, 32 (3):435 462, 2012. ISSN 1063-5203. M. Belkin and P. Niyogi. Using manifold structure for partially labelled classification. Advances in NIPS, 15, 2003. A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine learning, pages 97 104. ACM, 2006. Multiscale Dictionary Learning Paul S Bradley and Olvi L Mangasarian. K-plane clustering. Journal of Global Optimization, 16(1):23 32, 2000. F. Camastra and A. Vinciarelli. Intrinsic dimension estimation of data: an approach based on Grassberger Procaccia s algorithm. Neural Processing Letters, 14(1):27 34, 2001. G. Canas, T. Poggio, and L. Rosasco. Learning manifolds with K-Means and K-Flats. In Advances in Neural Information Processing Systems 25, pages 2474 2482, 2012. E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, (6):2313 2351, 2007. math.ST/0506081. E. 
Causevic, R.R. Coifman, R. Isenhart, A. Jacquin, E.R. John, M. Maggioni, L.S. Prichep, and F.J. Warner. QEEG-based classification with wavelet packets and microstate features for triage applications in the ER. volume 3. ICASSP Proc., May 2006. G. Chen and Gilad Lerman. Foundations of a multi-way spectral clustering framework for hybrid linear modeling. Foundations of Computational Mathematics, 9(5):517 558, 2009. G. Chen and M. Maggioni. Multiscale geometric wavelets for the analysis of point clouds. To appear in Proc. CISS 2010, 2010. G. Chen and M. Maggioni. Multiscale geometric and spectral analysis of plane arrangements. In Proc. CVPR, 2011. G. Chen, A.V. Little, and M. Maggioni. Multi-resolution geometric analysis for data in high dimensions. Proc. FFT 2011, 2011a. G. Chen, A.V. Little, M. Maggioni, and L. Rosasco. Wavelets and Multiscale Analysis: Theory and Applications. Springer Verlag, 2011b. G. Chen, M. Iwen, Sang Chin, and M. Maggioni. A fast multiscale framework for data in high-dimensions: Measure estimation, anomaly detection, and compressive measurements. In Visual Communications and Image Processing (VCIP), 2012 IEEE, pages 1 6, 2012. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33 61, 1998. P. Ciaccia, M. Patella, F. Rabitti, and P. Zezula. Indexing metric spaces with M-Tree. In SEBD, volume 97, pages 67 86, 1997. R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. PNAS, 102(21):7426 7431, 2005a. R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Multiscale methods. PNAS, 102(21):7432 7438, 2005b. Maggioni, Minsker, and Strawn R.R. Coifman and M. Maggioni. Diffusion wavelets. Appl. Comp. Harm. Anal., 21(1):53 94, July 2006. R.R. Coifman, S. Lafon, M. Maggioni, Y. Keller, A.D. Szlam, F.J. Warner, and S.W. Zucker. Geometries of sensor outputs, inference, and information processing. In Defense and Security Symposium. SPIE, May 2006. G. David and S. Semmes. Analysis of and on uniformly rectifiable sets, volume 38 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 1993. ISBN 0-8218-1537-7. C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1 46, 1970. D. Donoho. Compressed sensing. IEEE Tran. on Information Theory, 52(4):1289 1306, April 2006. D. L. Donoho. Wedgelets: Nearly minimax estimation of edges. Annals of Statistics, 27(3): 859 897, 1999. D. L. Donoho and C. Grimes. When does isomap recover the natural parameterization of families of articulated images? Technical report, 2002. D. L. Donoho and C. Grimes. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. PNAS, 100(10):5591 5596, 2003. A. Eftekhari and M. B. Wakin. New analysis of manifold embeddings and signal recovery from compressive measurements. ar Xiv preprint ar Xiv:1306.4748, 2013. E. Elhamifar and R. Vidal. Sparse subspace clustering. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2790 2797. IEEE, 2009. H. Federer. Curvature measures. Transactions of the American Mathematical Society, 93 (3):418 491, 1959. C. Fefferman, S. Mitter, and H. Narayanan. Testing the manifold hypothesis. Journ. A.M.S. 
URL http://arxiv.org/abs/1310.0425. conf. version appeared in NIPS, 2010, pages 1786 1794. M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381 395, 1981. C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Minimax manifold estimation. J. Mach. Learn. Res., 13(1):1263 1291, May 2012a. ISSN 1532-4435. C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Manifold estimation and singular deconvolution under Hausdorffloss. The Annals of Statistics, 40(2):941 963, 2012b. Multiscale Dictionary Learning A. Gray. Tubes, volume 221 of Progress in Mathematics. Birkh auser Verlag, Basel, second edition, 2004. ISBN 3-7643-6907-8. doi: 10.1007/978-3-0348-7966-8. With a preface by Vicente Miquel. R. Gribonval, R. Jenatton, F. Bach, M. Kleinsteuber, and M. Seibert. Sample complexity of dictionary learning and other matrix factorizations. ar Xiv:1312.3790, 2013. J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In CVPR 2003 Proceedings., volume 1, pages I 11. IEEE, 2003. H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(4):17 441,498 520, 1933. H. Hotelling. Relations between two sets of variates. Biometrika, 27:321 77, 1936. M.A. Iwen and M. Maggioni. Approximation of points on low-dimensional manifolds via random linear projections. Inference & Information, 2(1):1 31, 2013. P. W. Jones. Rectifiable sets and the traveling salesman problem. Inventiones Mathematicae, 102(1):1 15, 1990. P.W. Jones, M. Maggioni, and R. Schul. Manifold parametrizations by eigenfunctions of the Laplacian and heat kernels. Proc. Nat. Acad. Sci., 105(6):1803 1808, Feb. 2008. P.W. Jones, M. Maggioni, and R. Schul. Universal local manifold parametrizations via heat kernels and eigenfunctions of the Laplacian. Ann. Acad. Scient. Fen., 35:1 44, January 2010. D. R. Karger and M. Ruhl. Finding nearest neighbors in growth-restricted metrics. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 741 750. ACM, 2002. V. I. Koltchinskii. Empirical geometry of multivariate data: a deconvolution approach. Annals of Statistics, pages 591 629, 2000. K. Kreutz-Delgado, J. F. Murray, B. D. Rao, Kjersti Engan, T.-W. Lee, and T. J. Sejnowski. Dictionary learning algorithms for sparse representation. Neural Comput., 15(2):349 396, February 2003. E. Levina and P. J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, pages 777 784, 2004. M.S. Lewicki, T.J. Sejnowski, and H. Hughes. Learning overcomplete representations. Neural Computation, 12:337 365, 1998. A. V. Little, M. Maggioni, and L. Rosasco. Multiscale geometric methods for data sets I: Multiscale SVD, noise and curvature. Technical report, MIT, September 2012. URL http://dspace.mit.edu/handle/1721.1/72597. Maggioni, Minsker, and Strawn A.V. Little, Y.-M. Jung, and M. Maggioni. Multiscale estimation of intrinsic dimensionality of data sets. In Proceedings of AAAI, 2009. G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 663 670, 2010. Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy data coding and compression. 
Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(9):1546 1562, 2007. Y. Ma, A. Y. Yang, H. Derksen, and R. Fossum. Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Review, 50(3):413 458, 2008. J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journ. Mach. Learn. Res., 11:19 60, 2010. A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert Spaces. IEEE Transactions on Information Theory, 56(11):5839 5846, 2010. S. Minsker. On some extensions of Bernstein s inequality for self-adjoint operators. ar Xiv preprint ar Xiv:1112.5448, 2013. P. Niyogi, S. Smale, and S. Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete and Computational Geometry, 39:419 441, 2008. B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, (37), 1997. A.V. Oppenheim and R.W. Schafer. Digital Signal Processing. Prentice-Hall, 1975. K. Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559 572, 1901. G. Peyr e. Sparse modeling of textures. Journal of Mathematical Imaging and Vision, 34 (1):17 31, 2009. M. Protter and M. Elad. Sparse and redundant representations and motion-estimation-free algorithm for video denoising, 2007. I. U. Rahman, I. Drori, V. C. Stodden, D. L. Donoho, and P. Schr oder. Multiscale representations for manifold-valued data. Multiscale Modeling & Simulation, 4(4):1201 1232, 2005. S. T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323 2326, 2000. Multiscale Dictionary Learning Y. Sugaya and K. Kanatani. Multi-stage unsupervised learning for multi-body motion segmentation. IEICE Transactions on Information and Systems, 87(7):1935 1942, 2004. J. B. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319 2323, 2000. M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2):443 482, 1999. D. Vainsencher, S. Mannor, and A. M. Bruckstein. The sample complexity of dictionary learning. J. Mach. Learn. Res., 12:3259 3281, November 2011. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics. R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(12):1945 1959, 2005. M. B. Wakin, D. L. Donoho, H. Choi, and R. G. Baraniuk. The multiscale structure of nondifferentiable image manifolds. In SPIE Wavelets XI, pages 59141B 59141B. International Society for Optics and Photonics, 2005. J. Yan and M. Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In Computer Vision ECCV 2006, pages 94 106. Springer, 2006. P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete algorithms, pages 311 321. Society for Industrial and Applied Mathematics, 1993. K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. 
In Advances in Neural Information Processing Systems, pages 2223 2231, 2009. T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Randomized hybrid linear modeling by local best-fit flats. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1927 1934. IEEE, 2010. Z. Zhang and H. Zha. Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Journal of Scientific Computing, 26:313 338, 2002. L. Zwald and G. Blanchard. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems 18, pages 1649 1656. MIT Press, Cambridge, MA, 2006.