Published as a conference paper at ICLR 2019

SUBGRADIENT DESCENT LEARNS ORTHOGONAL DICTIONARIES

Yu Bai, Qijia Jiang & Ju Sun
Stanford University
{yub,qjiang2,sunju}@stanford.edu
(The reader is welcome to refer to our arXiv version for future updates.)

This paper concerns dictionary learning, i.e., sparse coding, a fundamental representation learning problem. We show that a subgradient descent algorithm, with random initialization, can recover orthogonal dictionaries on a natural nonsmooth, nonconvex ℓ1 minimization formulation of the problem, under mild statistical assumptions on the data. This is in contrast to previous provable methods that require either expensive computation or delicate initialization schemes. Our analysis develops several tools for characterizing landscapes of nonsmooth functions, which might be of independent interest for provable training of deep networks with nonsmooth activations (e.g., ReLU), among other applications. Preliminary synthetic and real experiments corroborate our analysis and show that our algorithm works well empirically in recovering orthogonal dictionaries.

1 INTRODUCTION

Dictionary learning (DL), i.e., sparse coding, concerns the problem of learning compact representations: given data Y, one tries to find a representation basis A and coefficients X so that Y ≈ AX, where X is as sparse as possible. DL has numerous applications, especially in image processing and computer vision (Mairal et al., 2014). When posed in analytical form, DL seeks a transformation Q such that QY is sparse; in this sense DL can be considered an (extremely!) primitive deep network (Ravishankar & Bresler, 2013).

Many heuristic algorithms have been proposed to solve DL since the seminal work of Olshausen & Field (1996), most of them surprisingly effective in practice (Mairal et al., 2014; Sun et al., 2015). However, understanding of when and how DL is solvable has only recently started to emerge. Under appropriate generating models on A and X, Spielman et al. (2012) showed that complete (i.e., square, invertible) A can be recovered from Y, provided that X is ultra-sparse. Subsequent works (Agarwal et al., 2017; Arora et al., 2014; 2015; Chatterji & Bartlett, 2017; Awasthi & Vijayaraghavan, 2018) provided similar guarantees for overcomplete (i.e., fat) A, again in the ultra-sparse regime. The latter methods are invariably based on nonconvex optimization with model-dependent initialization, rendering their practicality on real data questionable.

The ensuing developments have focused on breaking the sparsity barrier and addressing the practicality issue. Convex relaxations based on the sum-of-squares (SOS) SDP hierarchy can recover overcomplete A when X has linear sparsity (Barak et al., 2015; Ma et al., 2016; Schramm & Steurer, 2017), while incurring expensive computation (solving large-scale SDPs or large-scale tensor decompositions). By contrast, Sun et al. (2015) showed that complete A can be recovered in the linear sparsity regime by solving a certain nonconvex problem with arbitrary initialization. However, the second-order optimization method proposed there is still expensive. This problem is partially addressed by Gilboa et al. (2018), which proved that first-order gradient descent with random initialization enjoys a similar performance guarantee. A standing barrier toward practicality is dealing with nonsmooth functions.
To promote sparsity in the coefficients, the ℓ1 norm is the function of choice in practical DL, as is common in modern signal processing and machine learning (Candès, 2014): despite its nonsmoothness, this choice often admits highly scalable numerical methods, such as the proximal gradient method and the alternating direction method (Mairal et al., 2014). The analyses in Sun et al. (2015); Gilboa et al. (2018), however, focused on characterizing the algorithm-independent function landscape of a certain nonconvex formulation of DL, which takes a smooth surrogate of ℓ1 to get around the nonsmoothness. That smoothing tactic introduced substantial analysis difficulty, and broke the practical advantage of computing with the simple ℓ1 function.

In this paper, we show that working directly with a natural ℓ1 norm formulation results in neat analysis and a practical algorithm. We focus on the problem of learning orthogonal dictionaries: given data {y_i}_{i ∈ [m]} generated as y_i = A x_i, where A ∈ ℝ^{n×n} is a fixed unknown orthogonal matrix and each x_i ∈ ℝ^n is an i.i.d. Bernoulli-Gaussian random vector with parameter θ ∈ (0, 1), recover A. This statistical model is the same as in previous works (Spielman et al., 2012; Sun et al., 2015).

Write Y := [y_1, . . . , y_m] and similarly X := [x_1, . . . , x_m]. We propose to recover A by solving the following nonconvex (due to the constraint), nonsmooth (due to the objective) optimization problem:

    minimize_{q ∈ ℝ^n}  f(q) := (1/m) Σ_{i=1}^m |q^⊤ y_i|    subject to  ‖q‖_2 = 1.    (1.1)

Based on the statistical model, q^⊤ Y = q^⊤ A X has the highest sparsity when q is a column of A (up to sign), so that q^⊤ A is 1-sparse. Spielman et al. (2012) formalized this intuition and optimized the same objective as Eq. (1.1) with a ‖q‖_∞ = 1 constraint, which only works when θ ∈ O(1/√n). Sun et al. (2015) worked with the sphere constraint but replaced the ℓ1 objective with a smooth surrogate, introducing substantial analytical and computational deficiencies as alluded to above. In contrast, we show that with sufficiently many samples, the optimization landscape of formulation (1.1) is benign with high probability (over the randomness of X), and a simple Riemannian subgradient descent algorithm can provably recover A in polynomial time.

Theorem 1.1 (Main result, informal version of Theorem 3.1). Assume θ ∈ [1/n, 1/2]. For m ≥ Ω(θ^{-2} n^4 log^4 n), the following holds with high probability: there exists a poly(m, ε^{-1})-time algorithm, which runs Riemannian subgradient descent on formulation (1.1) from at most O(n log n) independent, uniformly random initial points, and outputs a set of vectors {â_1, . . . , â_n} such that, up to permutation and sign change, ‖â_i − a_i‖_2 ≤ ε for all i ∈ [n].

In words, our algorithm also works in the linear sparsity regime, the same as established in Sun et al. (2015); Gilboa et al. (2018), at a lower sample complexity O(n^4) in contrast to the existing O(n^{5.5}) in Sun et al. (2015).^1 As for the landscape, we show (Theorems 3.4 and 3.6) that each of the desired solutions {±a_i}_{i ∈ [n]} is a local minimizer of formulation (1.1) with a sufficiently large basin of attraction, so that a random initialization will land in one of the basins with at least constant probability. To obtain the result, we integrate and develop elements from nonsmooth analysis (on Riemannian manifolds), set-valued analysis, and random set theory, which might be valuable for studying other nonconvex, nonsmooth optimization problems.

^1 The sample complexity in Gilboa et al. (2018) is not explicitly stated.
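As a concrete illustration of the data model and of formulation (1.1), here is a minimal NumPy sketch (ours, not part of the paper; the function names bernoulli_gaussian, dictionary_learning_instance, and f_l1 and all parameter values are illustrative) that draws Bernoulli-Gaussian coefficients, forms Y = AX for a random orthogonal A, and evaluates the ℓ1 objective on the sphere.

```python
import numpy as np

def bernoulli_gaussian(n, m, theta, rng):
    """Draw an n x m coefficient matrix X with i.i.d. BG(theta) entries."""
    mask = rng.random((n, m)) < theta          # Bernoulli support pattern
    return mask * rng.standard_normal((n, m))  # Gaussian values on the support

def dictionary_learning_instance(n, m, theta, seed=0):
    """Generate Y = A X with A a random orthogonal dictionary."""
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal matrix
    X = bernoulli_gaussian(n, m, theta, rng)
    return A, X, A @ X

def f_l1(q, Y):
    """Objective of formulation (1.1): f(q) = (1/m) * sum_i |<q, y_i>|."""
    return np.mean(np.abs(q @ Y))

A, X, Y = dictionary_learning_instance(n=30, m=5000, theta=0.2)
q_rand = np.random.default_rng(1).standard_normal(30)
q_rand /= np.linalg.norm(q_rand)
print(f_l1(A[:, 0], Y), f_l1(q_rand, Y))  # the first value is typically much smaller
```

The comparison at the end reflects the intuition above: a column of A makes q^⊤ A 1-sparse and hence yields a markedly smaller objective value than a generic unit vector.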
1.1 RELATED WORK

Dictionary learning. Besides the many results sampled above, we highlight the similarities of our result to Gilboa et al. (2018). Both propose first-order optimization methods with random initialization, and several quantities we work with in the proofs are the same. A defining difference is that we work with the nonsmooth ℓ1 objective directly, while Gilboa et al. (2018) built on the smoothed objective from Sun et al. (2015). We put considerable emphasis on practicality: the subgradient of the nonsmooth objective is considerably cheaper to evaluate than the gradient of the smooth objective in Sun et al. (2015), and in the algorithm we use Euclidean projection rather than exponential mapping to remain feasible; again, the former is much lighter in computation.

General nonsmooth analysis. While nonsmooth analytic tools such as the subdifferential for convex functions are now well received in machine learning and relevant communities, those for general functions are much less so. The Clarke subdifferential and the relevant calculus developed for the family of locally Lipschitz functions seem to be particularly relevant, and cover several families of functions of interest, such as convex functions, differentiable functions, and many forms of compositions (Clarke, 1990; Aubin, 1998; Bagirov et al., 2014). Remarkably, the majority of these tools and results can be generalized to locally Lipschitz functions on Riemannian manifolds (Ledyaev & Zhu, 2007; Hosseini & Pouryayevali, 2011). Our formulation (1.1) is exactly optimization of a locally Lipschitz function (as it is convex) on a Riemannian manifold (the sphere). Nonetheless, for simplicity we try to avoid the full manifold language.

Nonsmooth optimization on Riemannian manifolds or with constraints. Equally remarkably, many of the smooth optimization techniques and convergence results can be naturally adapted to optimization of locally Lipschitz functions on Riemannian manifolds (Grohs & Hosseini, 2015; Hosseini, 2015; Hosseini & Uschmajew, 2017; Grohs & Hosseini, 2016). New optimization methods such as gradient sampling and its variants have been invented to solve general nonsmooth problems (Burke et al., 2005; 2018; Bagirov et al., 2014; Curtis & Que, 2015; Curtis et al., 2017). Almost all available convergence results pertain only to global convergence, which is too weak for our purpose. Our specific convergence analysis gives us a local convergence result (Theorem 3.8).

Nonsmooth landscape characterization. Nonsmoothness is not a big optimization barrier if the problem is convex; here we review some recent work on analyzing nonconvex nonsmooth problems. Loh & Wainwright (2015) study the regularized empirical risk minimization problem with nonsmooth regularizers and show results of the type "all stationary points are within statistical error of the ground truth" under certain restricted strong convexity of the smooth risk. Duchi & Ruan (2017); Davis et al. (2017) study the phase retrieval problem with ℓ1 loss, characterizing its nonconvex nonsmooth landscape and providing efficient algorithms. There is a recent surge of work on analyzing one-hidden-layer ReLU networks, which are nonconvex and nonsmooth.
Algorithm-independent characterizations of the landscape are mostly local and require strong initialization procedures (Zhong et al., 2017), whereas stronger global results can be established by designing new loss functions (Ge et al., 2017), relating to PDEs (Mei et al., 2018), or problem-dependent analysis of SGD (Li & Yuan, 2017; Li & Liang, 2018). Our result provides an algorithm-independent characterization of the landscape of nonsmooth dictionary learning, and is almost global in the sense that the initialization condition is satisfied by random initialization with high probability.

Other nonsmooth problems in application. The prevalence of nonsmooth problems in optimal control and economics is evident from all monographs on nonsmooth analysis (Clarke, 1990; Aubin, 1998; Bagirov et al., 2014). In modern machine learning and data analysis, nonsmooth functions are often taken to encode structural information (e.g., sparsity, low-rankness, quantization), or whenever robust estimation is desired. In deep learning, the optimization problem is nonsmooth when nonsmooth activations are in use, e.g., the popular ReLU. The technical ideas around nonsmooth analysis, set-valued analysis, and random set theory that we gather and develop here are particularly relevant to these applications.

2 PRELIMINARIES

Problem setup. Given an unknown orthogonal dictionary A = [a_1, . . . , a_n] ∈ ℝ^{n×n}, we wish to recover A through m observations of the form

    y_i = A x_i,    (2.1)

or Y = AX in matrix form, where X = [x_1, . . . , x_m] and Y = [y_1, . . . , y_m]. The coefficient vectors x_i are sampled from the Bernoulli-Gaussian distribution with parameter θ ∈ (0, 1), denoted BG(θ): each entry x_{ij} is independently drawn from a standard Gaussian with probability θ, and is zero otherwise. The Bernoulli-Gaussian is a good prototype distribution for sparse vectors, as a θ fraction of the entries of x_i are non-zero on average. For any z with i.i.d. Ber(θ) entries, we let Ω denote the set of non-zero indices, which is itself a random set. We assume that n ≥ 3 and θ ∈ [1/n, 1/2]. In particular, θ ≥ 1/n is to require that each x_i has at least one non-zero entry on average.

First-order geometry. We will focus on the first-order geometry of the nonsmooth objective of Eq. (1.1): f(q) = (1/m) Σ_{i=1}^m |q^⊤ y_i|. On the whole Euclidean space ℝ^n, f is convex with subdifferential set

    ∂f(q) = (1/m) Σ_{i=1}^m sign(q^⊤ y_i) y_i,    (2.2)

where sign(·) is the set-valued sign function (i.e., sign(0) = [−1, 1]). As we minimize f subject to the constraint ‖q‖_2 = 1, our problem is no longer convex. The Riemannian subdifferential of f on S^{n−1} is defined as (Hosseini & Uschmajew, 2017):

    ∂_R f(q) := (I − q q^⊤) ∂f(q).    (2.3)

A point q is stationary for problem Eq. (1.1) if 0 ∈ ∂_R f(q). We will not distinguish between local maxima and saddle points: we call a stationary point q a saddle point if there is a descent direction (i.e., a direction along which the function is locally maximized at q).

Set-valued analysis. As the subdifferential is a set-valued mapping, analyzing it requires some set-valued analysis, which we briefly present here. The addition of two sets is defined as the Minkowski summation: X + Y = {x + y : x ∈ X, y ∈ Y}. The expectation of random sets is a straightforward extension of the Minkowski sum allowing any measurable selection procedure; for the concrete definition see (Molchanov, 2013). The Hausdorff distance between two sets is defined as

    d_H(X_1, X_2) := max{ sup_{x_1 ∈ X_1} d(x_1, X_2), sup_{x_2 ∈ X_2} d(x_2, X_1) }.    (2.4)

Basic properties of the Hausdorff distance are provided in Appendix A.1.
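To make these objects concrete, the following minimal NumPy sketch (ours, purely illustrative; the function names are our own) computes one element of the subdifferential in Eq. (2.2), its Riemannian counterpart in Eq. (2.3), and the Hausdorff distance of Eq. (2.4) for finite point sets.

```python
import numpy as np

def subgradient(q, Y):
    """One element of the subdifferential set in Eq. (2.2); at ties (q^T y_i = 0)
    the set-valued sign is resolved by the convention sign(0) = 0."""
    return (np.sign(Y.T @ q) @ Y.T) / Y.shape[1]

def riemannian_subgradient(q, Y):
    """Project the Euclidean subgradient onto the tangent space of the sphere,
    i.e. apply (I - q q^T) as in Eq. (2.3)."""
    g = subgradient(q, Y)
    return g - (q @ g) * q

def hausdorff_distance(X1, X2):
    """Hausdorff distance (2.4) between two finite point sets (rows are points)."""
    D = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())
```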
Notations. Bold small letters (e.g., x) are vectors and bold capitals are matrices (e.g., X). The dotted equality := is used for definitions. For any positive integer k, [k] := {1, . . . , k}. By default, ‖·‖ is the ℓ2 norm when applied to a vector, and the operator norm when applied to a matrix. C and c, or any indexed versions, are reserved for universal constants that may change from place to place.

3 MAIN RESULT

We now state our main result, the recovery guarantee for learning an orthogonal dictionary by solving formulation (1.1).

Theorem 3.1 (Recovering orthogonal dictionary via subgradient descent). Suppose we observe

    m ≥ C n^4 θ^{-2} log^4 n    (3.1)

samples in the dictionary learning problem and we desire an accuracy ε ∈ (0, 1) for recovering the dictionary. With probability at least 1 − exp(−c m θ^3 n^{-3} log^{-3} m) − exp(−c' R/n), an algorithm which runs Riemannian subgradient descent R = C' n log n times with independent random initializations on S^{n−1} outputs a set of vectors {â_1, . . . , â_n} such that, up to permutation and sign change, ‖â_i − a_i‖_2 ≤ ε for all i ∈ [n]. The total number of subgradient descent iterations is bounded by

    C'' R θ^{-16/3} ε^{-8/3} n^4 log^{8/3} n.    (3.2)

Here C, C', C'', c, c' > 0 are universal constants.

At a high level, the proof of Theorem 3.1 consists of the following steps, which we elaborate throughout the rest of this section.

1. Partition the sphere into 2n symmetric "good sets" and show that a certain directional gradient is strong on the population objective E[f] inside the good sets (Section 3.1).
2. Show that the same geometric properties carry over to the empirical objective f with high probability. This involves proving the uniform convergence of the subdifferential set ∂f to E[∂f] (Section 3.2).
3. Under the benign geometry, establish the convergence of Riemannian subgradient descent to one of {±a_i : i ∈ [n]} when initialized in the corresponding good set (Section 3.3).
4. Calling the randomly initialized optimization procedure O(n log n) times recovers all of {a_1, . . . , a_n} with high probability, by a coupon collector's argument (Section 3.4).

Scaling and rotating to identity. Throughout the rest of this paper, we assume WLOG that the dictionary is the identity matrix, i.e., A = I_n, so that Y = X, f(q) = (1/m) ‖q^⊤ X‖_1, and the goal is to find the standard basis vectors {±e_1, . . . , ±e_n}. The case of a general orthogonal A can be reduced to this special case via rotating by A^⊤: q^⊤ Y = q^⊤ A X = (q')^⊤ X where q' = A^⊤ q, and applying the result to q'. We also scale the objective by √(π/2) for convenience of later analysis.

3.1 PROPERTIES OF THE POPULATION OBJECTIVE

We begin by characterizing the geometry of the expected objective E[f]. Recall that we have rotated A to be the identity, so that

    f(q) = √(π/2) · (1/m) Σ_{i=1}^m |q^⊤ x_i|,    ∂f(q) = √(π/2) · (1/m) Σ_{i=1}^m sign(q^⊤ x_i) x_i.    (3.3)

Minimizers and saddles of the population objective. We begin by computing the function value and subdifferential set of the population objective and giving a complete characterization of its stationary points, i.e., local minimizers and saddles.

Proposition 3.2 (Population objective value and gradient). We have

    E[f](q) = √(π/2) · E|q^⊤ x| = E_Ω ‖q_Ω‖,    (3.4)

    ∂E[f](q) = E[∂f](q) = √(π/2) · E[sign(q^⊤ x) x] = E_Ω [ q_Ω/‖q_Ω‖ if q_Ω ≠ 0;  {v_Ω : ‖v_Ω‖ ≤ 1} if q_Ω = 0 ].    (3.5)

Proposition 3.3 (Stationary points). The stationary points of E[f] on the sphere are

    { (1/√k) q : q ∈ {−1, 0, 1}^n, ‖q‖_0 = k, k ∈ [n] }.    (3.6)

The case k = 1 corresponds to the 2n global minimizers q = ±e_i, and all other values of k correspond to saddle points.
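The identity (3.4) is easy to check numerically; the sketch below (ours, with arbitrary parameters) compares Monte Carlo estimates of its two sides at a random point on the sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, trials = 20, 0.3, 200_000

q = rng.standard_normal(n)
q /= np.linalg.norm(q)

# Left-hand side of (3.4): sqrt(pi/2) * E|q^T x| with x ~ BG(theta).
mask = rng.random((trials, n)) < theta
x = mask * rng.standard_normal((trials, n))
lhs = np.sqrt(np.pi / 2) * np.mean(np.abs(x @ q))

# Right-hand side of (3.4): E_Omega ||q_Omega||, with Omega the Bernoulli support.
omega = rng.random((trials, n)) < theta
rhs = np.mean(np.linalg.norm(omega * q, axis=1))

print(lhs, rhs)  # the two estimates agree up to Monte Carlo error
```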
A consequence of Proposition 3.3 is that the population objective has no "spurious local minima": each stationary point is either a global minimizer or a saddle point, though the problem itself is nonconvex due to the constraint.

Identifying 2n good subsets. We now define 2n subsets of the sphere, each containing one of the global minimizers {±e_i} and possessing benign geometry for both the population and empirical objectives, following (Gilboa et al., 2018). For any ζ ∈ [0, ∞) and i ∈ [n] define

    S^{(i+)}_ζ := { q : q_i > 0, q_i^2 / ‖q_{−i}‖_∞^2 ≥ 1 + ζ },    S^{(i−)}_ζ := { q : q_i < 0, q_i^2 / ‖q_{−i}‖_∞^2 ≥ 1 + ζ }.    (3.7)

For points in S^{(i+)}_ζ ∪ S^{(i−)}_ζ, the i-th coordinate is larger in absolute value than all other coordinates, by a multiplicative factor governed by ζ. In particular, for any point in these subsets the largest coordinate is unique, so by Proposition 3.3 all population saddle points are excluded from these 2n subsets. Intuitively, this partition can serve as a "tiebreaker": points in S^{(i+)}_0 are closer to e_i than to all the other 2n − 1 signed basis vectors. Therefore, we hope that optimization algorithms initialized in this region favor e_i over the other standard basis vectors, which we are going to show is indeed the case. For simplicity, we state our geometry results for S^{(n+)}_ζ; by symmetry the results automatically carry over to all the other 2n − 1 subsets.

Theorem 3.4 (Lower bound on directional subgradients). Fix any ζ_0 ∈ (0, 1). We have:

(a) For all q ∈ S^{(n+)}_{ζ_0} and all indices j ≠ n such that q_j ≠ 0,

    inf_{v ∈ ∂_R E[f](q)} ⟨ v, e_j/q_j − e_n/q_n ⟩ ≥ (1/(2n)) · θ(1 − θ) · ζ_0/(1 + ζ_0).    (3.8)

(b) For all q ∈ S^{(n+)}_{ζ_0}, we have that

    inf_{v ∈ ∂_R E[f](q)} ⟨ v, q_n q − e_n ⟩ ≥ (1/8) · θ(1 − θ) · ζ_0 n^{-3/2} ‖q_{−n}‖.    (3.9)

These lower bounds verify our intuition: points inside S^{(n+)}_{ζ_0} have subgradients pointing towards e_n, both in a coordinate-wise sense and in a combined sense: the direction e_n − q_n q is exactly the tangent direction of the sphere at q that points towards e_n.

3.2 BENIGN GEOMETRY OF THE EMPIRICAL OBJECTIVE

We now show that the benign geometry of Theorem 3.4 is carried onto the empirical objective f given sufficiently many samples, using a concentration argument. The key result behind this is the concentration of the empirical subdifferential set to the population subdifferential, where concentration is measured in the Hausdorff distance between sets.

Proposition 3.5 (Uniform convergence of subdifferential). For any t ∈ (0, 1], when

    m ≥ C t^{-2} n log^2(n/t),    (3.10)

with probability at least 1 − exp(−c m θ t^2 / log m), we have

    d_H( ∂f(q), E[∂f](q) ) ≤ t    for all q ∈ S^{n−1}.    (3.11)

Here C, c > 0 are universal constants.

This concentration result guarantees that the subdifferential set is close to its expectation given sufficiently many samples, with high probability. Choosing an appropriate concentration level t, the lower bounds on the directional subgradients carry over to the empirical objective f, which we state in the following theorem.

Theorem 3.6 (Directional subgradient lower bound, empirical objective). There exist universal constants C, c > 0 so that the following holds: for all ζ_0 ∈ (0, 1), when m ≥ C n^4 θ^{-2} ζ_0^{-2} log^2(n/ζ_0), with probability at least 1 − exp(−c m θ^3 ζ_0^2 n^{-3} log^{-1} m), the following properties hold simultaneously for all the 2n subsets { S^{(i+)}_{ζ_0}, S^{(i−)}_{ζ_0} : i ∈ [n] } (stated only for S^{(n+)}_{ζ_0}):

(a) For all q ∈ S^{(n+)}_{ζ_0} and all j ∈ [n] with q_j ≠ 0 and q_n^2/q_j^2 ≤ 3,

    inf_{v ∈ ∂_R f(q)} ⟨ v, e_j/q_j − e_n/q_n ⟩ ≥ (1/(4n)) · θ(1 − θ) · ζ_0/(1 + ζ_0).    (3.12)

(b) For all q ∈ S^{(n+)}_{ζ_0},

    inf_{v ∈ ∂_R f(q)} ⟨ v, q_n q − e_n ⟩ ≥ (√2/16) · θ(1 − θ) · ζ_0 n^{-3/2} ‖q_{−n}‖ ≥ (1/16) · θ(1 − θ) · ζ_0 n^{-3/2} ‖q − e_n‖.    (3.13)
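The directional lower bounds above can be probed numerically. The sketch below (ours; the rejection sampler and all parameters are our own choices, and we use a fairly large ζ_0 = 0.5 so the effect is visible at moderate sample sizes, whereas the analysis later takes ζ_0 = 1/(5 log n)) draws points in S^{(n+)}_{ζ_0} with A = I and checks that a subgradient element has positive inner product with q_n q − e_n, i.e., that the negative subgradient correlates with the tangent direction pointing towards e_n.

```python
import numpy as np

def sample_good_set(n, zeta0, rng):
    """Rejection-sample q in S^{(n+)}_{zeta0}: q_n > 0 and
    q_n^2 >= (1 + zeta0) * max_{j != n} q_j^2."""
    while True:
        q = rng.standard_normal(n)
        q[-1] = abs(q[-1])
        q /= np.linalg.norm(q)
        if q[-1] ** 2 >= (1 + zeta0) * np.max(q[:-1] ** 2):
            return q

rng = np.random.default_rng(0)
n, m, theta, zeta0 = 20, 100_000, 0.3, 0.5
X = (rng.random((n, m)) < theta) * rng.standard_normal((n, m))  # A = I, so Y = X

e_n = np.zeros(n)
e_n[-1] = 1.0
for _ in range(5):
    q = sample_good_set(n, zeta0, rng)
    g = (np.sign(X.T @ q) @ X.T) / m          # subgradient element, Eq. (2.2)
    g_r = g - (q @ g) * q                     # Riemannian subgradient, Eq. (2.3)
    print(np.dot(g_r, q[-1] * q - e_n))       # positive, in line with Theorem 3.6(b)
```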
The consequence of Theorem 3.6 is two-fold. First, it guarantees that the only possible stationary point of f in S^{(n+)}_{ζ_0} is e_n: for every other point q ≠ e_n, property (b) guarantees that 0 ∉ ∂_R f(q), therefore q is non-stationary. Second, the directional subgradient lower bounds allow us to establish convergence of the Riemannian subgradient descent algorithm, in a way similar to showing convergence of unconstrained gradient descent on star strongly convex functions.

We now present an upper bound on the norm of the subdifferential sets, which is needed for the convergence analysis.

Proposition 3.7. There exist universal constants C, c > 0 such that

    sup_{v ∈ ∂f(q)} ‖v‖ ≤ 2    for all q ∈ S^{n−1}    (3.14)

with probability at least 1 − exp(−c m θ / log m), provided that m ≥ C n log n. This in particular implies that

    sup_{v ∈ ∂_R f(q)} ‖v‖ ≤ 2    for all q ∈ S^{n−1}.    (3.15)

3.3 FINDING ONE BASIS VIA RIEMANNIAN SUBGRADIENT DESCENT

The benign geometry of the empirical objective allows a simple Riemannian subgradient descent algorithm to find one basis vector at a time. The Riemannian subgradient descent algorithm with initialization q^(0) and step sizes {η^(k)}_{k ≥ 0} is as follows: for an arbitrary v^(k) ∈ ∂_R f(q^(k)),

    q^(k+1) = ( q^(k) − η^(k) v^(k) ) / ‖ q^(k) − η^(k) v^(k) ‖,    for k = 0, 1, 2, . . . .    (3.16)

Each iteration moves in an arbitrary Riemannian subgradient direction, followed by a projection back onto the sphere. We show that the algorithm is guaranteed to find one basis vector as long as the initialization is in the right region. To give a concrete result, we set ζ_0 = 1/(5 log n).^2

Theorem 3.8 (One run of subgradient descent recovers one basis). Let m ≥ C θ^{-2} n^4 log^4 n and ε ∈ (0, 2θ/25]. With probability at least 1 − exp(−c m θ^3 n^{-3} log^{-3} m) the following happens. If the initialization q^(0) ∈ S^{(n+)}_{1/(5 log n)}, and we run the projected Riemannian subgradient descent with step size η^(k) = k^{-α}/(100√n) with α ∈ (0, 1/2), and keep track of the best function value so far until after iterate K is performed, producing q_best, then q_best obeys

    f(q_best) − f(e_n) ≤ ε    and    ‖q_best − e_n‖ ≤ 16 ε / (θ(1 − θ)),    (3.17)

provided that

    K^{1−α} ≥ 32000 n^{5/2} log n / ((1 − α) θ(1 − θ) ε)    and    K^{α} ≥ C'' n^{3/2} log n / ((1 − 2α) θ(1 − θ) ε).    (3.18)

In particular, choosing α = 3/8 < 1/2, it suffices to let

    K ≥ K_{3/8} := C' θ^{-8/3} ε^{-8/3} n^4 log^{8/3} n.    (3.19)

Here C, C', C'', c > 0 are universal constants.

The above optimization result (Theorem 3.8) shows that Riemannian subgradient descent is able to find the basis vector e_n when initialized in the associated region S^{(n+)}_{1/(5 log n)}. We now show that a simple uniformly random initialization on the sphere is guaranteed to be in one of these 2n regions with probability at least 1/2.

Lemma 3.9 (Random initialization falls in a "good set"). Let q^(0) ∼ Uniform(S^{n−1}). Then with probability at least 1/2, q^(0) belongs to one of the 2n sets { S^{(i+)}_{1/(5 log n)}, S^{(i−)}_{1/(5 log n)} : i ∈ [n] }.

3.4 RECOVERING ALL BASES FROM MULTIPLE RUNS

As long as the initialization belongs to S^{(i+)}_{1/(5 log n)} or S^{(i−)}_{1/(5 log n)}, our finding-one-basis result in Theorem 3.8 guarantees that Riemannian subgradient descent will converge to e_i or −e_i, respectively. Therefore, if we run the algorithm with independent, uniformly random initializations on the sphere multiple times, by a coupon collector's argument we will recover all the basis vectors. This is formalized in the following theorem.

Theorem 3.10 (Recovering the identity dictionary from multiple random initializations). Let m ≥ C n^4 θ^{-2} log^4 n and ε ∈ (0, 1). With probability at least 1 − exp(−c m θ^3 n^{-3} log^{-3} m) the following happens. Suppose we run the Riemannian subgradient descent algorithm independently R times, each with a uniformly random initialization on S^{n−1}, and choose the step size as η^(k) = k^{-3/8}/(100√n). Then, provided that R ≥ C' n log n, all standard basis vectors will be recovered up to ε accuracy with probability at least 1 − exp(−c' R/n) in C'' R θ^{-16/3} ε^{-8/3} n^4 log^{8/3} n iterations. Here C, C', C'', c, c' > 0 are universal constants.
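To make the procedure concrete, here is a short NumPy sketch (ours, not the authors' implementation; all names and parameter values are illustrative) of the update (3.16), tracking the best objective value as in Theorem 3.8. By default it uses the more aggressive step size η^(k) = 1/√k employed in the experiments of Section 5; the schedule η^(k) = k^{-α}/(100√n) analyzed in Theorem 3.8 is far more conservative.

```python
import numpy as np

def riemannian_subgradient_descent(Y, q0, K, alpha=0.5, eta0=1.0):
    """Run update (3.16) for K iterations with step size eta_k = eta0 * k^(-alpha),
    returning the iterate with the best (smallest) objective value seen."""
    n, m = Y.shape
    q = q0.copy()
    q_best, f_best = q.copy(), np.mean(np.abs(q @ Y))
    for k in range(1, K + 1):
        g = (np.sign(Y.T @ q) @ Y.T) / m      # a subgradient element, Eq. (2.2)
        v = g - (q @ g) * q                   # Riemannian subgradient, Eq. (2.3)
        q = q - eta0 * k ** (-alpha) * v
        q /= np.linalg.norm(q)                # project back onto the sphere
        f = np.mean(np.abs(q @ Y))
        if f < f_best:
            q_best, f_best = q.copy(), f
    return q_best

# One run from a uniformly random initialization. By Lemma 3.9, with probability
# at least 1/2 the initialization lands in one of the 2n good sets, and the run
# then converges to the corresponding signed dictionary element.
rng = np.random.default_rng(0)
n, m, theta = 30, 20_000, 0.2
A, _ = np.linalg.qr(rng.standard_normal((n, n)))
Y = A @ ((rng.random((n, m)) < theta) * rng.standard_normal((n, m)))
q0 = rng.standard_normal(n)
q0 /= np.linalg.norm(q0)
q_best = riemannian_subgradient_descent(Y, q0, K=2000)
print(np.max(np.abs(A.T @ q_best)))  # close to 1 when q_best aligns with a column of A
```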
When the dictionary A is not the identity matrix, we can apply the rotation argument sketched at the beginning of this section to get the same result, which leads to our main result in Theorem 3.1.

^2 It is possible to set ζ_0 to other values, inducing different combinations of the final sample complexity, iteration complexity, and repetition complexity in Theorem 3.10.

4 PROOF HIGHLIGHTS

A key technical challenge is establishing the uniform convergence of subdifferential sets in Proposition 3.5, which we now elaborate. Recall that the population and empirical subdifferentials are

    ∂f(q) = √(π/2) · (1/m) Σ_{i=1}^m sign(q^⊤ x_i) x_i,    E[∂f](q) = √(π/2) · E_{x ∼ BG(θ)}[sign(q^⊤ x) x],    (4.1)

and we wish to show that the difference between ∂f(q) and E[∂f](q) is small uniformly over q ∈ Q = S^{n−1}. Two challenges stand out in showing such uniform convergence:

1. The subdifferential is set-valued and random, and it is unclear a priori how one could formulate and analyze the concentration of random sets.
2. The usual covering argument won't work here, as the Lipschitz gradient property does not hold: ∂f(q) and E[∂f](q) are not Lipschitz in q. Therefore, no matter how finely we cover the sphere in Euclidean distance, points not in this covering can have radically different subdifferential sets.

4.1 CONCENTRATION OF RANDOM SETS

We state and analyze concentration of random sets in the Hausdorff distance (defined in Section 2). We now illustrate why the Hausdorff distance is the right distance to consider for concentration of subdifferentials: the reason is that the Hausdorff distance is closely related to the support function of sets, which for any set S ⊆ ℝ^n is defined as

    h_S(u) := sup_{x ∈ S} ⟨x, u⟩.    (4.2)

For convex compact sets, the sup difference between their support functions is exactly the Hausdorff distance.

Lemma 4.1 (Section 1.3.2, Molchanov (2013)). For convex compact sets X, Y ⊆ ℝ^n, we have

    d_H(X, Y) = sup_{u ∈ S^{n−1}} |h_X(u) − h_Y(u)|.    (4.3)

Lemma 4.1 is convenient for us in the following sense. Suppose we wish to upper bound the difference of ∂f(q) and E[∂f](q) along some direction u ∈ S^{n−1} (as we need in proving the key empirical geometry result, Theorem 3.6). As both subdifferential sets are convex and compact, by Lemma 4.1 we immediately have

    inf_{g ∈ ∂f(q)} ⟨g, u⟩ − inf_{g ∈ E[∂f](q)} ⟨g, u⟩ = −h_{∂f(q)}(−u) + h_{E[∂f](q)}(−u) ≤ d_H( ∂f(q), E[∂f](q) ).    (4.4)

Therefore, as long as we are able to bound the Hausdorff distance, all directional differences between the subdifferentials are simultaneously bounded, which is exactly what we need in order to carry the benign geometry from the population to the empirical objective.

4.2 COVERING IN THE d_E METRIC

We argue that the absence of gradient Lipschitzness arises because the Euclidean distance is not the right metric for this problem. Think of the toy example f(x) = |x|, whose subdifferential set ∂f(x) = sign(x) is not Lipschitz across x = 0. However, once we partition ℝ into ℝ_{>0}, ℝ_{<0} and {0} (i.e., according to the sign pattern), the subdifferential set is Lipschitz on each subset.
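Lemma 4.1 can be sanity-checked numerically. The sketch below (ours) uses two Euclidean balls, for which both the Hausdorff distance and the support functions have elementary closed forms, and verifies that the supremum of support-function differences over sampled directions approaches d_H.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
c1, c2 = rng.standard_normal(n), rng.standard_normal(n)
r1, r2 = 0.7, 1.3

# Closed-form Hausdorff distance between the balls B(c1, r1) and B(c2, r2).
d_H = np.linalg.norm(c1 - c2) + abs(r1 - r2)

# Support function of a ball: h(u) = <c, u> + r for unit directions u.
U = rng.standard_normal((100_000, n))
U /= np.linalg.norm(U, axis=1, keepdims=True)
support_gap = np.max(np.abs((U @ c1 + r1) - (U @ c2 + r2)))

print(d_H, support_gap)  # support_gap approaches d_H as more directions are sampled
```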
The situation with the dictionary learning objective is quite similar to the toy example above: we resolve the gradient non-Lipschitzness by proposing a stronger metric d_E on the sphere which is sign-pattern aware and averages all subset angles between two points. Formally, we define d_E as

    d_E(p, q) := P_{x ∼ BG(θ)}[ sign(p^⊤ x) ≠ sign(q^⊤ x) ] = (1/π) E_Ω ∠(p_Ω, q_Ω)    (4.5)

(the second equality is shown in Lemma C.1). Our plan is to perform the covering argument in d_E, which requires showing gradient Lipschitzness in d_E and bounding the covering number.

Lipschitzness of ∂f and E[∂f] in d_E. For the population subdifferential E[∂f], note that E[∂f](q) = E_{x ∼ BG(θ)}[sign(q^⊤ x) x] (modulo rescaling). Therefore, to bound d_H(E[∂f](p), E[∂f](q)) by Lemma 4.1, we have, for all u ∈ S^{n−1},

    | h_{E[∂f](p)}(u) − h_{E[∂f](q)}(u) | = | E[ sup ( sign(p^⊤ x) − sign(q^⊤ x) ) x^⊤ u ] |    (4.6)
        ≤ 2 E[ 1{ sign(p^⊤ x) ≠ sign(q^⊤ x) } |x^⊤ u| ].    (4.7)

As long as d_E(p, q) ≤ ε, the indicator is non-zero with probability at most ε, and thus the above expectation should also be small; we bound it by O(ε √(log(1/ε))) in Lemma F.5. To show the same for the empirical subdifferential ∂f, one only needs to bound the observed proportion of sign differences for all p, q such that d_E(p, q) ≤ ε, which by a VC dimension argument is uniformly bounded by 2ε with high probability (Lemma C.5).

Bounding the covering number in d_E. Our first step is to reduce d_E to the maximum length-2 angle (the d_2 metric) over any consistent support pattern. This is achieved through the following vector angle inequality (Lemma C.2): for any p, q ∈ ℝ^d (d ≥ 3), we have

    ∠(p, q) ≤ Σ_{Ω ⊆ [d], |Ω|=2} ∠(p_Ω, q_Ω),    provided ∠(p, q) ≤ π/2.    (4.8)

Therefore, as long as sign(p) = sign(q) (coordinate-wise) and max_{|Ω|=2} ∠(p_Ω, q_Ω) ≤ ε/n^2, we have for all |Ω| ≥ 3 that ∠(p_Ω, q_Ω) ≤ π/2 and

    ∠(p_Ω, q_Ω) ≤ Σ_{Ω' ⊆ Ω, |Ω'|=2} ∠(p_{Ω'}, q_{Ω'}) ≤ (|Ω| choose 2) · ε/n^2 ≤ ε.    (4.9)

By Eq. (4.5), the above implies that d_E(p, q) ≤ ε/π, the desired result. Hence the task reduces to constructing an η = ε/n^2 covering in d_2 over any consistent sign pattern. Our second step is a tight bound on this covering number: the η-covering number in d_2 is bounded by exp(C n log(n/η)) (Lemma C.3). A first thought for bounding this would be to take coverings in all size-2 angles (there are (n choose 2) of them) and take the common refinement of all their partitions, which gives covering number (C/η)^{O(n^2)} = exp(C n^2 log(1/η)). We improve upon this strategy by sorting the coordinates of p and restricting attention to the consecutive size-2 angles after the sorting (there are n − 1 of them). We show that a proper covering of these consecutive size-2 angles at level η/n yields a covering of all size-2 angles at level η. The corresponding covering number in this case is thus (Cn/η)^{O(n)} = exp(C n log(n/η)), which modulo the log n factor is the tightest we can get.

5 EXPERIMENTS

Setup. We set the true dictionary A to be the identity and random orthogonal matrices, respectively. For each choice, we sweep the combinations of (m, n) with n ∈ {30, 50, 70, 100} and m = 10 n^p for p ∈ {0.5, 1, 1.5, 2, 2.5}, and fix the sparsity level at θ = 0.1, 0.3, 0.5, respectively. For each (m, n) pair, we generate 10 problem instances, corresponding to re-sampling the coefficient matrix X 10 times. Note that our theoretical guarantee applies for m = Ω̃(n^4), and the sample complexities we experiment with here are lower than what our theory requires. To recover the dictionary, we run the Riemannian subgradient descent algorithm of Eq. (3.16) with decaying step size η^(k) = 1/√k, corresponding to the boundary case α = 1/2 in Theorem 3.8 with a much larger base step size.
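A scaled-down version of this protocol can be sketched in a few lines of NumPy (ours; the problem sizes, iteration count, and the relaxed success tolerance are our choices so that the demo runs quickly, and the paper's metric, described next, is stricter).

```python
import numpy as np

def one_run(Y, K, rng):
    """One run of Riemannian subgradient descent (3.16) with eta_k = 1/sqrt(k)."""
    n, m = Y.shape
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    q_best, f_best = q.copy(), np.mean(np.abs(q @ Y))
    for k in range(1, K + 1):
        g = (np.sign(Y.T @ q) @ Y.T) / m
        q = q - (g - (q @ g) * q) / np.sqrt(k)
        q /= np.linalg.norm(q)
        f = np.mean(np.abs(q @ Y))
        if f < f_best:
            q_best, f_best = q.copy(), f
    return q_best

rng = np.random.default_rng(0)
n, theta = 20, 0.3
m = 50 * n ** 2                                  # comfortably inside the observed success region
A, _ = np.linalg.qr(rng.standard_normal((n, n)))
Y = A @ ((rng.random((n, m)) < theta) * rng.standard_normal((n, m)))

R = int(round(5 * n * np.log(n)))
found = np.zeros(n, dtype=bool)
for _ in range(R):
    q_best = one_run(Y, K=400, rng=rng)
    corr = np.abs(A.T @ q_best)                  # |<a_i, q_best>| for each column
    if corr.max() >= 0.99:                       # relaxed tolerance for this quick demo
        found[int(np.argmax(corr))] = True
print("all dictionary elements recovered:", bool(found.all()))
```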
Metric. As Theorem 3.1 guarantees recovering the entire dictionary with R ≥ C n log n independent runs, we perform R = round(5 n log n) runs on each instance. For each run, a true dictionary element a_i is considered to be found if min_{s ∈ {±1}} ‖a_i − s · q_best‖ ≤ 10^{-3}. For each instance, we regard it as a successful recovery if the R = round(5 n log n) runs have found all the dictionary elements, and we report the empirical success rate over the 10 instances.

Result. From our simulations, Riemannian subgradient descent succeeds in recovering the dictionary as long as m ≥ C n^2 (Fig. 2), across different sparsity levels θ. The dependency on n is consistent with our theory and suggests that the actual sample complexity requirement for guaranteed recovery might be even lower than the Õ(n^4) we established.^3 The Õ(n^2) rate we observe also matches the results based on the SOS method (Barak et al., 2015; Ma et al., 2016; Schramm & Steurer, 2017). Moreover, the problem seems to become harder as θ grows, evident from the observation that the success transition threshold is pushed to the right.

Figure 1: Empirical success rates of recovery of the Riemannian subgradient descent with R = 5 n log n runs, averaged over 10 instances. Left to right: identity dictionaries with θ = 0.1, 0.3, 0.5. (Each panel plots the fraction of successful recovery against the number of measurements, expressed as a power of n, for n = 30, 50, 70, 100.) See Appendix G for the results on orthogonal dictionaries, which have qualitatively the same behaviors.

Additional experiments. A faster alternative algorithm for large-scale instances is tested in Appendix H. A complementary experiment on real images is included as Appendix I.

6 CONCLUSION AND FUTURE DIRECTIONS

This paper presents the first theoretical guarantee for orthogonal dictionary learning using subgradient descent on a natural ℓ1 minimization formulation. Along the way, we develop tools for analyzing the optimization landscape of nonconvex nonsmooth functions, which could be of broader interest.

For future work, there is an O(n^2) sample complexity gap between what we established in Theorem 3.1 and what we observed in the simulations, alongside previous results based on the SOS method (Barak et al., 2015; Ma et al., 2016; Schramm & Steurer, 2017). As our main geometric result Theorem 3.6 already achieves tight bounds on the directional derivatives, further sample complexity improvements could potentially come from utilizing second-order information such as the strong negative curvature (Lemma B.2), or from careful algorithm-dependent analysis.

While our result applies only to (complete) orthogonal dictionaries, a natural question is whether we can generalize to overcomplete dictionaries. To date, the only known provable algorithms for learning overcomplete dictionaries in the linear sparsity regime are based on the SOS method (Barak et al., 2015; Ma et al., 2016; Schramm & Steurer, 2017). We believe that our nonsmooth analysis has the potential to handle overcomplete dictionaries: for reasonably well-conditioned overcomplete dictionaries A, each a_i (column of A) makes a_i^⊤ A approximately 1-sparse, and so a_i^⊤ A X gives a noisy estimate of a certain row of X; the same formulation as Eq. (1.1) intuitively still works. We would like to leave that to future work.

^3 The Õ(·) notation ignores the dependency on logarithmic terms and other factors.
Nonsmooth phase retrieval and deep networks with ReLU, mentioned in Section 1.1, are examples of the many nonsmooth, nonconvex problems encountered in practice. Most existing theoretical results on these problems tend to be technically vague about handling the nonsmooth points: they either prescribe a rule for choosing a subgradient element, which effectively disconnects theory and practice because numerical testing of nonsmooth points is often not reliable, or ignore the nonsmooth points altogether, assuming that in practice numerical methods would never touch these points; this sounds intuitive, but no formalism of it appears in the relevant literature yet. Besides our work, Laurent & von Brecht (2017); Kakade & Lee (2018) also warn about potential problems of ignoring nonsmooth points when studying optimization of nonsmooth functions in machine learning.

REFERENCES

Alekh Agarwal, Animashree Anandkumar, and Praneeth Netrapalli. A clustering approach to learning sparsely used overcomplete dictionaries. IEEE Transactions on Information Theory, 63(1):575-592, 2017.

Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In Conference on Learning Theory, pp. 779-806, 2014.

Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. In Conference on Learning Theory, 2015.

Jean-Pierre Aubin. Optima and equilibria: an introduction to nonlinear analysis, volume 140. Springer Science & Business Media, 1998.

Pranjal Awasthi and Aravindan Vijayaraghavan. Towards learning sparsely used dictionaries with arbitrary supports. arXiv preprint arXiv:1804.08603, 2018.

Adil Bagirov, Napsu Karmitsa, and Marko M. Mäkelä. Introduction to Nonsmooth Optimization: theory, practice and software. Springer, 2014.

Boaz Barak, Jonathan A. Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 143-151. ACM, 2015.

Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.

James V. Burke, Adrian S. Lewis, and Michael L. Overton. A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM Journal on Optimization, 15(3):751-779, 2005. doi: 10.1137/030601296.

James V. Burke, Frank E. Curtis, Adrian S. Lewis, Michael L. Overton, and Lucas E. A. Simões. Gradient sampling methods for nonsmooth optimization. arXiv:1804.11003, 2018.

Emmanuel Candès. Mathematics of sparsity (and a few other things). In Proceedings of the International Congress of Mathematicians, volume 123, 2014.

Niladri Chatterji and Peter L. Bartlett. Alternating minimization for dictionary learning with random initialization. In Advances in Neural Information Processing Systems, pp. 1997-2006, 2017.

Frank H. Clarke. Optimization and nonsmooth analysis. SIAM, 1990. doi: 10.1137/1.9781611971309.

Frank E. Curtis and Xiaocun Que. A quasi-Newton algorithm for nonconvex, nonsmooth optimization with global convergence guarantees. Mathematical Programming Computation, 7(4):399-428, 2015. doi: 10.1007/s12532-015-0086-2.

Frank E. Curtis, Tim Mitchell, and Michael L. Overton. A BFGS-SQP method for nonsmooth, nonconvex, constrained optimization and its evaluation using relative minimization profiles. Optimization Methods and Software, 32(1):148-181, 2017.
Damek Davis, Dmitriy Drusvyatskiy, and Courtney Paquette. The nonsmooth landscape of phase retrieval. arXiv:1711.03247, 2017. URL http://arxiv.org/abs/1711.03247.

John C. Duchi and Feng Ruan. Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. arXiv:1705.02356, 2017. URL http://arxiv.org/abs/1705.02356.

Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv:1711.00501, 2017. URL http://arxiv.org/abs/1711.00501.

Dar Gilboa, Sam Buchanan, and John Wright. Efficient dictionary learning with gradient descent. In ICML 2018 Workshop on Modern Trends in Nonconvex Optimization for Machine Learning, 2018. URL https://arxiv.org/abs/1809.10313.

P. Grohs and S. Hosseini. Nonsmooth trust region algorithms for locally Lipschitz functions on Riemannian manifolds. IMA Journal of Numerical Analysis, 36(3):1167-1192, 2015. doi: 10.1093/imanum/drv043.

Philipp Grohs and Seyedehsomayeh Hosseini. ε-subgradient algorithms for locally Lipschitz functions on Riemannian manifolds. Advances in Computational Mathematics, 42(2):333-360, 2016. doi: 10.1007/s10444-015-9426-z.

Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of Convex Analysis. Springer-Verlag Berlin Heidelberg, 2001. doi: 10.1007/978-3-642-56468-0.

S. Hosseini and M. R. Pouryayevali. Generalized gradients and characterization of epi-Lipschitz sets in Riemannian manifolds. Nonlinear Analysis: Theory, Methods & Applications, 74(12):3884-3895, 2011. doi: 10.1016/j.na.2011.02.023.

Seyedehsomayeh Hosseini. Optimality conditions for global minima of nonconvex functions on Riemannian manifolds. Pacific Journal of Optimization, 2015. URL http://uschmajew.ins.uni-bonn.de/research/pub/hosseini/3.pdf.

Seyedehsomayeh Hosseini and André Uschmajew. A Riemannian gradient sampling algorithm for nonsmooth optimization on manifolds. SIAM Journal on Optimization, 27(1):173-189, 2017. doi: 10.1137/16M1069298.

Sham Kakade and Jason D. Lee. Provably correct automatic subdifferentiation for qualified programs. arXiv:1809.08530, 2018.

Thomas Laurent and James von Brecht. The multilinear structure of ReLU networks. arXiv:1712.10132, 2017. URL http://arxiv.org/abs/1712.10132.

Yu Ledyaev and Qiji Zhu. Nonsmooth analysis on smooth manifolds. Transactions of the American Mathematical Society, 359(8):3687-3732, 2007. doi: 10.1090/S0002-9947-07-04075-5.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv:1808.01204, 2018.

Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. arXiv:1705.09886, 2017. URL http://arxiv.org/abs/1705.09886.

Po-Ling Loh and Martin J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559-616, 2015.

Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), 2016.

Julien Mairal, Francis Bach, and Jean Ponce. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3):85-283, 2014.

Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layers neural networks. arXiv:1804.06561, 2018.

Ilya Molchanov. Foundations of stochastic geometry and theory of random sets. In Stochastic Geometry, Spatial Statistics and Random Fields, pp. 1-20. Springer, 2013.
Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, 1996.

Barrett O'Neill. Semi-Riemannian geometry with applications to relativity, volume 103. Academic Press, 1983.

Saiprasad Ravishankar and Yoram Bresler. Learning sparsifying transforms. IEEE Transactions on Signal Processing, 61(5):1072-1086, 2013.

Tselil Schramm and David Steurer. Fast and robust tensor decomposition with applications to dictionary learning. arXiv:1706.08672, 2017.

Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Conference on Learning Theory, 2012.

Shlomo Sternberg. Dynamical Systems. Dover Publications, Inc., 2013.

Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere. arXiv:1504.06785, 2015. URL http://arxiv.org/abs/1504.06785.

Aad van der Vaart and Jon A. Wellner. A note on bounds for VC dimensions. Institute of Mathematical Statistics Collections, 5:103, 2009.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018. doi: 10.1017/9781108231596.

Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv:1706.03175, 2017. URL http://arxiv.org/abs/1706.03175.

A TECHNICAL TOOLS

A.1 HAUSDORFF DISTANCE

We need the Hausdorff metric to measure differences between nonempty sets. For any set X and point p in ℝ^n, the point-to-set distance is defined as

    d(p, X) := inf_{x ∈ X} ‖x − p‖.    (A.1)

For any two sets X_1, X_2 ⊆ ℝ^n, the Hausdorff distance is defined as

    d_H(X_1, X_2) := max{ sup_{x_1 ∈ X_1} d(x_1, X_2), sup_{x_2 ∈ X_2} d(x_2, X_1) }.    (A.2)

When X_1 is a singleton, say X_1 = {p}, this reduces to

    d_H({p}, X_2) = sup_{x_2 ∈ X_2} ‖x_2 − p‖.    (A.3)

Moreover, for any sets X_1, X_2, Y_1, Y_2 ⊆ ℝ^n,

    d_H(X_1 + Y_1, X_2 + Y_2) ≤ d_H(X_1, X_2) + d_H(Y_1, Y_2).    (A.4)

On the collection of nonempty, compact subsets of ℝ^n, the Hausdorff metric is a valid metric; in particular, it obeys the triangle inequality: for nonempty, compact subsets X, Y, Z ⊆ ℝ^n,

    d_H(X, Z) ≤ d_H(X, Y) + d_H(Y, Z).    (A.5)

See, e.g., Sec. 7.1 of Sternberg (2013) for a proof.

Lemma A.1 (Restatement of Lemma 4.1). For convex compact sets X, Y ⊆ ℝ^n, we have

    d_H(X, Y) = sup_{u ∈ S^{n−1}} |h_X(u) − h_Y(u)|,    (A.6)

where h_S(u) := sup_{x ∈ S} ⟨x, u⟩ is the support function associated with the set S.

A.2 SUB-GAUSSIAN RANDOM MATRICES AND PROCESSES

Proposition A.2 (Talagrand's comparison inequality, Corollary 8.6.3 and Exercise 8.6.5 of Vershynin (2018)). Let {X_x}_{x ∈ T} be a zero-mean random process on a subset T ⊆ ℝ^n. Assume that for all x, y ∈ T we have

    ‖X_x − X_y‖_{ψ_2} ≤ K ‖x − y‖.    (A.7)

Then, for any t > 0,

    sup_{x ∈ T} |X_x| ≤ C K [ w(T) + t · rad(T) ]    (A.8)

with probability at least 1 − 2 exp(−t^2). Here w(T) := E_{g ∼ N(0, I)} sup_{x ∈ T} ⟨x, g⟩ is the Gaussian width of T and rad(T) := sup_{x ∈ T} ‖x‖ is the radius of T.

Proposition A.3 (Deviation inequality for sub-Gaussian matrices, Theorem 9.1.1 and Exercise 9.1.8 of Vershynin (2018)). Let A be an n × m matrix whose rows A_i are independent, isotropic, and sub-Gaussian random vectors in ℝ^m. Then for any subset T ⊆ ℝ^m, we have

    P[ sup_{x ∈ T} | ‖Ax‖ − √n ‖x‖ | > C K^2 ( w(T) + t · rad(T) ) ] ≤ 2 exp(−t^2).    (A.9)

Here K = max_i ‖A_i‖_{ψ_2}.
B PROOFS FOR SECTION 3.1 B.1 PROOF OF PROPOSITION 3.2 E [f] (q) = rπ 2 E q x = rπ 2 E q ΩxΩ = E [ qΩ 2] , (B.1) where the last equality is obtained by conditioning on Ωand the fact that EZ N(0,σ2)[|Z|] = p The subdifferential expression comes from qΩ qΩ 2 , qΩ = 0; {vΩ: vΩ 2 1} , qΩ= 0, (B.2) and the fact that E [f] (q) = E [ qΩ 2] = E [ qΩ 2] as the sub-differential and expectation can be exchanged for convex functions (Hiriart-Urruty & Lemarchal, 2001). By the same exchangability result, we also have E [ f(q)] = E [f] (q). B.2 PROOF OF PROPOSITION 3.3 We first show that points in the claimed set are indeed stationary points by taking the choice vΩ= 0 in Eq. (3.5), giving the subgradient choice E [ f] (q) = E [qΩ/ qΩ 2 1 {qΩ = 0}]. Let q S and such that q 0 = k. For all j supp(q), we have e T j E [ f] (q) = θqj EΩ 1 qΩ 1{j Ω} (B.3) i=1 θi 1(1 θ)k i i=1 θi(1 θ)k i 1 i .= c(θ, k)qj (B.5) On the other hand, for all j / supp(q), we always have [qΩ]j = 0, so e j E [ f] (q) = 0. Therefore, we have that E [ f] (q) = c(θ, k)q, and so (I qq )E [ f] (q) = c(θ, k)q c(θ, k)q = 0. (B.6) Therefore q S is stationary. To see that { ei : i [n]} are the global minima, note that for all q Sn 1, we have E [f] (q) = E [ qΩ 2] E h qΩ 2 2 i = θ. (B.7) Equality holds if and only if qΩ 2 {0, 1} almost surely, which is only satisfied at q { ei : i [n]}. To see that the other q s are saddles, we only need to show that there exists a tangent direction along which q is local max. Indeed, for any other q, there exists at least two non-zero entries (with equal absolute value): WLOG assume that q1 = qn > 0. Using the reparametrization in Appendix B.3 and applying Lemma B.2, we get that E [f] (q) is directionally differentiable along [ q n; 1 q2 n qn ], with derivative zero (necessarily, because 0 E [ Rf] (q)) and strictly negative second derivative. Published as a conference paper at ICLR 2019 Therefore E [f] (q) is locally maximized at q along this tangent direction, which shows that q is a saddle point. The other direction (all other points are not stationary) is implied by Theorem 3.4, which guarantees that 0 / E [ Rf] (q) whenever q / S. Indeed, as long as q / S, q has a max absolute value coordinate (say n) and another non-zero coordinate with strictly smaller absolute value (say j). For this pair of indices, the proof of Theorem 3.4(a) goes through for index j (even if q S(n+) ζ0 does not necessarily hold because the max index might not be unique), which implies that 0 / E [ Rf] (q). B.3 REPARAMETRIZATION For analysis purposes, we introduce the reparametrization w = q1:(n 1) in the region S(n+) 0 , following (Sun et al., 2015) . With this reparametrization, the problem becomes minimizew Rn 1 g (w) .= rπ 1 subject to w The constraint comes from the fact that qn 1/ n and thus w p Lemma B.1. We have Ex iid BG(θ)g (w) = (1 θ) EΩ wΩ + θEΩ q 1 wΩc 2. (B.9) Proof. Direct calculation gives Ex iid BG(θ)g (w) (B.10) = EΩ iid Ber(θ),ω Ber(θ)Ez iid N(0,1),z N(0,1) 1 w 2 ([Ω; ω] [z; z]) = (1 θ) EΩ iid Ber(θ)Ez iid N(0,1) w Ωz + θEΩ iid Ber(θ)Ez iid N(0,1),z N(0,1) 1 w 2z (B.12) = (1 θ) EΩ iid Ber(θ) wΩ + θEΩ iid Ber(θ) q 1 wΩc 2, (B.13) as claimed. Lemma B.2 (Negative-curvature region). For all unit vector v Sn 1 and all s (0, 1), let hv (s) .= E [g] (sv) (B.14) it holds that 2hv (s) θ (1 θ) . (B.15) In other words, for all w = 0, w/ w is a direction of negative curvature. Proof. By Lemma B.1, hv (s) = (1 θ) s EΩ vΩ + θEΩ q 1 s2 vΩc 2. 
(B.16) For s (0, 1), hv(s) is twice differentiable, and we have 2hv(s) = θEΩ vΩc 2 1 s2 vΩc 2 3/2 (B.17) θEΩ vΩc 2 = θ (1 θ) , (B.18) completing the proof. Published as a conference paper at ICLR 2019 Lemma B.3 (Inward gradient). For any w with w 2 + w 2 1, Dc w/ w E [g] (w) θ (1 θ) 1/ q 1 + w 2 / w 2 w . (B.19) Proof. For any unit vector v Rn 1, define hv (t) .= E [g] (tv) for t (0, 1). We have from Lemma B.1 hv (t) = (1 θ) t EΩ vΩ + θEΩ q 1 t2 vc Ω 2. (B.20) thv (t) = (1 θ) EΩ vΩ θEΩ t vc Ω 2 q 1 t2 vc Ω 2 (B.21) = (1 θ) EΩ vΩ 2 vΩ θEΩ t vc Ω 2 q 1 t2 vc Ω 2 (assuming 0 0 .= 0) (B.22) i=1 EΩ v2 i 1 {i Ω} q v2 i 1 {i Ω} + vΩ\i 2 θ i=1 EΩ tv2 i 1 {i / Ω} q 1 t2v2 i 1 {i / Ω} t2 vΩc\i 2 v2 i + vΩ\i 2 tv2 i q 1 t2v2 i t2 vΩc\i 2 = θ (1 θ) t i=1 v2 i EΩ t2v2 i + t2 vΩ\i 2 1 q 1 t2v2 i t2 vΩc\i 2 = θ (1 θ) t i=1 v2 i EΩ t2v2 i + t2 vΩ\i 2 1 q 1 t2 v 2+t2 vΩ\i 2 We are interested in the regime of t so that 1 t2 v 2 t2 v 2 = t 1/ q 1 + v 2 . (B.27) So thv (t) 0 holds always for t 1/ q By Lemma B.2, 2hv (t) θ (1 θ) over t (0, 1), which implies thv (t1) thv (t2) , t1 t2 θ (1 θ) (t1 t2)2 . (B.28) Taking t1 = 1/ q 1 + v 2 and considering t2 [0, t1], we have thv (t2) thv (t1) + θ (1 θ) (t1 t2) θ (1 θ) 1/ q For any w, applying the above result to the unit vector w/ w and recognizing that thw/ w (t) = Dw/ w g (w) = Dc w/ w g (w), we complete the proof. B.4 PROOF OF THEOREM 3.4 We first show Eq. (3.9) using the reparametrization in Appendix B.3. We have Rf (q) , q en = f (q) , qnq en = qn g (w) , w , (B.30) Published as a conference paper at ICLR 2019 where the second equality follows by differentiating g via the chain rule. Now, by Lemma B.3, qn E [ g] (w) , w w θ (1 θ) qn For each radial direction v .= w/ w , consider points of the form tv with t 1/ q 1 + v 2 . Obviously, the function h (t) .= qn (tv) tv 2 + tv 2 is monotonically decreasing wrt t. Thus, to derive a lower bound, it is enough to consider the largest t allowed. In S(n+) ζ0 , the limit amounts to requiring q2 n/ w 2 = 1 + ζ0, 1 t2 0 = t2 0 v 2 (1 + ζ0) = t0 = 1 q 1 + (1 + ζ0) v 2 So for any fixed v and all allowed t for points in S(n+) ζ0 , a uniform lower bound is 1 + (1 + ζ0) v 2 1 8 nζ0 v 2 1 8ζ0n 3/2. (B.35) So we conclude that for all q S(n+) ζ0 , E [ f] (q) , qnq en 1 8θ (1 θ) ζ0n 3/2 w = 1 8θ (1 θ) ζ0n 3/2 q n . (B.36) We now turn to showing Eq. (3.8). For ej with qj = 0, 1 qj e j E [ f] (q) = 1 qΩ 1 {qΩ = 0} + {vΩ: vΩ 1} 1 {qΩ= 0} qΩ 1 {qΩ = 0} = 1 1 qΩ 1 {j Ω} = θE q2 j + qΩ\{j} 2 So for all j with qj = 0, we have E [ Rf] (q) , 1 = I qq E [ f] (q) , 1 = E [ f] (q) , 1 q2 j + qΩ\{j} 2 q2n + qΩ\{n} 2 q2 j + q2n + qΩ\{j,n} 2 + θ (1 θ) EΩ q2 j + qΩ\{j,n} 2 Published as a conference paper at ICLR 2019 q2 j + q2n + qΩ\{j,n} 2 q2n + qΩ\{j,n} 2 = θ (1 θ) EΩ q2 j + qΩ\{j,n} 2 1 q q2n + qΩ\{j,n} 2 = θ (1 θ) EΩ 2 t + qΩ\{j,n} 2 3/2 dt (B.43) 2 q2 n q2 j 1 2θ (1 θ) q2 n q n 2 1 2θ (1 θ) ζ0 1 + ζ0 q2 n (B.44) 2nθ (1 θ) ζ0 1 + ζ0 (B.45) completing the proof. C PROOFS FOR SECTION 3.2 C.1 COVERING IN THE d E METRIC For any θ (0, 1), define d E,θ (p, q) .= E 1 sign p x = sign q x with x iid BG(θ). (C.1) We stress that this notion always depend on θ, and we will omit the subscript θ when no confusion is expected. This indeed defines a metric on subsets of Sn 1. Lemma C.1. Over any subset of Sn 1 with a consistent support pattern, d E is a valid metric. Proof. Recall that (x, y) .= arccos x, y defines a valid metric on Sn 1.4 In particular, the triangular inequality holds. 
For d E and p, q Sn 1 with the same support pattern, we have d E (p, q) = E 1 sign p x = sign q x (C.2) = EΩEz N(0,I)1 sign p Ωz = sign q Ωz (C.3) = EΩ Ez1 p ΩzqΩz < 0 + Ez1 p Ωz = 0 or q Ωz = 0, not both (C.4) = EΩ Ez1 p ΩzqΩz < 0 (C.5) π EΩ (pΩ, qΩ) , (C.6) where we have adopted the convention that (0, v) .= 0 for any v. It is easy to verify that d E (p, q) = 0 p = q, and d E (p, q) = d E (q, p). To show the triangular inequality, note that for any p, q and r with the same support pattern, pΩ, qΩ, and rΩare either identically zero, or all nonzero. For the former case, (pΩ, qΩ) (pΩ, rΩ) + (qΩ, rΩ) (C.7) holds trivially. For the latter, since ( , ) obeys the triangular inequality uniformly over the sphere, which implies (pΩ, qΩ) (pΩ, rΩ) + (qΩ, rΩ) . (C.9) 4This fact can be proved either directly, see, e.g., page 12 of this online notes: http://www.math. mcgill.ca/drury/notes354.pdf, or by realizing that the angle equal to the geodesic length, which is the Riemmannian distance over the sphere; see, e.g., Riemannian Distance of Chapter 5 of the book O neill (1983). Published as a conference paper at ICLR 2019 So EΩ (pΩ, qΩ) EΩ (pΩ, rΩ) + EΩ (qΩ, rΩ) , (C.10) completing the proof. Lemma C.2 (Vector angle inequality). For n 2, consider u, v Rn so that (u, v) π/2. It holds that Ω ( [n] 2 ) (uΩ, vΩ) . (C.11) Proof. The inequality holds trivially when either of u, v is zero. Suppose they are both nonzero and wlog assume both are normalized, i.e., u = v = 1. Then, sin2 (u, v) = 1 cos2 (u, v) (C.12) = u 2 v 2 u, v 2 (C.13) i,j:j>i (uivj ujvi)2 (Lagrange s identity) (C.14) Ω ( [n] 2 ) uΩ 2 vΩ 2 uΩ, vΩ 2 (C.15) Ω ( [n] 2 ) uΩ 2 vΩ 2 sin2 (uΩ, vΩ) (C.16) Ω ( [n] 2 ) sin2 (uΩ, vΩ) . (C.17) Ω ( [n] 2 ) (uΩ, vΩ) > π/2, the claimed inequality holds trivially, as (u, v) π/2 by our assumption. Suppose P Ω ( [n] 2 ) (uΩ, vΩ) π/2. Then, X Ω ( [n] 2 ) sin2 (uΩ, vΩ) sin2 X Ω ( [n] 2 ) (uΩ, vΩ) (C.18) by recursive application of the following inequality: θ1, θ2 [0, π/2] with θ1 + θ2 π/2, sin2 (θ1 + θ2) = sin2 θ1 + sin2 θ2 + 2 sin θ1 sin θ2 cos (θ1 + θ2) sin2 θ1 + sin2 θ2. (C.19) So we have that when P Ω ( [n] 2 ) (uΩ, vΩ) π/2, sin2 (u, v) sin2 X Ω ( [n] 2 ) (uΩ, vΩ) = (u, v) X Ω ( [n] 2 ) (uΩ, vΩ) , (C.20) as claimed. Lemma C.3 (Covering in maximum length-2 angles). For any η (0, 1/3), there exists a subset Q Sn 1 of size at most (5n log(1/η)/η)2n 1 satisfying the following: for any p Sn 1, there exists some q Q such that (pΩ, qΩ) η for all Ω [n] with |Ω| 2. Proof. Define d2(p, q) = max |Ω| 2 (pΩ, qΩ) , (C.21) our goal is to give an η-covering of Sn 1 in the d2 metric. Step 1 We partition Sn 1 according to the support, the sign pattern, and the ordering of the non-zero elements. For each configuration, we are going to construct a covering with the same configuration of support, sign pattern, and ordering. There are no more than 3n n! such configurations. Note that we only need to construct one such covering for each support size, and for each support size we can ignore the zero entries the angle (pΩ, qΩ) is always zero when p, q have matching support and Ω contains at least one zero index. Therefore, the task reduces to bounding the covering number of An = p Sn 1 : p 0, 0 < p1 p2 . . . pn (C.22) in d2 for all n. Published as a conference paper at ICLR 2019 Step 2 We bound the covering number of An by induction. Suppose that N(An ) 5 log(1/η) η n n 1 = (Cηn )n 1 (C.23) holds for all n n 1. (The base case m = 2 clearly holds.) Let Cn Sn 1 be the correpsonding covering sets. We now construct a covering for An. 
Let R .= 1/η = rk for some r 1 and k to be determined. Consider the set Qr,k = q Sn 1 : qi+1 qi 1, r, r2, . . . , rk 1 for all 1 i n 1 . (C.24) We claim that Qr,k with properly chosen (r, k) gives a covering of An,R = p An : pi+1 pi R, i An. (C.25) Indeed, we can decompose [1, R) into [1, r), [r, r2), . . . , [rk 1, R). Each consecutive ratio pi+1/pi falls in one of these intervals, and we choose q so that qi+1/qi is the left endpoint of this interval. Such a q satisfies q Qr,k and qi+1/qi [1, r) for all i [n 1]. (C.26) By multiplying these bounds, we obtain that for all 1 i < j n, qj/qi [1, rn 1). (C.27) Take r = 1 + η/2n, we have rn 1 = (1 + η/2n)n 1 exp(η/2) 1 + η. Therefore, for all i, j, we have pj/pi qj/qi [1, 1 + η), which further implies that ((pi, pj), (qi, qj)) η by Lemma F.4. Thus we have for all |Ω| 2 that (pΩ, qΩ) η. (The size-1 angles are all zero as we have sign match.) For this choice of r, we have k = log R/log r and thus |Qr,k|= kn 1 = log R n 1 = log(1/η) log(1 + η/(2n)) n 1 4n log(1/η) n 1 .= e Nn, (C.28) and we have N(An,R) e Nn. Step 3 We now construct the covering of An \ An,R. For any p An \ An,R, there exists some i such that pi+1/pi [R, ), which means that the angle of the ray (pi, pi+1) is in between [arctan(R), π/2) = [π/2 η, π/2). As p is sorted, we have that pi l R for all j 1, l 0, (C.29) So if we take q such that qi+1/qi [R, ), q also has the above property, which gives that ((pi l, pi+j), (qi l, qi+j)) π/2 (π/2 η) = η for all j 1, l 0. (C.30) Therefore to obtain the cover in d2, we only need to consider the angles for Ω {1, . . . , i} and Ω {i + 1, . . . , n}, which can be done by taking the product of the covers in Ai and An i. By considering all i {1, . . . , n 1}, we obtain the bound N(An \ An,R) i=1 N(Ai)N(An i). (C.31) Published as a conference paper at ICLR 2019 Step 4 Putting together Step 2 and Step 3 and using the inductive assumption, we get that N(An) N(An,R) + N(An \ An,R) e Nn + i=1 N(Ai)N(An i) (C.32) 4n log(1/η) i=1 (Cηi)i 1(Cη(n i))n i 1 (C.33) n 1 (Cηn)n 1 + (n 1) Cn 2 η nn 2 (C.34) (Cηn)n 1 (Cηn)n 1. (C.35) This shows the case for m = n and completes the induction. Step 5 Considering all configurations of {support, sign pattern, ordering}, we have N(Sn 1) 3n n! N(An) (3n)n 5 log(1/η) η n n 1 5 log(1/η) η n 2n 1 . (C.36) Lemma C.4 (Covering number in the d E metric). Assume n 3. There exists a numerical constant C > 0 such that for any ϵ (0, 1), Sn 1 admits an ϵ-net of size exp(Cn log n ϵ ) w.r.t. d E defined in Eq. (C.1): for any p Sn 1, there exists a q in the net with supp (q) = supp (p) and d E (p, q) ϵ. We say such ϵ nets are admissible for Sn 1 wrt d E. Proof. Let η = ϵ/n2. By Lemma C.3, there exists a subset Q Sn 1 of size at most 5n log(1/η) 2n 1 = 5n3 log(n2/ϵ) 2n 1 exp Cn log n such that for any p Sn 1, there exists q Sn 1 such that supp(p) = supp(q) and (pΩ, qΩ) η for all |Ω| 2. In particular, the |Ω|= 1 case says that sign(p) = sign(q), which implies that (pΩ, qΩ) π/2 Ω {0, 1}n . (C.38) Thus, applying the vector angle inequality (Lemma C.2), for any p Sn 1 and the corresponding q Q, we have (pΩ , qΩ ) 2 |Ω| 2 η |Ω|2η Ωwith 3 |Ω| n. (C.39) Summing up, we get (pΩ, qΩ) max 2, |Ω|2 η n2η = ϵ Ω. (C.40) Therefore d E(p, q) ϵ. Below we establish the Lipschitz property in terms of d E distance. Lemma C.5. Fix θ (0, 1). For any ϵ (0, 1), let Nϵ be an admissible ϵ-net for Sn 1 wrt d E. Let x1, . . . , xm be iid copies of x iid BG(θ) in Rn. 
When m Cϵ 2n, the inequality sup p Sn 1,q Nϵ supp(p)=supp(q), d E(p,q) ϵ R (p, q) .= 1 i=1 1 sign p xi = sign q xi 2ϵ (C.41) holds with probability at least 1 exp cϵ2m . Here C and c are universal constants independent of ϵ. Published as a conference paper at ICLR 2019 Proof. We call any pair of p, q Sn 1 with q Nϵ, supp (p) = supp (q), and d E (p, q) ϵ an admissible pair. Over any admissible pair (p, q), E [R] = d E (p, q). We next bound the deviation R E [R] uniformly over all admissible (p, q) pairs. Observe that the process R is the sample average of m indicator functions. Define the hypothesis class H = x 7 1 sign p x = sign q x : (p, q) is an admissible pair . (C.42) and let dvc(H) be the VC-dimension of H. From concentration results for VC-classes (see, e.g., Eq (3) and Theorem 3.4 of Boucheron et al. (2005)), we have sup (p,q) admissible {R(p, q) E [R] (p, q)} C0 exp( mt2) (C.43) for any t > 0. It remains to bound the VC-dimension dvc(H). First, we have dvc (H) dvc x 7 1 sign p x = sign q x : p, q Sn 1 . (C.44) Observe that each set in the latter hypothesis class can be written as x 7 1 sign p x = sign q x : p, q Sn 1 = x 7 1 p x > 0, q x 0 : p, q Sn 1 x 7 1 p x 0, q x < 0 : p, q Sn 1 x 7 1 p x < 0, q x 0 : p, q Sn 1 x 7 1 p x 0, q x > 0 : p, q Sn 1 . (C.46) the union of intersections of two halfspaces. Thus, letting H0 = x 7 1 x z 0 : z Rn (C.47) be the class of halfspaces, we have H (H0 H0) (H0 H0) (H0 H0) (H0 H0). (C.48) Note that H0 has VC-dimension n + 1. Applying bounds on the VC-dimension of unions and intersections (Theorem 1.1, Van Der Vaart & Wellner (2009)), we get that dvc(H) C1dvc(H0 H0) C2dvc(H0) C3n. (C.49) Plugging this bound into Eq. (C.43), we can set t = ϵ/2 and make m large enough so that C0 C3 p n/m ϵ/2, completing the proof. C.2 POINTWISE CONVERGENCE OF SUB-DIFFERENTIAL Proposition C.6 (Pointwise convergence). For any fixed q Sn 1, P h d H ( f (q) , E [ f] (q)) > Ca p n/m + Cbt/ m i 2 exp t2 t > 0. (C.50) Here Ca, Cb 0 are universal constants. Proof. Recall that d H ( f (q) , E [ f (q)]) = sup u Sn 1 h f(q) (u) h E f(q) (u) = sup u Sn 1 h f(q) (u) Eh f(q) (u) . Write Xu .= h f(q) (u) Eh f(q) (u) and consider the zero-mean random process {Xu} defined on Sn 1. For any u, v Sn 1, we have Xu Xv ψ2 = h f(q) (u) Eh f(q) (u) h f(q) (v) + Eh f(q) (v) ψ2 (C.52) i [m] (h Qi (u) Eh Qi (u) h Qi (v) + Eh Qi (v)) Published as a conference paper at ICLR 2019 i [m] h Qi (u) Eh Qi (u) h Qi (v) + Eh Qi (v) 2 ψ2 i [m] h Qi (u) h Qi (v) 2 ψ2 (centering), (C.55) where we write Qi .= sign q xi xi for all i [m]. Next we estimate h Qi (u) h Qi (v) ψ2. By definition, h Qi (u) h Qi (v) = sup z Qi z, u sup z Qi z , v . (C.56) If h Qi (u) h Qi (v) 0 and let z .= arg maxz Qi z, u , we have h Qi (u) h Qi (v) z , u z , v = z , u v , (C.57) h Qi (u) h Qi (v) ψ2 z , u v ψ2 x i (u v) ψ2 C3 u v , (C.58) where we have used Lemma F.1 to obtain the last upper bound. If h Qi (u) h Qi (v) 0, h Qi (v) h Qi (u) 0 and we can use similar argument to conclude that h Qi (u) h Qi (v) ψ2 C3 u v . (C.59) Xu Xv ψ2 C4 m u v . (C.60) Thus, {Xu} is a centered random process with sub-Gaussian increments with a parameter C4/ m. We can apply Proposition A.2 to conclude that P sup u Sn 1 h f(q) (u) Eh f(q) (u) > C5 p n/m + C6t/ m 2 exp t2 t > 0, which implies the claimed result. C.3 PROOF OF PROPOSITION 3.5 (UNIFORM CONVERGENCE) Throughout the proof, we let c, C denote universal constants that could change from step to step. Fix an ϵ (0, 1/2) to be decided later. 
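Remark (support functions and the Hausdorff metric). The proofs in this section repeatedly pass between the Hausdorff distance of two compact convex sets and the uniform gap of their support functions, d_H(A, B) = sup_{‖u‖=1} |h_A(u) − h_B(u)|; this is what turns set-valued statements about ∂f into scalar empirical-process bounds. The snippet below illustrates the identity on two planar discs, for which the Hausdorff distance has a simple closed form; it is a generic illustration with arbitrary discs and grid size, not a computation with the subdifferential sets of f.

```python
import numpy as np

# Two discs in the plane: B(c1, r1) and B(c2, r2) -- arbitrary illustrative sets.
c1, r1 = np.array([0.3, -0.2]), 0.7
c2, r2 = np.array([-0.5, 0.4]), 1.1

# Support functions h_B(u) = <c, u> + r, evaluated on a dense grid of directions.
t = np.linspace(0.0, 2.0 * np.pi, 4000, endpoint=False)
U = np.column_stack([np.cos(t), np.sin(t)])
gap = np.max(np.abs((U @ c1 + r1) - (U @ c2 + r2)))     # sup_u |h_1(u) - h_2(u)|

# Closed-form Hausdorff distance between two discs: ||c1 - c2|| + |r1 - r2|.
print("support-function gap:", gap)
print("Hausdorff distance  :", np.linalg.norm(c1 - c2) + abs(r1 - r2))
```

The two quantities coincide up to the resolution of the direction grid.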
Let Nϵ be an admissible ϵ net for Sn 1 wrt d E, with |Nϵ| exp(Cn log(n/ϵ)) (Lemma C.4). By Proposition C.6 and the union bound, P [ q Nϵ, d H ( f (q) , E [ f] (q)) > t/3] exp cmt2 + Cn log n provided that m Ct 2n. For any p Sn 1, let q Nϵ satisfy supp (q) = supp (p) and d E (p, q) ϵ. Then we have d H ( f (p) , E [ f] (p)) d H ( f (q) , E [ f] (q)) | {z } I + d H (E [ f] (p) , E [ f] (q)) | {z } II + d H ( f (p) , f (q)) | {z } III (C.63) by the triangular inequality for the Hausdorff metric. By the preceding union bound, term I is bounded by t/3 as long as the bad event does not happen. For term II, we have d H (E [ f] (p) , E [ f] (q)) = sup u Sn 1 h E[ f](p) (u) h E[ f](q) (u) (C.64) Published as a conference paper at ICLR 2019 = sup u Sn 1 E h f(p) (u) h f(q) (u) (C.65) 2 sup u Sn 1 E sup sign p xi x i u sup sign q xi x i u (C.66) 2 sup u Sn 1 E |x i u|1 sign p xi = sign q xi (C.67) where the last line follows from Lemma F.5. As long as ϵ ct/ p log(1/t), the above term is upper bounded by t/3. For term III, we have d H ( f (p) , f (q)) = sup u Sn 1 h f(p) (u) h f(q) (u) (C.69) m sup u Sn 1 i [m]:sign(p xi) =sign(q xi) sup sign p xi x i u sup sign q xi x i u m sup u Sn 1 i [m]:sign(p xi) =sign(q xi) six i u (si {+1, 1} dependent on p, q, xi and u) i [m]:sign(p xi) =sign(q xi) sixi By Lemma C.5, with probability at least 1 exp( cϵ2m), the number of different signs is upper bounded by 2mϵ for all p, q such that d E(p, q) ϵ. On this good event, the above quantity can be upper bounded as follows. Define a set T .= {s Rm : si {+1, 1, 0} , s 0 2mϵ} and consider the quantity sups T Xs , where X = [x1, . . . , xm]. Then, i [m]:sign(p xi) =sign(q xi) sixi sup s T Xs (C.73) uniformly (i.e., indepdent of p, q and u). We have w (T) = E sup s T s g = E sup K [m],|K| 2mϵ i K |gi| (C.74) 2mϵE g 4m p log mϵ, (here g N (0, Im)) (C.75) 2mϵ. (C.76) Noting that 1/ θ X has independent, isotropic, and sub-Gaussian rows with a parameter C/ θ, we apply Proposition A.3 and obtain that log m ϵ + t0 with probability at least 1 2 exp t2 0 . So we have over all admissible (p, q) pairs, d H ( f (p) , f (q)) rπ log m ϵ + t0 Published as a conference paper at ICLR 2019 Setting t0 = ct m and ϵ = ct p θ/log m, we have that d H ( f (p) , f (q)) t provided that m Cϵt 2n = Ct 1n p θ/log m, which is subsumed by the earlier requirement m Ct 2n. Putting together the three bounds Eq. (C.62), Eq. (C.67), Eq. (C.80), we can choose θ log(m/t) ct min θ log m, 1 p and get that d H ( f(p), E [ f] (p)) t with probability at least 1 2 exp cmt2 exp( cmϵ2) exp cmt2 + Cn log n 1 2 exp( cmt2) exp cmθt2 exp cmt2 + Cn log n log(m/t) 1 exp cmθt2 provided that m Cnt 2 log n log(m/t) θt . A sufficient condition is that m Cnt 2 log2(n/t) for sufficiently large C. When this is satisfied, the probability is further lower bounded by 1 exp( cmθt2/log m). C.4 PROOF OF THEOREM 3.6 t = 1 32n3/2 θ(1 θ)ζ0 min ( 1 8n3/2 θ(1 θ) ζ0 1 + ζ0 , 2 2 16n3/2 θ(1 θ)ζ0 By Proposition 3.5, with probability at least 1 exp cmθ3ζ2 0n 3 log 1 m we have d H (E [ f] (q) , f (q)) t, (C.86) provided that m Cn4θ 2ζ 2 0 log (n/ζ0). We now show the properties Eq. (3.12) and Eq. (3.13) on this good event, focusing on S(n+) ζ0 but obtaining the same results for all other 2n 1 subsets by the same arguments. For Eq. (3.12), we have Rf (q) , ej/qj en/qn = f (q) , I qq (ej/qj en/qn) = f (q) , ej/qj en/qn . 
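Remark (checking the Gaussian-width step). The Gaussian-width estimate for the set T = {s ∈ R^m : s_i ∈ {0, ±1}, ‖s‖_0 ≤ 2mϵ} used in bounding term III can be verified directly: sup_{s∈T} ⟨s, g⟩ equals the sum of the ⌊2mϵ⌋ largest |g_i|, and its expectation is at most 2mϵ · E‖g‖_∞ ≤ 2mϵ √(2 log(2m)). The sketch below compares the empirical value with this bound; the values of m, ϵ, and the number of trials are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)          # illustrative seed and sizes, not from the paper
m, eps, trials = 2000, 0.01, 200
k = int(2 * m * eps)                    # sparsity budget of the set T

emp = 0.0
for _ in range(trials):
    g = rng.standard_normal(m)
    # sup over s in T of <s, g> = sum of the k largest |g_i|
    emp += np.sort(np.abs(g))[-k:].sum()
emp /= trials

bound = k * np.sqrt(2.0 * np.log(2 * m))   # k * (upper estimate of E||g||_inf)
print("empirical w(T) ~", emp, "   bound ~", bound)
```

The empirical width should come out comfortably below the stated bound.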
(C.87) sup f (q) , en/qn ej/qj = h f(q) (en/qn ej/qj) (C.88) = Eh f(q) (en/qn ej/qj) Eh f(q) (en/qn ej/qj) + h f(q) (en/qn ej/qj) (C.89) Eh f(q) (en/qn ej/qj) + en/qn ej/qj sup u Sn 1 Eh f(q) (u) h f(q) (u) (C.90) = sup E [ f] (q) , en/qn ej/qj + en/qn ej/qj d H (E [ f] (q) , f (q)) . (C.91) By Theorem 3.4(a), sup E [ f] (q) , en qnq 1 2nθ (1 θ) ζ0 1 + ζ0 . (C.92) Published as a conference paper at ICLR 2019 Moreover, en/qn ej/qj = q 1/q2n + 1/q2 j p 1/q2n + 3/q2n 2 n. Meanwhile, we have d H (E [ f] (q) , f (q)) t 1 8n3/2 θ (1 θ) ζ0 1 + ζ0 . (C.93) We conclude that inf f (q) , ej/qj en/qn = sup f (q) , en/qn ej/qj (C.94) 2nθ (1 θ) ζ0 1 + ζ0 2 n 1 4nθ (1 θ) ζ0 1 + ζ0 1 2 n (C.95) 4nθ (1 θ) ζ0 1 + ζ0 , (C.96) as claimed. For Eq. (3.13), we have by Theorem 3.4(b) that sup f (q) , en qnq = h f(q) (en qnq) (C.97) = Eh f(q) (en qnq) Eh f(q) (en qnq) + h f(q) (en qnq) (C.98) Eh f(q) (en qnq) + en qnq sup u Sn 1 Eh f(q) (u) h f(q) (u) = sup E [ f] (q) , en qnq + q n d H (E [ f] (q) , f (q)) . (C.100) As we are on the good event d H (E [ f] (q) , f (q)) t 2 2 16n3/2 θ (1 θ) ζ0, (C.101) inf f (q) , qnq en = sup f (q) , en qnq (C.102) 8θ (1 θ) ζ0n 3/2 w q n 2 2 16 θ (1 θ) ζ0n 3/2 2 16 θ (1 θ) ζ0n 3/2 q n . (C.104) Noting that q n 1 2 q en for all q with qn 0 completes the proof. C.5 PROOF OF PROPOSITION 3.7 For any q Sn 1, sup f (q) = d H ({0} , f (q)) d H ({0} , E [ f] (q)) + d H ( f (q) , E [ f] (q)) (C.105) by the metric property of the Hausdorff metric. On one hand, we have sup E [ f] (q) = sup EΩ qΩ 1 {qΩ = 0} + {vΩ: vΩ 1} 1 {qΩ= 0} 1. On the other hand, by Proposition 3.5, d H ( f (q) , E [ f] (q)) 1 q Sn 1 (C.107) with probability at least 1 exp c1mθ log 1 m , provided that m C2n2 log n (simplified using θ 1/n). Combining the two results complete the proof. Published as a conference paper at ICLR 2019 C.6 ADDITIONAL GEOMETRIES ON THE EMPIRICAL OBJECTIVE Proposition C.7. On the good event in Proposition 3.7, for all q S(n+) ζ0 , we have f (q) f (en) 2 n q en . (C.108) Proof. We use Lebourg s mean value theorem for locally Lipschitz functions5, i.e., Theorem 2.3.7 of (Clarke, 1990). It is convenient to work in the w space here. By subdifferential chain rules, g (w) is locally Lipschitz over n w : w < q n o . Thus, we have f (q) f (en) = g (w) g (0) = v, w (C.109) for a certain t0 (0, 1) and a certain v g (t0w). Now for any q and the corresponding w, g (w) , w = 1 qn Rf (q) , qnq en . (C.110) t0 sup g (t0w) , t0w = 1 1 qn (t0w) sup Rf (q (t0w)) , qn (t0w) q (t0w) en sup Rf (q (t0w)) qn (t0w) q (t0w) en t0qn (t0w) 2 qn (t0w) q (t0w) en t0qn (t0w) , (C.111) where at the last inequality we have used Proposition 3.7. Continuing the calculation, we further have qn (t0w) q (t0w) en t0qn (t0w) = t0 w t0qn (t0w) n w n q en , (C.112) completing the proof. Proposition C.8. Assume θ [1/n, 1/2]. When m Cθ 2n log n, with probability at least 1 exp cmθ3 log 1 m , the following holds: for all q S(n+) ζ0 satisfying f (q) f (en) 2 25θ, f (q) f (en) 2 16 θ (1 θ) q n 1 16θ (1 θ) q en . (C.113) Here C, c > 0 are universal constants. Proof. We first establish uniform convergence of f (p) to E [f] (p). Consider the zero-centered random process Xp .= f (p) E [f] (p) on Sn 1. Similar to proof of Proposition C.6, we can show that for all p, q Sn 1 Xp Xq ψ2 C m p q . (C.114) Applying Proposition A.2 gives that f (p) E [f] (q) 1 100θ q Sn 1 (C.115) with probability at least 1 exp cmθ2 , provided that m Cθ 2n. Now we consider E [f] (q) E [f] (en). 
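Remark (norm of the subgradients). Proposition 3.7, proved above, asserts that on the good event every element of ∂f(q) has norm at most 2 uniformly over the sphere, and hence so does its projection onto the tangent space at q. The NumPy sketch below evaluates the canonical subgradient element (1/m) Σ_i sign(q⊤x_i) x_i at many random q under the BG(θ) model and reports the largest norm observed; the dimensions, sparsity level, and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)          # illustrative seed and sizes, not from the paper
n, m, theta = 50, 20000, 0.3
X = (rng.random((n, m)) < theta) * rng.standard_normal((n, m))   # columns x_i ~ BG(theta)

worst = 0.0
for _ in range(200):
    q = rng.standard_normal(n); q /= np.linalg.norm(q)
    g = X @ np.sign(X.T @ q) / m     # the canonical element of the subdifferential of f at q
    worst = max(worst, np.linalg.norm(g))
print("largest subgradient norm over random q:", worst, "(Proposition 3.7 bound: 2)")
```

The observed maximum is typically well below 2, consistent with the proposition.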
For convenience, we first work in the w space and note that E [f] (q) E [f] (en) = E [g] (w (q)) E [g] (0). By Lemma B.3, E [g] is monotonically increasing in every radial direction v until w 2 + w 2 1, which implies that inf w 1/2 E [g] (w (q)) E [g] (0) = inf w =1/2 E [g] (w (q)) E [g] (0) . (C.116) 5It is possible to directly apply the manifold version of Lebourg s mean value theorem, i.e., Theorem 3.3 of Hosseini & Pouryayevali (2011). We avoid this technicality by working with the Euclidean version in w space. Published as a conference paper at ICLR 2019 For w with w = 1/2, E [g] (w) E [g] (0) = (1 θ) EΩ wΩ + θEΩ q 1 wΩc 2 θ (Lemma B.1) (C.117) (1 θ) θ w + θEΩ q 1 w 2 θ (C.118) 3 2 θ θ (using θ 1/2 and w = 1/2) (C.119) 10θ. (C.120) So, back to the q space, inf q S(n+) ζ0 : q n 1/2 E [f] (q) E [f] (0) 1 10θ. (C.121) Combining the results in Eq. (C.115) and Eq. (C.121), we conclude that with high probability inf q S(n+) ζ0 : q n 1/2 f (q) f (0) 2 25θ. (C.122) So when f (q) f (0) 2/25 θ, q n 1/2, which is equivalent to w 1/2 in the w space. Under this constraint, by Lemma B.3, Dc w/ w E [g] (w) θ (1 θ) 1/ q 1 + w 2 / w 2 w (C.123) 5θ (1 θ) . (C.124) So, emulating the proof of Eq. (3.9) in Theorem 3.4, we have that for q S(n+) ζ0 with q n 1/2, E [ Rf] (q) , qnq en = qn E [ g] (w) , w qn w 1 3 10 θ (1 θ) w , where at the last inequality we use qn = q 3/2 when w 1/2. Moreover, we emulate the proof of Eq. (3.13) in Theorem 3.6 to obtain that inf Rf (q) , qnq en 2 16 θ (1 θ) q n 1 16θ (1 θ) q en (C.126) with probability at least 1 exp cmθ3 log 1 m , provided that m Cθ 2n log n. The last step of our proof is invoking the mean value theorem, similar to the proof of Proposition C.7. For any q, we have f (q) f (en) = g (w) g (0) = v, w (C.127) for a certain t (0, 1) and a certain v g (tw). We have t0 inf g (t0w) , t0w = 1 1 qn (t0w) inf Rf (q (t0w)) , qn (t0w) q (t0w) en 2 16 θ (1 θ) t0w (C.129) 2 16 θ (1 θ) w (C.130) 16θ (1 θ) q en , (C.131) completing the proof. Published as a conference paper at ICLR 2019 D PROOFS FOR SECTION 3.3 D.1 STAYING IN THE REGION S(n+) ζ0 Lemma D.1 (Progress in S(n+) ζ0 \ S(n+) 1 ). Set η = t0/(100 n) for t0 (0, 1). For any ζ0 (0, 1), on the good events stated in Proposition 3.7 and Theorem 3.6, we have for all q S(n+) ζ0 \ S(n+) 1 and q+ being the next step of Riemannian subgradient descent that q2 +,n q+, n 2 q2 n q n 2 1 + t θ (1 θ) ζ0 400n3/2 (1 + ζ0) In particular, we have q+ S(n+) ζ0 . Proof. We divide the index set [n 1] into three sets I0 .= {j [n 1] : qj = 0} , (D.2) I1 .= j [n 1] : q2 n/q2 j > 1 + 2ζ1 = 3, qj = 0 (D.3) I2 .= j [n 1] : q2 n/q2 j 1 + 2ζ1 = 3 . (D.4) We perform different arguments on different sets. We let g (q) Rf (q) be the subgradient taken at q and note by Proposition 3.7 that g 2, and so |gi| 2 for all i [n]. We have q2 +,n q2 +,j = (qn ηgn)2 / q ηg 2 (qj ηgj)2 / q ηg 2 = (qn ηgn)2 (qj ηgj)2 . (D.5) For any j I0, q2 +,n q2 +,j = (qn ηgn)2 η2g2 j = q2 n (1 ηgn/qn)2 η2g2 j (1 2η n)2 4nη2 . (D.6) Provided that η 1/(4 n), 1 2η n 1/2, and so 4nη2 1 16nη2 5 where the last inequality holds when η 1/ For any j I1, q2 +,n q2 +,j q2 n (1 ηgn/qn)2 q2 j + η2g2 j q2 n (1 ηgn/qn)2 q2n/3 + 4η2 = 3 (1 ηgn/qn)2 1 + 12η2/q2n 3 (1 2η n)2 1 + 12nη2 5 where the very last inequality holds when η 1/(26 n). Since q S(n+) ζ0 \ S(n+) 1 , I2 is nonempty. For any j I2, q2 +,n q2 +,j = q2 n q2 j 1 + η gj/qj gn/qn Since gj/qj 2 3n, 1 ηgj/qj 1/2 when η 1/ 4 3n . 
Conditioned on this and due to that gj/qj gn/qn 0, it follows 1+η gj/qj gn/qn 2 [1+2η (gj/qj gn/qn)]2 h 1+2η 2 3n+2 n i2 1+11η n 2 Published as a conference paper at ICLR 2019 If q2 n/q2 j 2, q2 +,n/q2 +,j 5/2 provided that 1 + 11η n 2 5/2 4 = η 1 100 n. (D.11) As q / S(n+) 1 , we have q2 n/ q n 2 2, so there must be a certain j I2 satisfying q2 n/q2 j 2. We conclude that when 40n, 1 26 n, 1 100 n = 1 100 n, (D.12) the index of largest entries of q+, n remains in I2. On the other hand, when η 1/(100 n), for all j I2, 1 + η gj/qj gn/qn 2 [1 + η (gj/qj gn/qn)]2 1 + η 4nθ (1 θ) ζ0 1 + ζ0 So when η = t/(100 n) for any t (0, 1), q2 +,n q+, n 2 q2 n q n 2 1 + t θ (1 θ) ζ0 400n3/2 (1 + ζ0) completing the proof. Proposition D.2. For any ζ0 (0, 1), on the good events stated in Proposition 3.7 and Theorem 3.6, if the step sizes satisfy η(k) min 1 100 n, 1 ζ0 for all k, (D.15) the iteration sequence will stay in S(n+) ζ0 provided that our initialization q(0) S(n+) ζ0 . Proof. By Lemma D.1, if the current iterate q S(n+) ζ0 \ S(n+) 1 , the next iterate q+ S(n+) ζ0 , provided that η 1/(100 n). Now if the current q S(n+) 1 , i.e., q2 n/q2 j 2 for all j [n 1], we can emulate the analysis of the set I1 in proof of Lemma D.1. Indeed, for any j [n 1], q2 +,n q2 +,j q2 n (1 ηgn/qn)2 q2 j + η2g2 j q2 n (1 2η n)2 q2n/2 + 4η2 2 (1 2η n)2 1 + 8nη2 1 + ζ0, (D.16) where the last inequality holds provided that η (1 ζ0) /(9 n). Combining the two cases finishes the proof. D.2 PROOF OF THEOREM 3.8 As we have η(k) 1 100 n and q(0) S(n+) ζ0 , the entire sequence q(k) k 0 will stay in S(n+) ζ0 by Proposition D.2. For any q and any v Rf (q), we have v, q = 0 and therefore q ηv 2 = q 2 + η2 v 2 1. (D.17) So q ηv is not inside Bn. Since projection onto Bn is a contraction, we have q+ en 2 = q ηv q ηv en 2 q ηv en 2 q en 2+η2 v 2 2η v, q en q en 2+4η2 1 8ηθ (1 θ) n 3/2ζ0 q en , Published as a conference paper at ICLR 2019 where we have used the bounds in Proposition 3.7 and Theorem 3.6 to obtain the last inequality. Further applying Proposition C.7, we have q+ en 2 q en 2 + 4η2 1 16ηθ (1 θ) n 2ζ0 (f (q) f (en)) . (D.19) Summing up the inequalities until step K (assumed 5), we have 0 q(K) en 2 + 4 16θ (1 θ) n 2ζ0 j=0 η(j) f q(j) f (en) (D.20) j=0 η(j) f q(j) f (en) 16 q(K) en 2 + 64 PK j=0 η(j) 2 θ (1 θ) n 2ζ0 (D.21) = f qbest f (en) 16 q(K) en 2 + 64 PK j=0 η(j) 2 θ (1 θ) n 2ζ0 PK j=0 η(j) . (D.22) Substituting the following estimates η(j) 2 1 104n 1 104n 1 1 2α K1 2α + 1 , (D.23) j=0 η(j) 1 102 n 0 t α dt 1 102 n K1 α 1 α , (D.24) and noting 16 q(K) en 2 32, we have f qbest f (en) 3200n5/2 (1 α) + 16/25 n3/2 1 α 1 2αK1 2α + 1 α θ (1 θ) ζ0K1 α . (D.25) Noting that K 6400n5/2 (1 α) θ (1 θ) ζ0ϵ 1 1 α = 3200n5/2 (1 α) θ (1 θ) ζ0K1 α ϵ and when K 1, K1 2α 1, yielding that 1 2α 25ϵθ (1 θ) ζ0 α = 32n3/2 1 α 25θ (1 θ) ζ0 ϵ 2 = 16n3/2 (1 α) 1 1 2αK1 2α + 1 25θ (1 θ) ζ0K1 α ϵ (D.27) So we conclude that when 6400n5/2 (1 α) θ (1 θ) ζ0ϵ 1 2α 25ϵθ (1 θ) ζ0 f qbest f (en) ϵ. When this happens, by Proposition C.8, qbest en 16 θ (1 θ)ϵ. (D.29) Plugging in the choice ζ0 = 1/(5 log n) in Eq. (D.28) gives the desired bound on the number of iterations. E PROOFS FOR SECTION 3.4 E.1 PROOF OF LEMMA 3.9 Lemma E.1. For all n 3 and ζ 0, it holds that vol S(n+) ζ vol (Sn 1) 1 Published as a conference paper at ICLR 2019 We note that a similar result appears in (Gilboa et al., 2018) but our definitions of the region Sζ are slightly different. For completeness we provide a proof in Lemma F.3. We now prove Lemma 3.9. 
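Remark (simulation of Lemma 3.9). Before the formal argument, the statement of Lemma 3.9 — a uniformly random initialization lands in one of the 2n sets S^(i±)_{1/(5 log n)} with probability at least 1/2 — is easy to corroborate by simulation: draw uniform points on S^{n−1} and test whether the largest squared entry exceeds (1 + ζ) times the second largest. The sample sizes and the values of n in the sketch below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)          # illustrative seed and sample sizes, not from the paper
for n in (10, 50, 200):
    zeta = 1.0 / (5.0 * np.log(n))
    Q = rng.standard_normal((20000, n))
    Q /= np.linalg.norm(Q, axis=1, keepdims=True)        # uniform samples on S^{n-1}
    s = np.sort(Q ** 2, axis=1)
    hit = s[:, -1] >= (1.0 + zeta) * s[:, -2]            # q lies in some S^(i,+/-)_zeta
    print(f"n = {n:3d}:  P[q(0) in union of the 2n sets] ~ {hit.mean():.3f}   (bound: 1/2)")
```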
Taking ζ = 1/(5 log n) in Lemma E.1, we obtain vol S(n+) 1/(5 log n) vol (Sn 1) 1 n 1 5 log n 1 By symmetry, all the 2n sets n S(i+) 1/(5 log n), S(i ) 1/(5 log n) : i [n] o have the same volume which is at least 1/(4n). As q(0) Uniform(Sn 1), it falls into their union with probability at least 2n 1/(4n) = 1/2, on which it belongs to a uniformly random one of these 2n sets. E.2 PROOF OF THEOREM 3.10 Assume that the good event in Proposition 3.7 happens and that in Theorem 3.6 happens to all the 2n sets n S(i+) 1/(5 log n), S(i ) 1/(5 log n) : i [n] o , which by setting ζ0 = 1/(5 log n) has probability at least 1 exp( cmθ3ζ2 0n 3 log 1 m) exp( cmθ log 1 m) = 1 exp( c mθ3n 3 log m 3). (E.3) By Lemma 3.9, random initialization will fall these 2n sets with probability at least 1/2. When it falls in one of these 2n sets, by Theorem 3.8, one run of the algorithm will find a signed standard basis vector up to ϵ accuracy. With R independent runs, at least S .= 1 4R of them are effective with probability at least 1 exp (R/4)2/(R/4 2) = 1 exp ( R/8), due to Bernstein s inequality. After these effective runs, the probability any standard basis vector is missed (up to sign) is bounded by n + log n exp S where the second inequality holds whenever S 2n log n. F AUXILIARY CALCULATIONS Lemma F.1. For x BG(θ), x ψ2 Ca. For any vector u Rn and x iid BG(θ), x u ψ2 Cb u . Here Ca, Cb 0 are universal constants. Proof. For any λ R, exp (λx) = θ exp (λx) exp (λx) . (F.1) So x ψ2 is bounded by a universal constant. Moreover, i u2 i xi 2 ψ2 C2 u , (F.2) as claimed. Lemma F.2. Let a1, . . . , am be iid copies of a iid BG(θ). Then, q xi E q x > Ca mn + Cb mt 2 exp t2 . (F.3) for any t 0. Here Ca, Cb 0 are universal constants. Proof. Consider the zero-centered random process defined on Sn 1: Xq .= P i [m] q xi E q x . Then, for any p, q Sn 1, p xi q xi E p x + E q x ψ2 Published as a conference paper at ICLR 2019 p xi q xi E p x + E q x 2 p xi q xi 2 (centering) (F.6) = C3 m p q , (F.8) where we use the estimate in Lemma F.1 to obtain the last inequality. Note that Xq is a mean-zero random process, and we can invoke Proposition A.2 with w(Sn 1) = C4 n and rad Sn 1 = 2 to get the claimed result. Lemma F.3. For all n 3 and ζ 0, it holds that vol S(n+) ζ vol (Sn 1) 1 Proof. We have vol S(n+) ζ vol (Sn 1) = Pq uniform(Sn 1) h q2 n (1 + ζ) q n 2 , qn 0 i (F.10) = Px N(0,In) xn 0, x2 n (1 + ζ) x2 i i = n (F.11) = (2π)n/2 Z xn/ 1+ζ e x2 j/2 dxj = (2π)1/2 Z 0 e x2 n/2ψn 1 xn/ p 1 + ζ dxn (F.13) 0 e (1+ζ)x2/2ψn 1 (x) dx .= h (ζ) > 0, (F.14) where we write ψ(t) .= 1 2π R t t exp s2/2 ds. Now we derive a lower bound of the volume ratio by considering a first-order Taylor expansion of the last equation around ζ = 0 (as we are mostly interested in small ζ). By symmetry, h (0) = 1/(2n). Moreover, we have 0 e x2/2ψn 1 (x) dx 1 0 e x2/2x2ψn 1 (x) dx (F.15) 0 e x2/2x2ψn 1 (x) dx. (F.16) Now we provide an upper bound for the second term of the last equation. Note that 0 e x2/2x2ψn 1 (x) dx = Ex N(0,In) h x2 n1 n x2 n x n 2 o 1 {xn 0} i (F.17) 2n Ex N(0,In) x 2 . (F.18) Now for any λ (0, 1/2), exp λE x 2 E exp λ x 2 j=1 E exp λx2 j = n Ex N(0,1) exp λx2 n Published as a conference paper at ICLR 2019 Taking logarithm on both sides, rearranging the terms, and setting λ = 1/4, we obtain E x 2 inf λ (0,1/2) log n + 1 2 log (1 2λ) 1 λ 4 log n + 2 log 2. (F.20) 4n (4 log n + 2 log 2) 9 provided that n 3. Now we show that h (ζ) h (0) + h (0)ζ by showing that h (ζ) 0. We have 2 x2ψn 1 (x) dx. 
(F.22) Using integration by parts, we have Z 2 x2ψn 1 (x) dx (F.23) = 1 1 + ζ e 1+ζ 2 x2x3 ψn 1 (x) 1 1 + ζ e 1+ζ 2 x2x3 (n 1) ψn 2 (x) 1 1 + ζ e 1+ζ 2 x2x3 (n 1) ψn 2 (x) 2 dx 0, (F.25) and similarly Z 2 x2ψn 1 (x) dx (F.26) = 1 1 + ζ e 1+ζ 2 x2x ψn 1 (x) 1 1 + ζ e 1+ζ 2 x2x (n 1) ψn 2 (x) 1 1 + ζ e 1+ζ 2 x2x (n 1) ψn 2 (x) 2 dx 0. (F.28) Noting that (1 + ζ)2 = x4 3x2 1 + ζ + 1 1 + ζ and combining the above integral results, we conclude that h''(ζ) ≥ 0 and complete the proof.

Lemma F.4. Let (x1, y1), (x2, y2) ∈ R²_{>0} be two points in the first quadrant satisfying y1 ≥ x1, y2 ≥ x2, and (y2/x2)/(y1/x1) ∈ [1, 1 + η] for some η ≤ 1. Then ∠((x1, y1), (x2, y2)) ≤ η.

Proof. For i = 1, 2, let θi be the angle between the ray (xi, yi) and the x-axis. Our assumption implies that θi ∈ [π/4, π/2) and θ2 ≥ θ1, thus ∠((x1, y1), (x2, y2)) = θ2 − θ1, so we have
tan ∠((x1, y1), (x2, y2)) = (tan θ2 − tan θ1)/(1 + tan θ2 tan θ1) = (y2/x2 − y1/x1)/(1 + y2 y1/(x2 x1)) ≤ (y2/x2 − y1/x1) · (x1 x2)/(y1 y2) = x1/y1 − x2/y2 ≤ (y2/x2)/(y1/x1) − 1 ≤ η.
Therefore ∠((x1, y1), (x2, y2)) ≤ arctan(η) ≤ η.

Lemma F.5. For any p, q ∈ S^{n−1} with the same support pattern such that d_E(p, q) ≤ ϵ ≤ 1/2, we have for all u ∈ S^{n−1} that E_{x∼BG(θ)} |u⊤x| 1{sign(p⊤x) ≠ sign(q⊤x)} ≤ 3ϵ

Proof. Fix some threshold t > 0 to be determined. We have
E |u⊤x| 1{sign(p⊤x) ≠ sign(q⊤x)} (F.32)
≤ E |u⊤x| 1{|u⊤x| > t} + E |u⊤x| 1{|u⊤x| ≤ t, sign(p⊤x) ≠ sign(q⊤x)} (F.33)
≤ (E (u⊤x)² · P[|u⊤x| > t])^{1/2} + t · E 1{sign(p⊤x) ≠ sign(q⊤x)} (F.34)
≤ (θ · 2 exp(−t²/2))^{1/2} + ϵ t. (F.35)
The second-to-last inequality uses Cauchy-Schwarz, and the last inequality uses the fact that u⊤x = u_Ω⊤ x_Ω is ‖u_Ω‖²-sub-Gaussian conditioned on Ω and thus 1-sub-Gaussian marginally. Taking t = q ϵ2 , the above bound simplifies to s 2θ exp log 1 2θϵ 1 + log 1 ϵ (F.36) where we have used θ ≤ 1/2 and ϵ ≤ 1/2.

G RESULTS ON ORTHOGONAL DICTIONARIES

Figure 2: Empirical success rates of recovery of the Riemannian subgradient descent with R = 5n log n runs, averaged over 10 instances. Left to right: orthogonal dictionaries with θ = 0.1, 0.3, 0.5. (Each panel plots the fraction of successful recovery against the number of measurements, given as a power of n, for n = 30, 50, 70, 100.)

H FASTER ALTERNATIVE ALGORITHM FOR LARGE-SCALE INSTANCES

The Riemannian subgradient descent is cheap per iteration but slow in overall convergence, similar to many other first-order methods. We also test a faster quasi-Newton-type method, GRANSO,6 which employs BFGS for solving constrained nonsmooth problems based on sequential quadratic optimization (Curtis et al., 2017). For a large dictionary of dimension n = 400 and sample complexity m = 10n² (i.e., 1.6 × 10⁶), GRANSO successfully identifies a basis after 1500 iterations with a CPU time of 4 hours on a two-socket Intel Xeon E5-2640v4 processor (10-core Broadwell, 2.40 GHz); this is approximately 10× faster than the Riemannian subgradient descent method, showing the potential of quasi-Newton-type methods for solving large-scale problems.

I EXPERIMENT WITH IMAGES

To experiment with images, we follow a typical setup for dictionary learning as used in image processing (Mairal et al., 2014). We focus on testing whether complete (i.e., square and invertible) dictionaries are reasonable sparsification bases for real images, rather than on any particular image processing or vision task.

6 Available online: http://www.timmitchell.com/software/GRANSO/.
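Remark (implementation sketch). Complementing the phase-transition experiments of Figure 2, the following is a minimal NumPy sketch — not the code used for the experiments in this paper — of one run of the Riemannian subgradient descent analyzed in Section 3.3: generate Y = AX with A orthogonal and X ∼iid BG(θ), iterate q ← (q − η(k)v)/‖q − η(k)v‖ with v a Riemannian subgradient and η(k) ∝ k^{−α}, and check recovery of a column of A up to sign. The dimensions, seed, and step-size constant (deliberately larger than the 1/(100√n) used in the analysis, so that the demo converges quickly) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)          # illustrative seed and sizes, not from the paper
n, theta = 20, 0.2
m = 10 * n * n

# Synthetic data Y = A X, with A orthogonal and X ~iid BG(theta).
A = np.linalg.qr(rng.standard_normal((n, n)))[0]
X = (rng.random((n, m)) < theta) * rng.standard_normal((n, m))
Y = A @ X

f = lambda q: np.abs(Y.T @ q).mean()     # objective of formulation (1.1)

def riemannian_subgradient_descent(K=3000, alpha=0.8, c=1.0):
    q = rng.standard_normal(n); q /= np.linalg.norm(q)   # uniformly random initialization
    q_best, f_best = q, f(q)
    for k in range(1, K + 1):
        g = Y @ np.sign(Y.T @ q) / m                     # Euclidean subgradient
        v = g - (q @ g) * q                              # Riemannian subgradient
        q = q - (c / np.sqrt(n)) * k ** (-alpha) * v     # diminishing step eta(k)
        q /= np.linalg.norm(q)                           # back to the sphere
        fq = f(q)
        if fq < f_best:
            q_best, f_best = q, fq
    return q_best

q_hat = riemannian_subgradient_descent()
print("largest |<a_i, q_hat>|:", np.abs(A.T @ q_hat).max())  # near 1: a column recovered up to sign
```

For these sizes the reported correlation is typically close to 1; a full recovery experiment in the style of Figure 2 would repeat such runs O(n log n) times from fresh random initializations and deduplicate the outputs up to sign.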
Figure 3: Results on two images. First row: the images; second row: learned dictionaries; third row: histograms of the representation coefficients; fourth row: zoomed-in versions of the histograms around zero.

Setup. Two natural images are picked for this experiment, as shown in the first row of Fig. 3, each of resolution 512 × 512. Each image is divided into 8 × 8 non-overlapping blocks, resulting in 64 × 64 = 4096 blocks. The blocks are then vectorized and stacked columnwise into a data matrix Y ∈ R^{64×4096}. We precondition the data to obtain
Ȳ = (Y Y⊤)^{−1/2} Y, (I.1)
so that the nonvanishing singular values of Ȳ are identically one. We then solve formulation (1.1) round(5n log n) times with n = 64 using the BFGS solver based on GRANSO, obtaining round(5n log n) vectors. Negative equivalent copies are pruned, and vectors with large correlations with other remaining vectors are sequentially removed until only 64 vectors are left. This forms the final complete dictionary.

Results. The learned complete dictionaries for the two test images are displayed in the second row of Fig. 3. Visually, the dictionaries appear reasonably adaptive to the image contents: for the left image with prevalent sharp edges, the learned dictionary consists almost exclusively of oriented sharp corners and edges, while for the right image with blurred textures and occasional sharp features, the learned dictionary appears to be composed of both kinds of elements. Let the learned dictionary be A. We estimate the representation coefficients as A^{−1}Ȳ. The third row of Fig. 3 contains the histograms of the coefficients. For both images, the coefficients are sharply concentrated around zero (see also the fourth row for zoomed versions of the portions around zero), and the distribution resembles a zero-centered Laplace distribution, which is a good indication of sparsity. Quantitatively, we calculate the mean sparsity level of the coefficient vectors (i.e., columns of A^{−1}Ȳ) by the ratio ‖·‖1/‖·‖2: for a vector v ∈ R^n, ‖v‖1/‖v‖2 ranges from 1 (when v is one-sparse) to √n (when v is fully dense with elements of equal magnitudes), which serves as a good measure of the sparsity level of v. For our two images, the sparsity levels by this norm-ratio metric are 5.9135 and 6.4339, respectively, while the fully dense extreme would have a value of √64 = 8, suggesting that the learned complete dictionaries are reasonable sparsification bases for the two natural images.
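Remark (reproducing the preprocessing and the sparsity metric). The preconditioning step (I.1) and the ‖·‖1/‖·‖2 sparsity metric can be reproduced with a few lines of NumPy. The sketch below uses a random matrix as a stand-in for the 64 × 4096 matrix of vectorized 8 × 8 image blocks and a random orthogonal matrix as a stand-in for the learned dictionary, so the resulting coefficients are dense and the norm ratio lands near the dense end of the scale; with real patches and a learned dictionary, the ratio should drop well below the dense extreme √64 = 8, as reported above.

```python
import numpy as np

rng = np.random.default_rng(10)         # illustrative seed, not from the paper

# Stand-in for the 64 x 4096 matrix of vectorized 8 x 8 image blocks (real data: image patches).
Y = rng.standard_normal((64, 4096))

# Preconditioning (I.1): Ybar = (Y Y^T)^(-1/2) Y.
U, S, _ = np.linalg.svd(Y @ Y.T)
Ybar = U @ np.diag(1.0 / np.sqrt(S)) @ U.T @ Y
print("leading singular values of Ybar:", np.round(np.linalg.svd(Ybar, compute_uv=False)[:5], 6))

# Mean l1/l2 sparsity metric of coefficient columns; ranges from 1 (1-sparse) to sqrt(n) (dense).
def mean_l1_l2(C):
    return float((np.abs(C).sum(axis=0) / np.linalg.norm(C, axis=0)).mean())

A_hat = np.linalg.qr(rng.standard_normal((64, 64)))[0]   # stand-in for a learned dictionary
coeffs = np.linalg.inv(A_hat) @ Ybar
print("mean ||.||_1 / ||.||_2:", mean_l1_l2(coeffs), "  (dense extreme:", np.sqrt(64.0), ")")
```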