Learning Sparse Distributions using Iterative Hard Thresholding

Jacky Y. Zhang, Department of Computer Science, University of Illinois at Urbana-Champaign, yiboz@illinois.edu
Rajiv Khanna, Department of Statistics, University of California at Berkeley, rajivak@berkeley.edu
Anastasios Kyrillidis, Department of Computer Science, Rice University
Oluwasanmi Koyejo, Department of Computer Science, University of Illinois at Urbana-Champaign, sanmi@illinois.edu

Abstract

Iterative hard thresholding (IHT) is a projected gradient descent algorithm, known to achieve state-of-the-art performance for a wide range of structured estimation problems, such as sparse inference. In this work, we consider IHT as a solution to the problem of learning sparse discrete distributions. We study the hardness of using IHT on the space of measures. As a practical alternative, we propose a greedy approximate projection which simultaneously captures appropriate notions of sparsity in distributions while satisfying the simplex constraint, and we investigate the convergence behavior of the resulting procedure in various settings. Our results show, both in theory and in practice, that IHT can achieve state-of-the-art results for learning sparse distributions.

1 Introduction

Probabilistic models provide a flexible approach for capturing uncertainty in real-world processes, with a variety of applications that include latent variable models and density estimation, among others. Like other machine learning tools, probabilistic models can be enhanced by encouraging parsimony, as this captures useful inductive biases. In practice, parsimony often improves the interpretability and generalization performance of the resulting models, and is particularly useful in applied settings with limited samples compared to the model degrees of freedom. One of the most effective parsimonious assumptions is sparsity. As such, learning sparse distributions is a problem of broad interest in machine learning, with many applications [1-7].

The majority of approaches for sparse probabilistic modeling have focused on the construction of appropriate priors based on inputs from domain experts. The technical challenges there involve prior design and inference [3, 1, 8], including methods that are additionally designed to exploit special structure [5, 4, 7]. More recently, there has been interest in studying these algorithmic approaches from an optimization perspective [9-11], with the goal of a deeper understanding and, in some cases, even suggesting improvements over previous methods [12, 13].

In this work, we consider an optimization-based approach to learning sparse discrete distributions. Despite wide applicability, when compared to classical constrained optimization, there are limited studies that focus on understanding, both in theory and in practice, optimization methods over the space of probability densities under sparsity constraints. Our present work proposes and investigates the use of Iterative Hard Thresholding (IHT [14-18]) for the problem of sparse probabilistic estimation. IHT is an iterative algorithm that is well studied in the classical optimization literature. Further, there are known worst-case convergence guarantees and empirical studies [19, 20] that vouch for its performance.
Our goal in this work is to investigate the convergence properties of IHT when applied to probability densities, and to evaluate its efficacy for learning sparse distributions. However, transferring this algorithm from vector and matrix spaces to the space of measures is not straightforward. While several of the technical pieces, such as the existence of a variational derivative and a normed structure, fall into place, the algorithm is an iterative one that involves solving a projection subproblem in each iteration. We show that this subproblem is computationally hard in general, but provide an approximate procedure that we analyze under certain assumptions.

Our contributions in this work are algorithmic and theoretical, with proof-of-concept empirical evaluation. We briefly summarize our contributions below.

- We propose the use of classical IHT for learning sparse distributions, and show that the space of measures meets the structural requirements for IHT.
- We study in depth the hardness of the projection subproblem, showing that it is NP-hard and that no polynomial-time algorithm exists that can solve it with approximation guarantees.
- Since the projection problem is solved in every iteration, we propose a simple greedy algorithm and provide sufficient theoretical conditions under which the algorithm provably approximates the otherwise hard projection problem.
- We draw on techniques from classical optimization to provide convergence rates for the overall IHT algorithm; i.e., we study after how many iterations the algorithm is guaranteed to be within some small $\epsilon$ of the true optimum.
- In addition to our conceptual and theoretical results, we present empirical studies that support our claims.

2 Problem statement

Preliminaries. We use bold characters to denote vectors. Given a vector $\mathbf{v}$, we use $v_i$ to represent its $i$-th entry. We use calligraphic upper-case letters to denote sets, e.g., $\mathcal{S}$. With a slight abuse of notation, we will use lower-case letters to denote probability distributions, e.g., $p, q$, as well as functions, e.g., $f$. The distinction from scalars will be apparent from the context; we usually append functions with parentheses to distinguish them from scalars. We use upper-case letters to denote functionals, i.e., functions that take other functions as input, e.g., $F[p(\cdot)]$. We use $[n]$ to denote the set $\{1, 2, \dots, n\}$. Given a set of indices $\mathcal{S} \subseteq [n]$, we denote the cardinality of $\mathcal{S}$ as $|\mathcal{S}|$. Given a vector $\mathbf{x}$, we denote its support set, i.e., the set of non-zero entries, as $\mathrm{supp}(\mathbf{x})$. We use $\mathbb{P}\{e\}$ to denote the probability of event $e$.

Let $\mathcal{P}$ denote the set of discrete $n$-dimensional probability densities on an $n$-dimensional domain $\mathcal{X}$:
$$\mathcal{P} = \Big\{ p(\cdot) : \mathcal{X} \to \mathbb{R}_+ \;\Big|\; \textstyle\sum_{\mathbf{x} \in \mathcal{X}} p(\mathbf{x}) = 1 \Big\}.$$
Let $\mathcal{S} \subseteq [n]$ denote a support set where $|\mathcal{S}| = k < n$. Let $\mathcal{X}_\mathcal{S} \subseteq \mathcal{X}$ denote the set of variables with support $\mathcal{S}$, i.e., $\mathcal{X}_\mathcal{S} = \{\mathbf{x} \in \mathcal{X} \mid \mathrm{supp}(\mathbf{x}) \subseteq \mathcal{S}\}$. The set of domain-restricted densities, denoted by $\mathcal{P}_\mathcal{S}$, is the set of probability density functions supported on $\mathcal{X}_\mathcal{S}$, i.e., $\mathcal{P}_\mathcal{S} = \{q(\cdot) \in \mathcal{P} \mid \forall \mathbf{x} \notin \mathcal{X}_\mathcal{S},\; q(\mathbf{x}) = 0\}$. Inversely, we denote the support of a domain-restricted density $q(\cdot) \in \mathcal{P}_\mathcal{S}$ as $\mathrm{supp}(q) = \mathcal{S}$. Next, we define the notion of sparse distributions.

Definition 1 (Distribution Sparsity [5]). Let $\mathcal{D}_k = \bigcup_{|\mathcal{S}| \le k} \mathcal{P}_\mathcal{S} \subseteq \mathcal{P}$, i.e., the union of all possible $k$-sparse support domain-restricted densities. We say that $p(\cdot)$ is $k$-sparse if $p(\cdot) \in \mathcal{D}_k$.
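To make Definition 1 concrete, the following is a minimal illustrative sketch (not code from the paper), assuming the integer-lattice domain introduced later. It stores a discrete distribution as a dictionary of atoms and tests membership in $\mathcal{D}_k$ by collecting the union of coordinate supports of all atoms with positive mass.

```python
def support_of_distribution(p, tol=0.0):
    """Union of coordinate supports of all atoms with positive mass.

    p: dict mapping length-n tuples (points of the domain X) to probabilities.
    Returns the smallest S such that every x with p(x) > 0 has supp(x) inside S.
    """
    S = set()
    for x, mass in p.items():
        if mass > tol:
            S |= {i for i, xi in enumerate(x) if xi != 0}
    return S

def is_k_sparse(p, k):
    """Definition 1: p is k-sparse iff all atoms fit inside some support of size <= k."""
    return len(support_of_distribution(p)) <= k

# Example on the binary domain X = {0, 1}^4:
p = {(1, 0, 1, 0): 0.5, (0, 0, 1, 0): 0.5}   # atoms only touch coordinates {0, 2}
q = {(1, 1, 1, 0): 0.5, (0, 0, 0, 1): 0.5}   # atoms touch coordinates {0, 1, 2, 3}
print(is_k_sparse(p, k=2))  # True
print(is_k_sparse(q, k=2))  # False
```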
Note that while each component $\mathcal{P}_\mathcal{S}$ is a convex set, the union $\mathcal{D}_k$ is not. To see this, consider the convex combination of two $k$-sparse distributions $p_1$ and $p_2$ with disjoint supports $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively. In general, the convex combination $\lambda p_1(\cdot) + (1-\lambda) p_2(\cdot)$, $0 < \lambda < 1$, has larger support; i.e., $|\mathcal{S}_1 \cup \mathcal{S}_2| > k$.

As an aside, we note that, unlike the vector case, it is straightforward to construct multiple definitions of distribution sparsity. For instance, another reasonable definition is via the set $\mathcal{D}'_k = \{p(\cdot) \in \mathcal{P} \mid p(\mathbf{x}) = 0 \text{ for all } \|\mathbf{x}\|_0 > k\}$; i.e., distributions that assign zero probability mass to non-$k$-sparse vectors. Interestingly, $\mathcal{D}_k \subseteq \mathcal{D}'_k \subseteq \mathcal{P}$ in general, as any distribution in $\mathcal{D}_k$ must have a support of size at most $k$, which is not necessary for distributions in $\mathcal{D}'_k$. Motivated by prior work [5], we use Definition 1 in this work.

Vector sparsity. While the proposed framework is developed for a specialized notion of sparsity, i.e., along the dimensions of a multivariate discrete distribution, it is also applicable to alternative notions of distribution sparsity. One common setting is sparsity of the distribution itself, $p(\cdot)$, when represented as a vector, e.g., sparsifying the number of valid states of a univariate distribution such as a histogram. We outline how our framework can be applied to this setting in Appendix A.

Problem setting. In this work, we focus on studying sparsity for the case of discrete densities. In particular, $\mathcal{X} \subseteq \mathbb{Z}^n$; i.e., $\mathbf{x}$ is an integer vector such that
$$\mathcal{X} = \{\mathbf{x} \in \mathbb{Z}^n \mid \forall i \in [n],\; 0 \le x_i \le m-1\},$$
where $m$ is an integer. Therefore, $\mathbf{x}$ has $m^n$ valid positions. In other words, if we denote by $X$ a random variable from that distribution, then $X \in \mathcal{X}$ has $m^n$ possible values, and $\mathbb{P}\{X = \mathbf{x}\} = p(\mathbf{x})$.

Given a cost functional over distributions $F[\cdot] : \mathcal{P} \to \mathbb{R}$, we are interested in the following optimization criterion:
$$\min_{q} \; F[q] \quad \text{subject to} \quad q \in \mathcal{D}_k, \qquad (1)$$
where $\mathcal{D}_k = \bigcup_{\mathcal{S} : |\mathcal{S}| \le k} \mathcal{P}_\mathcal{S} \subseteq \mathcal{P}$ is the $k$-sparsity constraint, as in Definition 1. In words, we are interested in finding a distribution, denoted $q(\cdot)$, that lies in the $k$-sparse set of distributions and minimizes the cost functional $F[\cdot]$. This is similar to classical sparse optimization problems in the literature [21-24], but there are fundamental difficulties, both in theory and in practice, that require a different approach than standard iterative hard thresholding algorithms [14-18]. We assume that the objective $F[\cdot]$ is a convex functional over distributions.

Definition 2 (Convexity of $F[\cdot]$). The functional $F[\cdot] : \mathcal{P} \to \mathbb{R}$ is convex if:
$$F[\lambda q(\cdot) + (1-\lambda) p(\cdot)] \le \lambda F[q(\cdot)] + (1-\lambda) F[p(\cdot)],$$
for all $q(\cdot), p(\cdot) \in \mathcal{P}$ and $\lambda \in [0, 1]$.

Observe that, while $F[\cdot]$ is a convex functional, and $\mathcal{P}$ and $\mathcal{P}_\mathcal{S}$ are convex sets, $\mathcal{D}_k$ is not a convex set. Hence, the optimization problem (1) is not a convex program. Following the projected gradient descent approach, we require definitions of the gradient of $F[\cdot]$, as well as a definition of the projection.

Definition 3 (Variational Derivative [25]). The variational derivative of $F[\cdot] : \mathcal{P} \to \mathbb{R}$ is a function, denoted $\frac{\delta F}{\delta q}(\cdot) : \mathcal{X} \to \mathbb{R}$, that satisfies:
$$\int \frac{\delta F}{\delta q}(\mathbf{x}) \, \phi(\mathbf{x}) \, d\mathbf{x} = \frac{\partial F[q + \epsilon \phi]}{\partial \epsilon} \bigg|_{\epsilon = 0},$$
where $\phi : \mathcal{X} \to \mathbb{R}$ is an arbitrary function.

Definition 4 (First-order Convexity). The functional $F[\cdot] : \mathcal{P} \to \mathbb{R}$ is convex if:
$$F[q(\cdot)] \ge F[p(\cdot)] + \Big\langle \frac{\delta F}{\delta p}(\cdot),\; q(\cdot) - p(\cdot) \Big\rangle$$
for all $q(\cdot), p(\cdot) \in \mathcal{P}$. Here, we use the standard inner product for two densities: $\langle q(\cdot), p(\cdot) \rangle = \int_{\mathbf{x}} q(\mathbf{x}) p(\mathbf{x}) \, d\mathbf{x}$, or $\langle q(\cdot), p(\cdot) \rangle = \sum_{\mathbf{x}} q(\mathbf{x}) p(\mathbf{x})$ in the discrete setting.
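As an illustration of Definition 3 in the discrete setting (a hedged sketch, not code from the paper), the variational derivative of the squared 2-norm objective $F[q] = \|q(\cdot) - p(\cdot)\|_2^2$ is $2(q(\mathbf{x}) - p(\mathbf{x}))$. The snippet below checks this closed form against the directional derivative via finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states = 32                      # |X|, e.g. the m^n states flattened into one axis

# A target density p and a current density q on the discrete domain.
p = rng.random(num_states); p /= p.sum()
q = rng.random(num_states); q /= q.sum()

def F(q):
    """Squared 2-norm objective F[q] = ||q - p||_2^2."""
    return np.sum((q - p) ** 2)

def variational_derivative(q):
    """Closed form of dF/dq for the squared 2-norm objective: 2 (q(x) - p(x))."""
    return 2.0 * (q - p)

# Definition 3: sum_x dF/dq(x) phi(x) should equal d/de F[q + e*phi] at e = 0.
phi = rng.standard_normal(num_states)
eps = 1e-6
lhs = np.dot(variational_derivative(q), phi)
rhs = (F(q + eps * phi) - F(q - eps * phi)) / (2 * eps)
print(abs(lhs - rhs))  # agreement up to numerical error
```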
3 Algorithms

Recall that our goal is to solve the optimization problem (1). A natural way to solve it in an iterative fashion is using projected gradient descent, where the projection step is onto the set of sparse distributions $\mathcal{D}_k$. This analogy makes the connection to iterative hard thresholding (IHT) algorithms, where the iterative recursion is:
$$p_{t+1}(\cdot) = \Pi_{\mathcal{D}_k}\Big( p_t(\cdot) - \mu \tfrac{\delta F}{\delta p_t}(\cdot) \Big),$$
where $p_t(\cdot)$ denotes the current iterate, and $\Pi_{\mathcal{D}_k}(\cdot)$ denotes, in an abstract sense, the projection of the distribution function onto the set of sparse distribution functions.

Algorithm 1 Distribution IHT
1: Input: $F[\cdot] : \mathcal{P} \to \mathbb{R}$, $k \in \mathbb{Z}_+$, number of iterations $T$, $p_0(\cdot) \in \mathcal{D}_k$, step size $\mu$. Output: $p_T \in \mathcal{D}_k$
2: $t \leftarrow 0$
3: while $t < T$ do
4:   $q_{t+1}(\cdot) = p_t(\cdot) - \mu \frac{\delta F}{\delta p_t}(\cdot)$
5:   $p_{t+1}(\cdot) = \Pi_{\mathcal{D}_k}(q_{t+1})$
6: end while
7: return $p_T(\cdot)$

The subsequent steps are analogous to those of regular IHT: given an initialization point, we iteratively i) compute the gradient, ii) perform the gradient step with step size $\mu$, and iii) ensure the computed approximate solution satisfies our constraint in each iteration by projecting onto $\mathcal{D}_k$.

3.1 Projection onto $\mathcal{D}_k$

Consider the projection step with respect to the 2-norm, i.e.,
$$\Pi_{\mathcal{D}_k}(p(\cdot)) := \arg\min_{q(\cdot) \in \mathcal{D}_k} \|q(\cdot) - p(\cdot)\|_2, \qquad (2)$$
where the 2-norm is defined by the aforementioned inner product $\langle q(\cdot), p(\cdot) \rangle = \sum_{\mathbf{x}} q(\mathbf{x}) p(\mathbf{x})$. (Figure 1 illustrates the projection onto $\mathcal{D}_k$, with $q = \Pi_{\mathcal{D}_k}(p)$.) The set $\mathcal{D}_k = \bigcup_{|\mathcal{S}| \le k} \mathcal{P}_\mathcal{S}$ is a union of $\binom{n}{k} = O(n^k)$ sparse sets $\mathcal{P}_\mathcal{S}$ with different supports. Thus, if we denote by $T_{\mathrm{proj}}$ the time to compute $\Pi_{\mathcal{P}_\mathcal{S}}(p(\cdot))$, then we need $O(n^k \, T_{\mathrm{proj}})$ time for the $\mathcal{D}_k$ projection using naive enumeration. One may reasonably conjecture the existence of more efficient implementations of the exact projection in (2), e.g., in polynomial time. In the following, we show that this is not the case.

3.2 On the tractability of sparse distribution 2-norm projection

The projection (2) is solved repeatedly within IHT (step 5 in Algorithm 1). Thus, for the algorithm to be practical, it is important to study the tractability of the projection step. The combinatorial nature of $\mathcal{D}_k$ hints that this might not be the case.

Theorem 1. The sparse distribution 2-norm projection problem (2) is NP-hard.

Sketch of proof: We show that the subset selection problem [26] can be reduced to the 2-norm projection problem. The complete proof is provided in the supplementary material.

As an alternative route, NP-hard problems can often be tackled sufficiently well by using approximate methods. However, the following theorem states that the sparsity-constrained optimization problem in (2) is hard even to approximate, in the sense that no deterministic approximation algorithm exists that solves it in polynomial time.

Theorem 2. There exists no deterministic algorithm that can provide a constant-factor approximation for the sparse distribution 2-norm projection problem in polynomial time. Formally, for a given $q : \mathcal{X} \to \mathbb{R}$ with $\mathcal{X} \subseteq \mathbb{R}^n$, let $p^\star(\cdot)$ be the optimal 2-norm projection onto $\mathcal{D}_k$, and let $\hat{p}(\cdot)$ be the solution found by any algorithm that operates in $O(\mathrm{poly}(n))$ time. Then, we can design problem instances where the approximation ratio
$$\varphi = \frac{\|q(\cdot) - \hat{p}(\cdot)\|_2^2}{\|q(\cdot) - p^\star(\cdot)\|_2^2}$$
cannot be bounded. The proof of the theorem is provided in the supplementary material.

Through Theorems 1 and 2, we have shown that the sparse distribution 2-norm projection problem is hard, and thus the applicability of IHT on the space of densities does not seem well-established in practice. This may be surprising, in light of results in a variety of domains where IHT is known to be effective. For example, in the case of vectors, a simple $O(n)$ selection algorithm solves the projection problem optimally [27]. Similarly, on the space of matrices for low-rank IHT, the projection onto the top-$k$ ranks is optimally solved by an SVD [28].
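Before turning to the greedy projection, here is a minimal sketch of the outer loop of Algorithm 1 for a distribution stored as a flat array over the discrete domain. It assumes the caller supplies the variational derivative grad_F and a projection routine project_Dk (e.g., the greedy projection of Section 3.3); the names and signatures are illustrative, not from the paper.

```python
import numpy as np

def distribution_iht(grad_F, project_Dk, p0, k, step_size, num_iters):
    """Sketch of Algorithm 1 (Distribution IHT).

    grad_F(p)       : returns the variational derivative dF/dp as an array over X.
    project_Dk(q, k): returns an (approximate) 2-norm projection of q onto D_k,
                      i.e. a k-sparse distribution closest to q.
    p0              : initial k-sparse distribution (array over X, sums to 1).
    """
    p = p0.copy()
    for _ in range(num_iters):
        q = p - step_size * grad_F(p)   # gradient step in function space
        p = project_Dk(q, k)            # hard-thresholding step onto D_k
    return p
```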
3.3 A greedy approximation

Algorithm 2 Greedy Sparse Projection (GSProj)
1: Input: $n$-dimensional function $q : \mathcal{X} \to \mathbb{R}$ and sparsity level $k$.
2: Output: A distribution $p(\cdot) \in \mathcal{D}_k$
3: $\mathcal{S} := \emptyset$
4: while $|\mathcal{S}| < k$ do
5:   $j \in \arg\min_{i \in [n] \setminus \mathcal{S}} \min_{p \in \mathcal{P}_{\mathcal{S} \cup i}} \|p(\cdot) - q(\cdot)\|_2$
6:   $\mathcal{S} := \mathcal{S} \cup j$
7: end while
8: return $\arg\min_{p \in \mathcal{P}_\mathcal{S}} \|p(\cdot) - q(\cdot)\|_2$

In contrast to the results of Theorems 1 and 2, we have observed that a simple greedy support selection seems effective in practice. Thus, we simply consider replacing the exact projection onto $\mathcal{D}_k$ by greedy selection.

Consider Algorithm 2 when the input is not necessarily a distribution, i.e., $\sum_{\mathbf{x} \in \mathcal{X}} q(\mathbf{x}) \ne 1$. The key procedure of the projection is line 5, where the inner $\min$ is the projection of $q(\cdot)$ onto a set of domain-restricted densities. Let $\hat{p}(\cdot)$ denote this projection, i.e., $\hat{p}(\cdot) = \arg\min_{p(\cdot) \in \mathcal{P}_\mathcal{S}} \|p(\cdot) - q(\cdot)\|_2^2$. Since, by definition, $\hat{p}(\mathbf{x}) = 0$ for any $\mathbf{x} \notin \mathcal{X}_\mathcal{S}$, we only need to calculate $\hat{p}(\mathbf{x})$ for $\mathbf{x} \in \mathcal{X}_\mathcal{S}$, and this can be reformulated as:
$$\min_{p} \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} (p(\mathbf{x}) - q(\mathbf{x}))^2 \quad \text{s.t.} \quad \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} p(\mathbf{x}) = 1 \text{ and } \forall \mathbf{x} \in \mathcal{X}_\mathcal{S},\; p(\mathbf{x}) \ge 0,$$
which is essentially a 2-norm projection onto the simplex $\{p(\mathbf{x}) \mid \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} p(\mathbf{x}) = 1,\; \forall \mathbf{x} \in \mathcal{X}_\mathcal{S},\; p(\mathbf{x}) \ge 0\}$. This 2-norm projection onto the simplex can be solved efficiently and easily (see [29]).

When $p(\cdot)$ is a distribution, we can analytically compute its projection onto any support-restricted domain. Given support $\mathcal{S}$, the exact projection of a distribution $p(\cdot)$ onto $\mathcal{P}_\mathcal{S}$ is:
$$\arg\min_{q \in \mathcal{P}_\mathcal{S}} \|q(\cdot) - p(\cdot)\|_2^2. \qquad (3)$$
In our setting, the above problem can be written as
$$\arg\min_{q \in \mathcal{P}_\mathcal{S}} \|q(\cdot) - p(\cdot)\|_2^2 = \arg\min_{q \in \mathcal{P}_\mathcal{S}} \langle q(\cdot) - p(\cdot),\, q(\cdot) - p(\cdot) \rangle = \arg\min_{q \in \mathcal{P}_\mathcal{S}} \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} (q(\mathbf{x}) - p(\mathbf{x}))^2 + \sum_{\mathbf{x} \notin \mathcal{X}_\mathcal{S}} (q(\mathbf{x}) - p(\mathbf{x}))^2.$$
The last equality is due to the definitions of $\mathcal{P}_\mathcal{S}$ and $\mathcal{X}_\mathcal{S}$. Since $p(\cdot)$ is constant, and since $q \in \mathcal{P}_\mathcal{S}$ implies $q(\mathbf{x}) = 0$ for every $\mathbf{x} \notin \mathcal{X}_\mathcal{S}$, we can eliminate the last term. The resulting problem is:
$$\min_{q} \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} (q(\mathbf{x}) - p(\mathbf{x}))^2 \quad \text{s.t.} \quad \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} q(\mathbf{x}) = 1. \qquad (4)$$
Let $\sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} p(\mathbf{x}) = C \le 1$. Applying the quadratic mean-arithmetic mean inequality to equation (4), we have:
$$\sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} (q(\mathbf{x}) - p(\mathbf{x}))^2 \ge (1 - C)^2 / |\mathcal{X}_\mathcal{S}|.$$
Equality is achieved when $q(\mathbf{x}) - p(\mathbf{x})$ is the same for every $\mathbf{x} \in \mathcal{X}_\mathcal{S}$. Therefore, the optimal solution to problem (3) is:
$$q(\mathbf{x}) = \begin{cases} p(\mathbf{x}) + \frac{1 - C}{|\mathcal{X}_\mathcal{S}|}, & \mathbf{x} \in \mathcal{X}_\mathcal{S}, \\ 0, & \mathbf{x} \notin \mathcal{X}_\mathcal{S}. \end{cases}$$

Computational complexity. The time needed to solve problem (3) is $O(|\mathcal{X}_\mathcal{S}|)$, i.e., the time to compute $C$. However, to compute the norm $\|q(\cdot) - p(\cdot)\|_2^2$ we still need $O(|\mathcal{X}|)$ time, as $p(\mathbf{x})$ is not necessarily zero at any $\mathbf{x} \in \mathcal{X}$. As a result, we need $O(n^k (|\mathcal{X}| + |\mathcal{X}_\mathcal{S}|))$ time to enumerate for an optimal solution of the 2-norm projection. If we consider the integer lattice $\mathcal{X}$, as stated in the problem setting, then $|\mathcal{X}| = m^n$ and $|\mathcal{X}_\mathcal{S}| = m^k$, rendering the time complexity $O(n^k m^n)$. Algorithm 2, however, has much lower time complexity. In each of its $k$ iterations, the greedy method selects an element to add to $\mathcal{S}$ that maximizes the gain. It does not need to compute the exact 2-norm $\|q(\cdot) - p(\cdot)\|_2^2$ in each iteration, only the increment for each of the $n$ candidate elements. To compute an increment, no more than $|\mathcal{X}_\mathcal{S}|$ terms are added, which takes $O(|\mathcal{X}_\mathcal{S}|)$ time. Altogether, the greedy method requires $O(nk|\mathcal{X}_\mathcal{S}|)$ time to operate, or $O(nk\,m^k)$ in our integer lattice setting, which is far less than the enumeration method's $O(n^k m^n)$.
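As a concrete reference, here is a minimal sketch of Algorithm 2 for the integer-lattice domain; the array layout and helper names are illustrative assumptions. The inner projection onto $\mathcal{P}_\mathcal{S}$ is implemented via the standard sort-based Euclidean projection onto the simplex (cf. [29]), which, for inputs that are genuine distributions, reduces to the closed form derived above.

```python
import numpy as np

def lattice_mask(shape, S):
    """Boolean mask over X = {0,...,m-1}^n selecting X_S: points whose
    nonzero coordinates all lie in the support set S."""
    n = len(shape)
    mask = np.ones(shape, dtype=bool)
    for i in range(n):
        if i not in S:
            idx = [slice(None)] * n
            idx[i] = slice(1, None)   # entries with x_i >= 1 lie outside X_S
            mask[tuple(idx)] = False
    return mask

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def project_onto_PS(q, S):
    """2-norm projection of an arbitrary q : X -> R onto P_S.  When q is a
    distribution, this reduces to the closed form q(x) + (1 - C)/|X_S| on X_S."""
    mask = lattice_mask(q.shape, S)
    p = np.zeros_like(q)
    p[mask] = project_to_simplex(q[mask])
    return p

def greedy_sparse_projection(q, k):
    """Sketch of Algorithm 2 (GSProj): greedy support selection for the D_k projection."""
    n, S = q.ndim, set()
    while len(S) < k:
        errs = {i: np.sum((project_onto_PS(q, S | {i}) - q) ** 2)
                for i in set(range(n)) - S}
        S.add(min(errs, key=errs.get))
    return project_onto_PS(q, S), S
```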
3.4 When Greedy is Good

We have shown in the proof of Theorem 2 that there always exist extreme examples that are hard to solve. Thus, in the most general sense, and without further assumptions, one can find pathological cases which make the problem hard. However, we find that the greedy approach works well empirically. In this section, we consider sufficient conditions for tractability of the problem.

Our conditions boil down to structural assumptions on $F[\cdot]$ that match standard assumptions in the literature. To build further intuition, consider line 4 in Algorithm 1, where the argument passed to the greedy method is $q(\cdot) = p(\cdot) - \mu \frac{\delta F}{\delta p}(\cdot)$, and $p(\cdot)$ is already a $k$-sparse distribution. Denote the support of $p(\cdot)$ as $\mathcal{S}$; we have $|\mathcal{S}| \le k$. Therefore, $q(\cdot)$ is close to $k$-sparse when the step size $\mu$ is small. Thus, while the general problem (2) may be much harder, there is reason to conjecture that under certain conditions a simple greedy algorithm performs well. Next, we state these assumptions formally.

Assumption 1 (Strong Convexity/Smoothness). The objective $F[\cdot]$ satisfies strong convexity/smoothness with respect to $\alpha$ and $\beta$ if:
$$\frac{\alpha}{2} \|p_1(\cdot) - p_2(\cdot)\|_2^2 \;\le\; F[p_1(\cdot)] - F[p_2(\cdot)] - \Big\langle \frac{\delta F}{\delta p_2}(\cdot),\; p_1(\cdot) - p_2(\cdot) \Big\rangle \;\le\; \frac{\beta}{2} \|p_1(\cdot) - p_2(\cdot)\|_2^2.$$
For the sake of simplicity in exposition, we have assumed strong convexity to hold over the entire domain (which can be a restrictive assumption). As will be clear from the proof analysis, this assumption can easily be tightened to a restricted strong convexity assumption; see, e.g., [30]. This detail is left for a longer version of this manuscript.

Assumption 2 (Lipschitz Condition). The functional $F : \mathcal{P} \to \mathbb{R}$ satisfies the Lipschitz condition with respect to $L$ in the $k$-sparse domain $\mathcal{D}_k$ if
$$|F[p_1(\cdot)] - F[p_2(\cdot)]| \le L \|p_1(\cdot) - p_2(\cdot)\|_2.$$
This assumption implies that the variational derivative is bounded on the $k$-sparse domain, i.e., $\|\frac{\delta F}{\delta p}(\cdot)\|_2 \le L$ over $\mathcal{D}_k$.

Using the strong convexity, smoothness, and Lipschitz assumptions, we are able to provide an analysis of when greedy works well. This is encapsulated in Theorem 3.

Theorem 3. Given the $n$-dimensional function $q(\cdot) = p(\cdot) - \mu \frac{\delta F}{\delta p}(\cdot)$, where $p(\cdot)$ is an $n$-dimensional $k$-sparse distribution and $\mathrm{supp}(p(\cdot)) = \mathcal{S}_0$, Algorithm 2 finds the optimal projection onto the domain $\mathcal{P}_{\mathcal{S}_0}$ if $F[\cdot]$ satisfies Assumption 2, $\mu$ is sufficiently small, and there are enough positions $\mathbf{x} \in \mathcal{X}_{\mathcal{S}_0}$ where $p(\mathbf{x}) > 0$, i.e., inequalities (6) and (9) in the supplementary material are satisfied.

3.5 Convergence Analysis

Next, we analyze the convergence of the overall Algorithm 1 with greedy projections. While Theorem 3 provides sufficient conditions for exact projection using the greedy approach, in practice, due to computational precision issues and/or violation of the stated assumptions, the solution may not be an exact projection. Thus, it is prudent to assume that the inner projection subproblem is solved only up to some approximation, as quantified in the following.

Definition 5 (Approximate 2-norm projection). We define $\hat{\Pi}_{\mathcal{D}_k}(\cdot)$ as the approximate projection onto the sparsity domain and distribution space, with approximation parameter $\phi$, as:
$$\big\|p(\cdot) - \hat{\Pi}_{\mathcal{D}_k}(p(\cdot))\big\|_2^2 \le (1 + \phi) \big\|p(\cdot) - \Pi_{\mathcal{D}_k}(p(\cdot))\big\|_2^2.$$

Next, we present our main convergence theorem.

Theorem 4. Suppose $F$ satisfies Assumptions 1 and 2. Furthermore, assume that the projection step in Algorithm 1 is solved $\phi$-approximately. Let the step size be $\mu = 1/\beta$, and $\|p_0(\cdot) - p^\star(\cdot)\|_2 \le L/(2\alpha)$. Then, if $\beta/\alpha \in \big(2 - \tfrac{1}{1+\phi},\, 2\big)$, IHT (Algorithm 1) with
$$T \ge \log_{1/\rho}\Big( \frac{F[p_0(\cdot)] - F[p^\star(\cdot)] - c}{\epsilon} \Big)$$
iterations achieves $F[p_T(\cdot)] \le F[p^\star(\cdot)] + c + \epsilon$, where $\rho = 1 - (1+\phi)(2 - \beta/\alpha)$ and
$$c = \frac{\big(\phi/(2\beta) + (1+\phi)(\beta - \alpha)/(2\alpha^2)\big) L^2}{(1+\phi)(2 - \beta/\alpha)}.$$
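To relate these assumptions to a concrete objective (an illustrative check, not part of the paper), note that for the squared 2-norm objective $F[p] = \|p(\cdot) - q(\cdot)\|_2^2$ used in Section 4, the identity $F[p_1] - F[p_2] - \langle \frac{\delta F}{\delta p_2}(\cdot), p_1 - p_2 \rangle = \|p_1 - p_2\|_2^2$ holds exactly, so Assumption 1 is met with $\alpha = \beta = 2$ and the step size of Theorem 4 becomes $\mu = 1/\beta = 1/2$. The snippet below verifies this numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
num_states = 64
q = rng.random(num_states); q /= q.sum()        # target distribution
p1 = rng.random(num_states); p1 /= p1.sum()
p2 = rng.random(num_states); p2 /= p2.sum()

F = lambda p: np.sum((p - q) ** 2)              # F[p] = ||p - q||_2^2
grad = lambda p: 2.0 * (p - q)                  # its variational derivative

gap = F(p1) - F(p2) - np.dot(grad(p2), p1 - p2)
dist_sq = np.sum((p1 - p2) ** 2)
# Assumption 1 with alpha = beta = 2: the gap equals (2/2) * ||p1 - p2||_2^2 exactly.
print(np.isclose(gap, dist_sq))                 # True
print("step size mu = 1/beta =", 1.0 / 2.0)
```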
4 Experiments

Algorithm 3 Greedy Selection
1: Input: $F[\cdot] : \mathcal{P} \to \mathbb{R}$, $k \in \mathbb{Z}_+$. Output: $p_T \in \mathcal{D}_k$
2: $\mathcal{S} := \emptyset$
3: while $|\mathcal{S}| < k$ do
4:   $j \in \arg\min_{i \in [n] \setminus \mathcal{S}} \{ \min_{p \in \mathcal{P}_{\mathcal{S} \cup i}} F[p(\cdot)] \}$
5:   $\mathcal{S} := \mathcal{S} \cup j$
6: end while
7: return $\arg\min_{p \in \mathcal{P}_\mathcal{S}} F[p(\cdot)]$

We evaluate our algorithm on different convex objectives, namely the 2-norm distance and the KL divergence. As mentioned before, there are no theoretically guaranteed algorithms for 2-norm distance minimization under a sparsity constraint. To investigate the optimality of the algorithms, we consider simulated experiments of sufficiently small size that the global optimum can be found by exhaustive enumeration.

IHT implementation details. For IHT, the step size is chosen by a simple strategy: given an initial step size, we double the step size when IHT is trapped in a local optimum, and return to the initial step size after escaping. We report results along the entire solution path.

Baseline: Forward Greedy Selection. Unfortunately, we are unaware of optimization algorithms for sparse probability estimation with general losses. As a simple baseline, we consider greedy selection with respect to the objective, given as Algorithm 3. For certain special cases, e.g., the KL objective, Algorithm 3 can be applied efficiently and is effective in practice [5].

4.1 Simulated Data

We set the dimension $n = 15$, the number of entries $m = 2$, and the sparsity level $k = 7$. That is, $\mathcal{X} = \{0, 1\}^{15}$ is a 15-dimensional binary vector space, with cardinality $|\mathcal{X}| = 2^{15} = 32768$. The distribution $p : \mathcal{X} \to [0, 1]$ satisfies $\sum_{\mathbf{x} \in \mathcal{X}} p(\mathbf{x}) = 1$. The sparsity constraint is designed to fix a support $\mathcal{S} : |\mathcal{S}| \le 7$, such that any $\mathbf{x}$ with $p(\mathbf{x}) > 0$ has $\mathrm{supp}(\mathbf{x}) \subseteq \mathcal{S}$. Thus, finding the optimal solution requires enumerating $\binom{15}{7} = 6435$ possible supports.

The 2-norm minimization objective is $F[p(\cdot)] = \|p(\cdot) - q(\cdot)\|_2^2$, where $q(\cdot)$ is a distribution generated by randomly choosing 50 positions $\mathbf{x}_1, \dots, \mathbf{x}_{50} \in \mathcal{X}$ and assigning them random real numbers $c_1, \dots, c_{50}$ with $\sum_{i=1}^{50} c_i = 1$, while the other positions are assigned 0; i.e., $q(\mathbf{x}_i) = c_i$ for $i \in [50]$, and $q(\mathbf{x}) = 0$ otherwise. The initial step size is $\mu = 0.008$. Results are shown in Figure 2(a). The KL divergence objective is $F[p(\cdot)] = \mathrm{KL}(p(\cdot)\,\|\,q(\cdot)) = \sum_{\mathbf{x} \in \mathcal{X}} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})}$, where $q(\cdot)$ is a random distribution generated similarly to the $q(\cdot)$ in the 2-norm objective. The only difference is that $q(\mathbf{x})$ cannot be zero, as that would render the KL divergence undefined.

For the simulated experiments, we use the optimum to normalize the objective function as $\tilde{F}[p] = F[p] - F[p^\star]$, so that at the optimum $\tilde{F}[p^\star] = 0$. Three algorithms are compared in each experiment: IHT, Greedy, and IHT after Greedy. While IHT starts from a random initialization, IHT after Greedy is initialized with the result of Greedy. In each run, the distribution $q(\cdot)$ and the starting distribution $p_0(\cdot)$ for IHT are randomly generated. Each experiment is run 20 times. Results are presented in Figure 2, showing the mean and standard deviation of Greedy and IHT after Greedy; the standard deviation of IHT is similar to that of IHT after Greedy.

Figure 2: Simulated experiments, plotting $F[p] - F[p^\star]$ against iteration for IHT, Greedy, and IHT after Greedy. (a) Normalized 2-norm minimization. (b) Normalized KL minimization.

We use the 2-norm greedy projection within IHT in both experiments. Interestingly, this not only outperforms the 2-norm greedy projection itself (Figure 2(a)), but also outperforms Greedy on the KL objective (Figure 2(b)), where [5] suggests provably good performance. In particular, while the performance of Greedy can fluctuate severely, IHT (after Greedy) is stable in obtaining good results. Note that low variance is especially desirable when the algorithm is applied only a few times to save computation, as in large discrete optimization problems.
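For concreteness, the following sketch reproduces the flavor of the simulated 2-norm setup described above (random seed, helper names, and the reuse of the distribution_iht, project_onto_PS, and greedy_sparse_projection sketches from earlier sections are assumptions, not the paper's actual experiment code).

```python
import numpy as np

n, m, k = 15, 2, 7
rng = np.random.default_rng(0)

# Target q: 50 random atoms of X = {0,1}^15 with random weights summing to 1.
q = np.zeros((m,) * n)
atoms = rng.choice(m ** n, size=50, replace=False)
weights = rng.random(50); weights /= weights.sum()
q.flat[atoms] = weights

# 2-norm objective F[p] = ||p - q||_2^2 and its variational derivative.
F = lambda p: np.sum((p - q) ** 2)
grad_F = lambda p: 2.0 * (p - q)

# Random k-sparse initialization: project the uniform table onto a random support S0.
S0 = set(rng.choice(n, size=k, replace=False).tolist())
p0 = project_onto_PS(np.full((m,) * n, 1.0 / m ** n), S0)

project = lambda v, k: greedy_sparse_projection(v, k)[0]
p_hat = distribution_iht(grad_F, project, p0, k, step_size=0.008, num_iters=400)
print("final objective:", F(p_hat))
```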
4.2 Benchmark Data

Distribution compression / Compressed sensing. We apply our IHT to the task of expectation-preserving distribution compression, useful for efficiently storing large probability tables. Given a distribution $p(\cdot)$, our goal is to construct a sparse approximation $q(\cdot)$, such that $q(\cdot)$ approximately preserves expectations with respect to $p(\cdot)$. Interestingly, this model compression problem is equivalent to compressed sensing, but with the distributional constraints. Specifically, our goal is to find $q$ which minimizes $\|Aq - Ap\|_2^2$ subject to a $k$-sparsity constraint on $q$. The model is evaluated with respect to moment reconstruction $\|Bq - Bp\|_2^2$ for a new "sensing" matrix $B$. Our experiments use real data from the Texas hospital discharge public use dataset. IHT is compared to post-processed Lasso and Random: Lasso ignores the simplex constraints during optimization and then projects the results onto the simplex, while Random is a naive baseline of random distributions. Figure 3(a) shows that IHT significantly outperforms the baselines. Additional details are provided in Appendix H due to limited space.

Dataset compression. We study representative prototype selection for the Digits data [31]. Prototypes are representative examples chosen from the data in order to achieve dataset compression. Our optimization objective is the Maximum Mean Discrepancy (MMD) between the discrete data distribution and the sparse data distribution representing the selected samples. We evaluate performance using the prototype nearest-neighbor classification error on a test dataset. We compare against two forward-selection greedy variants (Local Greedy and Global Greedy) proposed by [32] and the prototype selection algorithm (labeled PS) proposed by [33], both state of the art. The results are presented in Figure 3(b), showing that IHT outperforms all baselines. Additional experimental details are provided in Appendix H due to limited space.

Figure 3: (a) Compression / compressed sensing: test error at varying sparsity $k$, comparing IHT, Lasso, and Random. (b) Dataset compression: test classification error of the prototype nearest-neighbor classifier at varying numbers of selected prototypes, comparing IHT, PS, Global Greedy, and Local Greedy.
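To make the compressed-sensing formulation above concrete, here is a hedged sketch: the sensing matrix is random Gaussian and the probability table is a synthetic stand-in (not the Texas hospital data), and it reuses the distribution_iht and project_to_simplex sketches from Section 3. The objective $F[q] = \|Aq - Ap\|_2^2$ over the flattened table has variational derivative $2A^\top A(q - p)$; since sparsity here is over the states of $q$ itself (the vector-sparsity setting), the projection keeps the $k$ largest entries and renormalizes via a simplex projection, in the spirit of [29].

```python
import numpy as np

rng = np.random.default_rng(2)
num_states, num_moments, k = 512, 64, 32    # |X|, rows of A, target sparsity

# Synthetic stand-ins: a dense probability table p and a random sensing matrix A.
p = rng.random(num_states); p /= p.sum()
A = rng.standard_normal((num_moments, num_states))

# Expectation-preserving compression objective and its variational derivative.
F = lambda q: np.sum((A @ q - A @ p) ** 2)
grad_F = lambda q: 2.0 * A.T @ (A @ q - A @ p)

def project_topk_simplex(v, k):
    """Keep the k largest entries of v and project them onto the simplex."""
    idx = np.argsort(v)[-k:]
    q = np.zeros_like(v)
    q[idx] = project_to_simplex(v[idx])     # from the GSProj sketch in Section 3.3
    return q

beta = 2.0 * np.linalg.norm(A, 2) ** 2      # smoothness constant of F
q0 = project_topk_simplex(p, k)
q_hat = distribution_iht(grad_F, project_topk_simplex, q0,
                         k, step_size=1.0 / beta, num_iters=200)
print("moment error:", F(q_hat))
```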
5 Conclusion and Future Work

In this work, we proposed the use of IHT for learning discrete sparse distributions. We study several theoretical properties of the algorithm from an optimization viewpoint, and propose practical solutions to otherwise hard problems. There are several possible future directions of research. We have analyzed discrete distributions with sparsity constraints; the obvious extensions are to the space of continuous measures and to structured sparsity constraints. Is there a bigger class of constraints for which a tractable projection algorithm exists? Can we improve the sufficient conditions under which projections are provably close to the optimum projection? Finally, more in-depth empirical studies against other state-of-the-art algorithms should be very interesting and useful to the community.

References

[1] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023-1032, 1988.
[2] Hemant Ishwaran and J. Sunil Rao. Spike and slab variable selection: Frequentist and Bayesian strategies. Annals of Statistics, 33(2):730-773, 2005.
[3] Edward I. George and Robert E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881-889, 1993.
[4] Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681-686, 2008.
[5] Oluwasanmi O. Koyejo, Rajiv Khanna, Joydeep Ghosh, and Russell Poldrack. On prior distributions and approximate inference for structured variables. In Advances in Neural Information Processing Systems, pages 676-684, 2014.
[6] Rajiv Khanna, Joydeep Ghosh, Russell Poldrack, and Oluwasanmi Koyejo. Sparse submodular probabilistic PCA. In Artificial Intelligence and Statistics, pages 453-461, 2015.
[7] Rajiv Khanna, Joydeep Ghosh, Russell Poldrack, and Oluwasanmi Koyejo. Information projection and approximate inference for structured sparse variables. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1358-1366. PMLR, 2017.
[8] Carlos M. Carvalho, Nicholas G. Polson, and James G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465-480, 2010.
[9] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Proceedings of the 31st Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 2093-3027. PMLR, 2018.
[10] Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, and Gunnar Rätsch. Boosting variational inference: an optimization perspective. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 464-472. PMLR, 2018.
[11] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.
[12] Ferenc Huszár and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, pages 377-386, 2012.
[13] Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, and Gunnar Rätsch. Boosting black box variational inference. In Advances in Neural Information Processing Systems 31, pages 3401-3411. Curran Associates, Inc., 2018.
[14] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. CoRR, 2008.
[15] Anastasios Kyrillidis and Volkan Cevher. Recipes on hard thresholding methods. In 2011 4th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 353-356. IEEE, 2011.
[16] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301-321, 2009.
[17] Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230-2249, 2009.
[18] Joel A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231-2242, 2004.
[19] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 685-693, 2014.
[20] Rajiv Khanna and Anastasios Kyrillidis. IHT dies hard: Provable accelerated iterative hard thresholding. In AISTATS, 2018.
[21] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.
[22] Robert Tibshirani, Martin Wainwright, and Trevor Hastie. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.
[23] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for Machine Learning. MIT Press, 2012.
[24] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. Journal of Machine Learning Research, 12:3371-3412, 2011.
[25] Eberhard Engel and Reiner M. Dreizler. Density Functional Theory. Springer, 2011.
[26] Jon Kleinberg and Éva Tardos. Algorithm Design. Pearson Education India, 2006.
[27] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272-279. ACM, 2008.
[28] Anastasios Kyrillidis and Volkan Cevher. Matrix recipes for hard thresholding methods. Journal of Mathematical Imaging and Vision, 48(2):235-265, 2014.
[29] Anastasios Kyrillidis, Stephen Becker, Volkan Cevher, and Christoph Koch. Sparse projections onto the simplex. In International Conference on Machine Learning, pages 235-243, 2013.
[30] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37-45, 2010.
[31] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554, 1994.
[32] Been Kim, Rajiv Khanna, and Oluwasanmi O. Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems 29, pages 2280-2288. Curran Associates, Inc., 2016.
[33] Jacob Bien and Robert Tibshirani. Prototype selection for interpretable classification. Annals of Applied Statistics, 5(4):2403-2424, 2011.