Learning Sparse Distributions using Iterative Hard Thresholding

Jacky Y. Zhang, Department of Computer Science, University of Illinois at Urbana-Champaign, yiboz@illinois.edu
Rajiv Khanna, Department of Statistics, University of California at Berkeley, rajivak@berkeley.edu
Anastasios Kyrillidis, Department of Computer Science, Rice University
Oluwasanmi Koyejo, Department of Computer Science, University of Illinois at Urbana-Champaign, sanmi@illinois.edu

Abstract

Iterative hard thresholding (IHT) is a projected gradient descent algorithm, known to achieve state-of-the-art performance for a wide range of structured estimation problems, such as sparse inference. In this work, we consider IHT as a solution to the problem of learning sparse discrete distributions. We study the hardness of using IHT on the space of measures. As a practical alternative, we propose a greedy approximate projection which simultaneously captures appropriate notions of sparsity in distributions while satisfying the simplex constraint, and we investigate the convergence behavior of the resulting procedure in various settings. Our results show, both in theory and in practice, that IHT can achieve state-of-the-art results for learning sparse distributions.

1 Introduction

Probabilistic models provide a flexible approach for capturing uncertainty in real-world processes, with a variety of applications that include latent variable models and density estimation, among others. Like other machine learning tools, probabilistic models can be enhanced by encouraging parsimony, as this captures useful inductive biases. In practice, parsimony often improves the interpretability and generalization performance of the resulting models, and is particularly useful in applied settings with limited samples compared to the model degrees of freedom. One of the most effective parsimonious assumptions is sparsity. As such, learning sparse distributions is a problem of broad interest in machine learning, with many applications [1-7].

The majority of approaches for sparse probabilistic modeling have focused on the construction of appropriate priors based on inputs from domain experts. The technical challenges there involve prior design and inference [3, 1, 8], including methods that are additionally designed to exploit special structure [5, 4, 7]. More recently, there has been interest in studying these algorithmic approaches from an optimization perspective [9-11], with the goal of a deeper understanding and, in some cases, even suggesting improvements over previous methods [12, 13].

In this work, we consider an optimization-based approach to learning sparse discrete distributions. Despite wide applicability, when compared to classical constrained optimization, there are limited studies that focus on understanding, both in theory and in practice, optimization methods over the space of probability densities under sparsity constraints. Our present work proposes and investigates the use of Iterative Hard Thresholding (IHT [14-18]) for the problem of sparse probabilistic estimation. IHT is an iterative algorithm that is well studied in the classical optimization literature. Further, there are known worst-case convergence guarantees and empirical studies [19, 20] that vouch for its performance.
Our goal in this work is to investigate the convergence properties of IHT when applied to probability densities, and to evaluate its efficacy for learning sparse distributions. However, transferring this algorithm from vector and matrix spaces to the space of measures is not straightforward. While several of the technical pieces, such as the existence of a variational derivative and a normed structure, fall into place, the algorithm is an iterative one that involves solving a projection subproblem in each iteration. We show that this subproblem is computationally hard in general, but provide an approximate procedure that we analyze under certain assumptions.

Our contributions in this work are algorithmic and theoretical, with proof-of-concept empirical evaluation. We briefly summarize our contributions below.

- We propose the use of classical IHT for learning sparse distributions, and show that the space of measures meets the structural requirements for IHT.
- We study in depth the hardness of the projection subproblem, showing that it is NP-hard and that no polynomial-time algorithm exists that can solve it with approximation guarantees.
- Since the projection problem is solved in every iteration, we propose a simple greedy algorithm and provide sufficient theoretical conditions under which the algorithm provably approximates the otherwise hard projection problem.
- We draw on techniques from classical optimization to provide convergence rates for the overall IHT algorithm; i.e., we study after how many iterations the algorithm is guaranteed to be within some small $\epsilon$ of the true optimum.
- In addition to our conceptual and theoretical results, we present empirical studies that support our claims.

2 Problem statement

Preliminaries. We use bold characters to denote vectors. Given a vector $\mathbf{v}$, we use $v_i$ to represent its $i$-th entry. We use calligraphic upper-case letters to denote sets, e.g., $\mathcal{S}$. With a slight abuse of notation, we will use lower-case letters to denote probability distributions, e.g., $p, q$, as well as functions, e.g., $f$. The distinction from scalars will be apparent from the context; we usually append functions with parentheses to distinguish them from scalars. We use upper-case letters to denote functionals, i.e., functions that take other functions as input, e.g., $F[p(\cdot)]$. We use $[n]$ to denote the set $\{1, 2, \dots, n\}$. Given a set of indices $\mathcal{S} \subseteq [n]$, we denote the cardinality of $\mathcal{S}$ as $|\mathcal{S}|$. Given a vector $\mathbf{x}$, we denote its support set, i.e., the set of non-zero entries, as $\mathrm{supp}(\mathbf{x})$. We use $\mathbb{P}\{e\}$ to denote the probability of event $e$.

Let $\mathcal{P}$ denote the set of discrete $n$-dimensional probability densities on an $n$-dimensional domain $\mathcal{X}$:
$$\mathcal{P} = \Big\{ p(\cdot) : \mathcal{X} \to \mathbb{R}_+ \;\Big|\; \textstyle\sum_{\mathbf{x} \in \mathcal{X}} p(\mathbf{x}) = 1 \Big\}.$$
Let $\mathcal{S} \subseteq [n]$ denote a support set where $|\mathcal{S}| = k < n$. Let $\mathcal{X}_\mathcal{S} \subseteq \mathcal{X}$ denote the set of variables with support $\mathcal{S}$, i.e., $\mathcal{X}_\mathcal{S} = \{\mathbf{x} \in \mathcal{X} \mid \mathrm{supp}(\mathbf{x}) \subseteq \mathcal{S}\}$. The set of domain-restricted densities, denoted by $\mathcal{P}_\mathcal{S}$, is the set of probability density functions supported on $\mathcal{X}_\mathcal{S}$, i.e., $\mathcal{P}_\mathcal{S} = \{q(\cdot) \in \mathcal{P} \mid \forall \mathbf{x} \notin \mathcal{X}_\mathcal{S},\; q(\mathbf{x}) = 0\}$. Inversely, we denote the support of a domain-restricted density $q(\cdot) \in \mathcal{P}_\mathcal{S}$ as $\mathrm{supp}(q) = \mathcal{S}$. Next, we define the notion of sparse distributions.

Definition 1 (Distribution Sparsity [5]). Let $\mathcal{D}_k = \bigcup_{|\mathcal{S}| \le k} \mathcal{P}_\mathcal{S} \subseteq \mathcal{P}$, i.e., the union of all possible $k$-sparse support domain-restricted densities. We say that $p(\cdot)$ is $k$-sparse if $p(\cdot) \in \mathcal{D}_k$.
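To make Definition 1 concrete, the following is a minimal illustrative sketch (not code from the paper), assuming the integer-lattice domain introduced later. It stores a discrete distribution as a dictionary of atoms and tests membership in $\mathcal{D}_k$ by collecting the union of coordinate supports of all atoms with positive mass.

```python
def support_of_distribution(p, tol=0.0):
    """Union of coordinate supports of all atoms with positive mass.

    p: dict mapping length-n tuples (points of the domain X) to probabilities.
    Returns the smallest S such that every x with p(x) > 0 has supp(x) inside S.
    """
    S = set()
    for x, mass in p.items():
        if mass > tol:
            S |= {i for i, xi in enumerate(x) if xi != 0}
    return S

def is_k_sparse(p, k):
    """Definition 1: p is k-sparse iff all atoms fit inside some support of size <= k."""
    return len(support_of_distribution(p)) <= k

# Example on the binary domain X = {0, 1}^4:
p = {(1, 0, 1, 0): 0.5, (0, 0, 1, 0): 0.5}   # atoms only touch coordinates {0, 2}
q = {(1, 1, 1, 0): 0.5, (0, 0, 0, 1): 0.5}   # atoms touch coordinates {0, 1, 2, 3}
print(is_k_sparse(p, k=2))  # True
print(is_k_sparse(q, k=2))  # False
```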
Note that while each component $\mathcal{P}_\mathcal{S}$ is a convex set, the union $\mathcal{D}_k$ is not. To see this, consider the convex combination of two $k$-sparse distributions $p_1$ and $p_2$ with disjoint supports $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively. In general, the convex combination $\lambda p_1(\cdot) + (1-\lambda) p_2(\cdot)$, $0 < \lambda < 1$, has larger support; i.e., $|\mathcal{S}_1 \cup \mathcal{S}_2| > k$.

As an aside, we note that, unlike the vector case, it is straightforward to construct multiple definitions of distribution sparsity. For instance, another reasonable definition is via the set $\mathcal{D}'_k = \{p(\cdot) \in \mathcal{P} \mid p(\mathbf{x}) = 0 \text{ for all } \|\mathbf{x}\|_0 > k\}$; i.e., distributions that assign zero probability mass to non-$k$-sparse vectors. Interestingly, $\mathcal{D}_k \subseteq \mathcal{D}'_k \subseteq \mathcal{P}$ in general, as any distribution in $\mathcal{D}_k$ must have a support of size at most $k$, which is not necessary for distributions in $\mathcal{D}'_k$. Motivated by prior work [5], we use Definition 1 in this work.

Vector sparsity. While the proposed framework is developed for a specialized notion of sparsity, i.e., along the dimensions of a multivariate discrete distribution, it is also applicable to alternative notions of distribution sparsity. One common setting is sparsity of the distribution itself, $p(\cdot)$, when represented as a vector, e.g., sparsifying the number of valid states of a univariate distribution such as a histogram. We outline how our framework can be applied to this setting in Appendix A.

Problem setting. In this work, we focus on studying sparsity for the case of discrete densities. In particular, $\mathcal{X} \subseteq \mathbb{Z}^n$; i.e., $\mathbf{x}$ is an integer vector such that
$$\mathcal{X} = \{\mathbf{x} \in \mathbb{Z}^n \mid \forall i \in [n],\; 0 \le x_i \le m-1\},$$
where $m$ is an integer. Therefore, $\mathbf{x}$ has $m^n$ valid positions. In other words, if we denote by $X$ a random variable from that distribution, then $X \in \mathcal{X}$ has $m^n$ possible values, and $\mathbb{P}\{X = \mathbf{x}\} = p(\mathbf{x})$.

Given a cost functional over distributions $F[\cdot] : \mathcal{P} \to \mathbb{R}$, we are interested in the following optimization criterion:
$$\min_{q} \; F[q] \quad \text{subject to} \quad q \in \mathcal{D}_k, \qquad (1)$$
where $\mathcal{D}_k = \bigcup_{\mathcal{S} : |\mathcal{S}| \le k} \mathcal{P}_\mathcal{S} \subseteq \mathcal{P}$ is the $k$-sparsity constraint, as in Definition 1. In words, we are interested in finding a distribution, denoted $q(\cdot)$, that lies in the $k$-sparse set of distributions and minimizes the cost functional $F[\cdot]$. This is similar to classical sparse optimization problems in the literature [21-24], but there are fundamental difficulties, both in theory and in practice, that require a different approach than standard iterative hard thresholding algorithms [14-18]. We assume that the objective $F[\cdot]$ is a convex functional over distributions.

Definition 2 (Convexity of $F[\cdot]$). The functional $F[\cdot] : \mathcal{P} \to \mathbb{R}$ is convex if:
$$F[\lambda q(\cdot) + (1-\lambda) p(\cdot)] \le \lambda F[q(\cdot)] + (1-\lambda) F[p(\cdot)],$$
for all $q(\cdot), p(\cdot) \in \mathcal{P}$ and $\lambda \in [0, 1]$.

Observe that, while $F[\cdot]$ is a convex functional, and $\mathcal{P}$ and $\mathcal{P}_\mathcal{S}$ are convex sets, $\mathcal{D}_k$ is not a convex set. Hence, the optimization problem (1) is not a convex program. Following the projected gradient descent approach, we require definitions of the gradient of $F[\cdot]$, as well as a definition of the projection.

Definition 3 (Variational Derivative [25]). The variational derivative of $F[\cdot] : \mathcal{P} \to \mathbb{R}$ is a function, denoted $\frac{\delta F}{\delta q}(\cdot) : \mathcal{X} \to \mathbb{R}$, that satisfies:
$$\int \frac{\delta F}{\delta q}(\mathbf{x}) \, \phi(\mathbf{x}) \, d\mathbf{x} = \frac{\partial F[q + \epsilon \phi]}{\partial \epsilon} \bigg|_{\epsilon = 0},$$
where $\phi : \mathcal{X} \to \mathbb{R}$ is an arbitrary function.

Definition 4 (First-order Convexity). The functional $F[\cdot] : \mathcal{P} \to \mathbb{R}$ is convex if:
$$F[q(\cdot)] \ge F[p(\cdot)] + \Big\langle \frac{\delta F}{\delta p}(\cdot),\; q(\cdot) - p(\cdot) \Big\rangle$$
for all $q(\cdot), p(\cdot) \in \mathcal{P}$. Here, we use the standard inner product for two densities: $\langle q(\cdot), p(\cdot) \rangle = \int_{\mathbf{x}} q(\mathbf{x}) p(\mathbf{x}) \, d\mathbf{x}$, or $\langle q(\cdot), p(\cdot) \rangle = \sum_{\mathbf{x}} q(\mathbf{x}) p(\mathbf{x})$ in the discrete setting.
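As an illustration of Definition 3 in the discrete setting (a hedged sketch, not code from the paper), the variational derivative of the squared 2-norm objective $F[q] = \|q(\cdot) - p(\cdot)\|_2^2$ is $2(q(\mathbf{x}) - p(\mathbf{x}))$. The snippet below checks this closed form against the directional derivative via finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states = 32                      # |X|, e.g. the m^n states flattened into one axis

# A target density p and a current density q on the discrete domain.
p = rng.random(num_states); p /= p.sum()
q = rng.random(num_states); q /= q.sum()

def F(q):
    """Squared 2-norm objective F[q] = ||q - p||_2^2."""
    return np.sum((q - p) ** 2)

def variational_derivative(q):
    """Closed form of dF/dq for the squared 2-norm objective: 2 (q(x) - p(x))."""
    return 2.0 * (q - p)

# Definition 3: sum_x dF/dq(x) phi(x) should equal d/de F[q + e*phi] at e = 0.
phi = rng.standard_normal(num_states)
eps = 1e-6
lhs = np.dot(variational_derivative(q), phi)
rhs = (F(q + eps * phi) - F(q - eps * phi)) / (2 * eps)
print(abs(lhs - rhs))  # agreement up to numerical error
```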
3 Algorithms

Recall that our goal is to solve the optimization problem (1). A natural way to solve it in an iterative fashion is using projected gradient descent, where the projection step is onto the set of sparse distributions $\mathcal{D}_k$. This analogy makes the connection to iterative hard thresholding (IHT) algorithms, where the iterative recursion is:
$$p_{t+1}(\cdot) = \Pi_{\mathcal{D}_k}\Big( p_t(\cdot) - \mu \tfrac{\delta F}{\delta p_t}(\cdot) \Big),$$
where $p_t(\cdot)$ denotes the current iterate, and $\Pi_{\mathcal{D}_k}(\cdot)$ denotes, in an abstract sense, the projection of the distribution function onto the set of sparse distribution functions.

Algorithm 1 Distribution IHT
1: Input: $F[\cdot] : \mathcal{P} \to \mathbb{R}$, $k \in \mathbb{Z}_+$, number of iterations $T$, $p_0(\cdot) \in \mathcal{D}_k$, step size $\mu$. Output: $p_T \in \mathcal{D}_k$
2: $t \leftarrow 0$
3: while $t < T$ do
4:   $q_{t+1}(\cdot) = p_t(\cdot) - \mu \frac{\delta F}{\delta p_t}(\cdot)$
5:   $p_{t+1}(\cdot) = \Pi_{\mathcal{D}_k}(q_{t+1})$
6: end while
7: return $p_T(\cdot)$

The subsequent steps are analogous to those of regular IHT: given an initialization point, we iteratively i) compute the gradient, ii) perform the gradient step with step size $\mu$, and iii) ensure the computed approximate solution satisfies our constraint in each iteration by projecting onto $\mathcal{D}_k$.

3.1 Projection onto $\mathcal{D}_k$

Consider the projection step with respect to the 2-norm, i.e.,
$$\Pi_{\mathcal{D}_k}(p(\cdot)) := \arg\min_{q(\cdot) \in \mathcal{D}_k} \|q(\cdot) - p(\cdot)\|_2, \qquad (2)$$
where the 2-norm is defined by the aforementioned inner product $\langle q(\cdot), p(\cdot) \rangle = \sum_{\mathbf{x}} q(\mathbf{x}) p(\mathbf{x})$. (Figure 1 illustrates the projection onto $\mathcal{D}_k$, with $q = \Pi_{\mathcal{D}_k}(p)$.) The set $\mathcal{D}_k = \bigcup_{|\mathcal{S}| \le k} \mathcal{P}_\mathcal{S}$ is a union of $\binom{n}{k} = O(n^k)$ sparse sets $\mathcal{P}_\mathcal{S}$ with different supports. Thus, if we denote by $T_{\mathrm{proj}}$ the time to compute $\Pi_{\mathcal{P}_\mathcal{S}}(p(\cdot))$, then we need $O(n^k \, T_{\mathrm{proj}})$ time for the $\mathcal{D}_k$ projection using naive enumeration. One may reasonably conjecture the existence of more efficient implementations of the exact projection in (2), e.g., in polynomial time. In the following, we show that this is not the case.

3.2 On the tractability of sparse distribution 2-norm projection

The projection (2) is solved repeatedly within IHT (step 5 in Algorithm 1). Thus, for the algorithm to be practical, it is important to study the tractability of the projection step. The combinatorial nature of $\mathcal{D}_k$ hints that this might not be the case.

Theorem 1. The sparse distribution 2-norm projection problem (2) is NP-hard.

Sketch of proof: We show that the subset selection problem [26] can be reduced to the 2-norm projection problem. The complete proof is provided in the supplementary material.

As an alternative route, NP-hard problems can often be tackled sufficiently well by using approximate methods. However, the following theorem states that the sparsity-constrained optimization problem in (2) is hard even to approximate, in the sense that no deterministic approximation algorithm exists that solves it in polynomial time.

Theorem 2. There exists no deterministic algorithm that can provide a constant-factor approximation for the sparse distribution 2-norm projection problem in polynomial time. Formally, for a given $q : \mathcal{X} \to \mathbb{R}$ with $\mathcal{X} \subseteq \mathbb{R}^n$, let $p^\star(\cdot)$ be the optimal 2-norm projection onto $\mathcal{D}_k$, and let $\hat{p}(\cdot)$ be the solution found by any algorithm that operates in $O(\mathrm{poly}(n))$ time. Then, we can design problem instances where the approximation ratio
$$\varphi = \frac{\|q(\cdot) - \hat{p}(\cdot)\|_2^2}{\|q(\cdot) - p^\star(\cdot)\|_2^2}$$
cannot be bounded. The proof of the theorem is provided in the supplementary material.

Through Theorems 1 and 2, we have shown that the sparse distribution 2-norm projection problem is hard, and thus the applicability of IHT on the space of densities does not seem well-established in practice. This may be surprising, in light of results in a variety of domains where IHT is known to be effective. For example, in the case of vectors, a simple $O(n)$ selection algorithm solves the projection problem optimally [27]. Similarly, on the space of matrices for low-rank IHT, the projection onto the top-$k$ ranks is optimally solved by an SVD [28].
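Before turning to the greedy projection, here is a minimal sketch of the outer loop of Algorithm 1 for a distribution stored as a flat array over the discrete domain. It assumes the caller supplies the variational derivative grad_F and a projection routine project_Dk (e.g., the greedy projection of Section 3.3); the names and signatures are illustrative, not from the paper.

```python
import numpy as np

def distribution_iht(grad_F, project_Dk, p0, k, step_size, num_iters):
    """Sketch of Algorithm 1 (Distribution IHT).

    grad_F(p)       : returns the variational derivative dF/dp as an array over X.
    project_Dk(q, k): returns an (approximate) 2-norm projection of q onto D_k,
                      i.e. a k-sparse distribution closest to q.
    p0              : initial k-sparse distribution (array over X, sums to 1).
    """
    p = p0.copy()
    for _ in range(num_iters):
        q = p - step_size * grad_F(p)   # gradient step in function space
        p = project_Dk(q, k)            # hard-thresholding step onto D_k
    return p
```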
3.3 A greedy approximation

Algorithm 2 Greedy Sparse Projection (GSProj)
1: Input: $n$-dimensional function $q : \mathcal{X} \to \mathbb{R}$ and sparsity level $k$.
2: Output: A distribution $p(\cdot) \in \mathcal{D}_k$
3: $\mathcal{S} := \emptyset$
4: while $|\mathcal{S}| < k$ do
5:   $j \in \arg\min_{i \in [n] \setminus \mathcal{S}} \min_{p \in \mathcal{P}_{\mathcal{S} \cup i}} \|p(\cdot) - q(\cdot)\|_2$
6:   $\mathcal{S} := \mathcal{S} \cup j$
7: end while
8: return $\arg\min_{p \in \mathcal{P}_\mathcal{S}} \|p(\cdot) - q(\cdot)\|_2$

In contrast to the results of Theorems 1 and 2, we have observed that a simple greedy support selection seems effective in practice. Thus, we simply consider replacing the exact projection onto $\mathcal{D}_k$ by greedy selection.

Consider Algorithm 2 when the input is not necessarily a distribution, i.e., $\sum_{\mathbf{x} \in \mathcal{X}} q(\mathbf{x}) \ne 1$. The key procedure of the projection is line 5, where the inner $\min$ is the projection of $q(\cdot)$ onto a set of domain-restricted densities. Let $\hat{p}(\cdot)$ denote this projection, i.e., $\hat{p}(\cdot) = \arg\min_{p(\cdot) \in \mathcal{P}_\mathcal{S}} \|p(\cdot) - q(\cdot)\|_2^2$. Since, by definition, $\hat{p}(\mathbf{x}) = 0$ for any $\mathbf{x} \notin \mathcal{X}_\mathcal{S}$, we only need to calculate $\hat{p}(\mathbf{x})$ for $\mathbf{x} \in \mathcal{X}_\mathcal{S}$, and this can be reformulated as:
$$\min_{p} \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} (p(\mathbf{x}) - q(\mathbf{x}))^2 \quad \text{s.t.} \quad \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} p(\mathbf{x}) = 1 \text{ and } \forall \mathbf{x} \in \mathcal{X}_\mathcal{S},\; p(\mathbf{x}) \ge 0,$$
which is essentially a 2-norm projection onto the simplex $\{p(\mathbf{x}) \mid \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} p(\mathbf{x}) = 1,\; \forall \mathbf{x} \in \mathcal{X}_\mathcal{S},\; p(\mathbf{x}) \ge 0\}$. This 2-norm projection onto the simplex can be solved efficiently and easily (see [29]).

When $p(\cdot)$ is a distribution, we can analytically compute its projection onto any support-restricted domain. Given support $\mathcal{S}$, the exact projection of a distribution $p(\cdot)$ onto $\mathcal{P}_\mathcal{S}$ is:
$$\arg\min_{q \in \mathcal{P}_\mathcal{S}} \|q(\cdot) - p(\cdot)\|_2^2. \qquad (3)$$
In our setting, the above problem can be written as
$$\arg\min_{q \in \mathcal{P}_\mathcal{S}} \|q(\cdot) - p(\cdot)\|_2^2 = \arg\min_{q \in \mathcal{P}_\mathcal{S}} \langle q(\cdot) - p(\cdot),\, q(\cdot) - p(\cdot) \rangle = \arg\min_{q \in \mathcal{P}_\mathcal{S}} \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} (q(\mathbf{x}) - p(\mathbf{x}))^2 + \sum_{\mathbf{x} \notin \mathcal{X}_\mathcal{S}} (q(\mathbf{x}) - p(\mathbf{x}))^2.$$
The last equality is due to the definitions of $\mathcal{P}_\mathcal{S}$ and $\mathcal{X}_\mathcal{S}$. Since $p(\cdot)$ is constant, and since $q \in \mathcal{P}_\mathcal{S}$ implies $q(\mathbf{x}) = 0$ for every $\mathbf{x} \notin \mathcal{X}_\mathcal{S}$, we can eliminate the last term. The resulting problem is:
$$\min_{q} \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} (q(\mathbf{x}) - p(\mathbf{x}))^2 \quad \text{s.t.} \quad \sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} q(\mathbf{x}) = 1. \qquad (4)$$
Let $\sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} p(\mathbf{x}) = C \le 1$. Applying the quadratic mean-arithmetic mean inequality to equation (4), we have:
$$\sum_{\mathbf{x} \in \mathcal{X}_\mathcal{S}} (q(\mathbf{x}) - p(\mathbf{x}))^2 \ge (1 - C)^2 / |\mathcal{X}_\mathcal{S}|.$$
Equality is achieved when $q(\mathbf{x}) - p(\mathbf{x})$ is the same for every $\mathbf{x} \in \mathcal{X}_\mathcal{S}$. Therefore, the optimal solution to problem (3) is:
$$q(\mathbf{x}) = \begin{cases} p(\mathbf{x}) + \frac{1 - C}{|\mathcal{X}_\mathcal{S}|}, & \mathbf{x} \in \mathcal{X}_\mathcal{S}, \\ 0, & \mathbf{x} \notin \mathcal{X}_\mathcal{S}. \end{cases}$$

Computational complexity. The time needed to solve problem (3) is $O(|\mathcal{X}_\mathcal{S}|)$, i.e., the time to compute $C$. However, to compute the norm $\|q(\cdot) - p(\cdot)\|_2^2$ we still need $O(|\mathcal{X}|)$ time, as $p(\mathbf{x})$ is not necessarily zero at any $\mathbf{x} \in \mathcal{X}$. As a result, we need $O(n^k (|\mathcal{X}| + |\mathcal{X}_\mathcal{S}|))$ time to enumerate for an optimal solution of the 2-norm projection. If we consider the integer lattice $\mathcal{X}$, as stated in the problem setting, then $|\mathcal{X}| = m^n$ and $|\mathcal{X}_\mathcal{S}| = m^k$, rendering the time complexity $O(n^k m^n)$. Algorithm 2, however, has much lower time complexity. In each of its $k$ iterations, the greedy method selects an element to add to $\mathcal{S}$ that maximizes the gain. It does not need to compute the exact 2-norm $\|q(\cdot) - p(\cdot)\|_2^2$ in each iteration, only the increment for each of the $n$ candidate elements. To compute an increment, no more than $|\mathcal{X}_\mathcal{S}|$ terms are added, which takes $O(|\mathcal{X}_\mathcal{S}|)$ time. Altogether, the greedy method requires $O(nk|\mathcal{X}_\mathcal{S}|)$ time to operate, or $O(nk\,m^k)$ in our integer lattice setting, which is far less than the enumeration method's $O(n^k m^n)$.
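As a concrete reference, here is a minimal sketch of Algorithm 2 for the integer-lattice domain; the array layout and helper names are illustrative assumptions. The inner projection onto $\mathcal{P}_\mathcal{S}$ is implemented via the standard sort-based Euclidean projection onto the simplex (cf. [29]), which, for inputs that are genuine distributions, reduces to the closed form derived above.

```python
import numpy as np

def lattice_mask(shape, S):
    """Boolean mask over X = {0,...,m-1}^n selecting X_S: points whose
    nonzero coordinates all lie in the support set S."""
    n = len(shape)
    mask = np.ones(shape, dtype=bool)
    for i in range(n):
        if i not in S:
            idx = [slice(None)] * n
            idx[i] = slice(1, None)   # entries with x_i >= 1 lie outside X_S
            mask[tuple(idx)] = False
    return mask

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(u) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def project_onto_PS(q, S):
    """2-norm projection of an arbitrary q : X -> R onto P_S.  When q is a
    distribution, this reduces to the closed form q(x) + (1 - C)/|X_S| on X_S."""
    mask = lattice_mask(q.shape, S)
    p = np.zeros_like(q)
    p[mask] = project_to_simplex(q[mask])
    return p

def greedy_sparse_projection(q, k):
    """Sketch of Algorithm 2 (GSProj): greedy support selection for the D_k projection."""
    n, S = q.ndim, set()
    while len(S) < k:
        errs = {i: np.sum((project_onto_PS(q, S | {i}) - q) ** 2)
                for i in set(range(n)) - S}
        S.add(min(errs, key=errs.get))
    return project_onto_PS(q, S), S
```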
3.4 When Greedy is Good

We have shown in the proof of Theorem 2 that there always exist extreme examples that are hard to solve. Thus, in the most general sense, and without further assumptions, one can find pathological cases which make the problem hard. However, we find that the greedy approach works well empirically. In this section, we consider sufficient conditions for tractability of the problem.

Our conditions boil down to structural assumptions on $F[\cdot]$ that match standard assumptions in the literature. To build further intuition, consider line 4 in Algorithm 1, where the argument passed to the greedy method is $q(\cdot) = p(\cdot) - \mu \frac{\delta F}{\delta p}(\cdot)$, and $p(\cdot)$ is already a $k$-sparse distribution. Denote the support of $p(\cdot)$ as $\mathcal{S}$; we have $|\mathcal{S}| \le k$. Therefore, $q(\cdot)$ is close to $k$-sparse when the step size $\mu$ is small. Thus, while the general problem (2) may be much harder, there is reason to conjecture that under certain conditions a simple greedy algorithm performs well. Next, we state these assumptions formally.

Assumption 1 (Strong Convexity/Smoothness). The objective $F[\cdot]$ satisfies strong convexity/smoothness with respect to $\alpha$ and $\beta$ if:
$$\frac{\alpha}{2} \|p_1(\cdot) - p_2(\cdot)\|_2^2 \;\le\; F[p_1(\cdot)] - F[p_2(\cdot)] - \Big\langle \frac{\delta F}{\delta p_2}(\cdot),\; p_1(\cdot) - p_2(\cdot) \Big\rangle \;\le\; \frac{\beta}{2} \|p_1(\cdot) - p_2(\cdot)\|_2^2.$$
For the sake of simplicity in exposition, we have assumed strong convexity to hold over the entire domain (which can be a restrictive assumption). As will be clear from the proof analysis, this assumption can easily be tightened to a restricted strong convexity assumption; see, e.g., [30]. This detail is left for a longer version of this manuscript.

Assumption 2 (Lipschitz Condition). The functional $F : \mathcal{P} \to \mathbb{R}$ satisfies the Lipschitz condition with respect to $L$ in the $k$-sparse domain $\mathcal{D}_k$ if
$$|F[p_1(\cdot)] - F[p_2(\cdot)]| \le L \|p_1(\cdot) - p_2(\cdot)\|_2.$$
This assumption implies that the variational derivative is bounded on the $k$-sparse domain, i.e., $\|\frac{\delta F}{\delta p}(\cdot)\|_2 \le L$ over $\mathcal{D}_k$.

Using the strong convexity, smoothness, and Lipschitz assumptions, we are able to provide an analysis of when greedy works well. This is encapsulated in Theorem 3.

Theorem 3. Given the $n$-dimensional function $q(\cdot) = p(\cdot) - \mu \frac{\delta F}{\delta p}(\cdot)$, where $p(\cdot)$ is an $n$-dimensional $k$-sparse distribution and $\mathrm{supp}(p(\cdot)) = \mathcal{S}_0$, Algorithm 2 finds the optimal projection onto the domain $\mathcal{P}_{\mathcal{S}_0}$ if $F[\cdot]$ satisfies Assumption 2, $\mu$ is sufficiently small, and there are enough positions $\mathbf{x} \in \mathcal{X}_{\mathcal{S}_0}$ where $p(\mathbf{x}) > 0$, i.e., inequalities (6) and (9) in the supplementary material are satisfied.

3.5 Convergence Analysis

Next, we analyze the convergence of the overall Algorithm 1 with greedy projections. While Theorem 3 provides sufficient conditions for exact projection using the greedy approach, in practice, due to computational precision issues and/or violation of the stated assumptions, the solution may not be an exact projection. Thus, it is prudent to assume that the inner projection subproblem is solved only up to some approximation, as quantified in the following.

Definition 5 (Approximate 2-norm projection). We define $\hat{\Pi}_{\mathcal{D}_k}(\cdot)$ as the approximate projection onto the sparsity domain and distribution space, with approximation parameter $\phi$, as:
$$\big\|p(\cdot) - \hat{\Pi}_{\mathcal{D}_k}(p(\cdot))\big\|_2^2 \le (1 + \phi) \big\|p(\cdot) - \Pi_{\mathcal{D}_k}(p(\cdot))\big\|_2^2.$$

Next, we present our main convergence theorem.

Theorem 4. Suppose $F$ satisfies Assumptions 1 and 2. Furthermore, assume that the projection step in Algorithm 1 is solved $\phi$-approximately. Let the step size be $\mu = 1/\beta$, and $\|p_0(\cdot) - p^\star(\cdot)\|_2 \le L/(2\alpha)$. Then, if $\beta/\alpha \in \big(2 - \tfrac{1}{1+\phi},\, 2\big)$, IHT (Algorithm 1) with
$$T \ge \log_{1/\rho}\Big( \frac{F[p_0(\cdot)] - F[p^\star(\cdot)] - c}{\epsilon} \Big)$$
iterations achieves $F[p_T(\cdot)] \le F[p^\star(\cdot)] + c + \epsilon$, where $\rho = 1 - (1+\phi)(2 - \beta/\alpha)$ and
$$c = \frac{\big(\phi/(2\beta) + (1+\phi)(\beta - \alpha)/(2\alpha^2)\big) L^2}{(1+\phi)(2 - \beta/\alpha)}.$$
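To relate these assumptions to a concrete objective (an illustrative check, not part of the paper), note that for the squared 2-norm objective $F[p] = \|p(\cdot) - q(\cdot)\|_2^2$ used in Section 4, the identity $F[p_1] - F[p_2] - \langle \frac{\delta F}{\delta p_2}(\cdot), p_1 - p_2 \rangle = \|p_1 - p_2\|_2^2$ holds exactly, so Assumption 1 is met with $\alpha = \beta = 2$ and the step size of Theorem 4 becomes $\mu = 1/\beta = 1/2$. The snippet below verifies this numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
num_states = 64
q = rng.random(num_states); q /= q.sum()        # target distribution
p1 = rng.random(num_states); p1 /= p1.sum()
p2 = rng.random(num_states); p2 /= p2.sum()

F = lambda p: np.sum((p - q) ** 2)              # F[p] = ||p - q||_2^2
grad = lambda p: 2.0 * (p - q)                  # its variational derivative

gap = F(p1) - F(p2) - np.dot(grad(p2), p1 - p2)
dist_sq = np.sum((p1 - p2) ** 2)
# Assumption 1 with alpha = beta = 2: the gap equals (2/2) * ||p1 - p2||_2^2 exactly.
print(np.isclose(gap, dist_sq))                 # True
print("step size mu = 1/beta =", 1.0 / 2.0)
```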
4 Experiments

Algorithm 3 Greedy Selection
1: Input: $F[\cdot] : \mathcal{P} \to \mathbb{R}$, $k \in \mathbb{Z}_+$. Output: $p_T \in \mathcal{D}_k$
2: $\mathcal{S} := \emptyset$
3: while $|\mathcal{S}| < k$ do
4:   $j \in \arg\min_{i \in [n] \setminus \mathcal{S}} \{ \min_{p \in \mathcal{P}_{\mathcal{S} \cup i}} F[p(\cdot)] \}$
5:   $\mathcal{S} := \mathcal{S} \cup j$
6: end while
7: return $\arg\min_{p \in \mathcal{P}_\mathcal{S}} F[p(\cdot)]$

We evaluate our algorithm on different convex objectives, namely the 2-norm distance and the KL divergence. As mentioned before, there are no theoretically guaranteed algorithms for 2-norm distance minimization under a sparsity constraint. To investigate the optimality of the algorithms, we consider simulated experiments of sufficiently small size that the global optimum can be found by exhaustive enumeration.

IHT implementation details. For IHT, the step size is chosen by a simple strategy: given an initial step size, we double the step size when IHT is trapped in a local optimum, and return to the initial step size after escaping. We report results along the entire solution path.

Baseline: Forward Greedy Selection. Unfortunately, we are unaware of optimization algorithms for sparse probability estimation with general losses. As a simple baseline, we consider greedy selection with respect to the objective, given as Algorithm 3. For certain special cases, e.g., the KL objective, Algorithm 3 can be applied efficiently and is effective in practice [5].

4.1 Simulated Data

We set the dimension $n = 15$, the number of entries $m = 2$, and the sparsity level $k = 7$. That is, $\mathcal{X} = \{0, 1\}^{15}$ is a 15-dimensional binary vector space, with cardinality $|\mathcal{X}| = 2^{15} = 32768$. The distribution $p : \mathcal{X} \to [0, 1]$ satisfies $\sum_{\mathbf{x} \in \mathcal{X}} p(\mathbf{x}) = 1$. The sparsity constraint is designed to fix a support $\mathcal{S} : |\mathcal{S}| \le 7$, such that any $\mathbf{x}$ with $p(\mathbf{x}) > 0$ has $\mathrm{supp}(\mathbf{x}) \subseteq \mathcal{S}$. Thus, finding the optimal solution requires enumerating $\binom{15}{7} = 6435$ possible supports.

The 2-norm minimization objective is $F[p(\cdot)] = \|p(\cdot) - q(\cdot)\|_2^2$, where $q(\cdot)$ is a distribution generated by randomly choosing 50 positions $\mathbf{x}_1, \dots, \mathbf{x}_{50} \in \mathcal{X}$ and assigning them random real numbers $c_1, \dots, c_{50}$ with $\sum_{i=1}^{50} c_i = 1$, while the other positions are assigned 0; i.e., $q(\mathbf{x}_i) = c_i$ for $i \in [50]$, and $q(\mathbf{x}) = 0$ otherwise. The initial step size is $\mu = 0.008$. Results are shown in Figure 2(a). The KL divergence objective is $F[p(\cdot)] = \mathrm{KL}(p(\cdot)\,\|\,q(\cdot)) = \sum_{\mathbf{x} \in \mathcal{X}} p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})}$, where $q(\cdot)$ is a random distribution generated similarly to the $q(\cdot)$ in the 2-norm objective. The only difference is that $q(\mathbf{x})$ cannot be zero, as that would render the KL divergence undefined.

For the simulated experiments, we use the optimum to normalize the objective function as $\tilde{F}[p] = F[p] - F[p^\star]$, so that at the optimum $\tilde{F}[p^\star] = 0$. Three algorithms are compared in each experiment: IHT, Greedy, and IHT after Greedy. While IHT starts from a random initialization, IHT after Greedy is initialized with the result of Greedy. In each run, the distribution $q(\cdot)$ and the starting distribution $p_0(\cdot)$ for IHT are randomly generated. Each experiment is run 20 times. Results are presented in Figure 2, showing the mean and standard deviation of Greedy and IHT after Greedy; the standard deviation of IHT is similar to that of IHT after Greedy.

Figure 2: Simulated experiments, plotting $F[p] - F[p^\star]$ against iteration for IHT, Greedy, and IHT after Greedy. (a) Normalized 2-norm minimization. (b) Normalized KL minimization.

We use the 2-norm greedy projection within IHT in both experiments. Interestingly, this not only outperforms the 2-norm greedy projection itself (Figure 2(a)), but also outperforms Greedy on the KL objective (Figure 2(b)), where [5] suggests provably good performance. In particular, while the performance of Greedy can fluctuate severely, IHT (after Greedy) is stable in obtaining good results. Note that low variance is especially desirable when the algorithm is applied only a few times to save computation, as in large discrete optimization problems.
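For concreteness, the following sketch reproduces the flavor of the simulated 2-norm setup described above (random seed, helper names, and the reuse of the distribution_iht, project_onto_PS, and greedy_sparse_projection sketches from earlier sections are assumptions, not the paper's actual experiment code).

```python
import numpy as np

n, m, k = 15, 2, 7
rng = np.random.default_rng(0)

# Target q: 50 random atoms of X = {0,1}^15 with random weights summing to 1.
q = np.zeros((m,) * n)
atoms = rng.choice(m ** n, size=50, replace=False)
weights = rng.random(50); weights /= weights.sum()
q.flat[atoms] = weights

# 2-norm objective F[p] = ||p - q||_2^2 and its variational derivative.
F = lambda p: np.sum((p - q) ** 2)
grad_F = lambda p: 2.0 * (p - q)

# Random k-sparse initialization: project the uniform table onto a random support S0.
S0 = set(rng.choice(n, size=k, replace=False).tolist())
p0 = project_onto_PS(np.full((m,) * n, 1.0 / m ** n), S0)

project = lambda v, k: greedy_sparse_projection(v, k)[0]
p_hat = distribution_iht(grad_F, project, p0, k, step_size=0.008, num_iters=400)
print("final objective:", F(p_hat))
```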
4.2 Benchmark Data

Distribution compression / Compressed sensing. We apply our IHT to the task of expectation-preserving distribution compression, useful for efficiently storing large probability tables. Given a distribution $p(\cdot)$, our goal is to construct a sparse approximation $q(\cdot)$, such that $q(\cdot)$ approximately preserves expectations with respect to $p(\cdot)$. Interestingly, this model compression problem is equivalent to compressed sensing, but with the distributional constraints. Specifically, our goal is to find $q$ which minimizes $\|Aq - Ap\|_2^2$ subject to a $k$-sparsity constraint on $q$. The model is evaluated with respect to moment reconstruction $\|Bq - Bp\|_2^2$ for a new "sensing" matrix $B$. Our experiments use real data from the Texas hospital discharge public use dataset. IHT is compared to post-processed Lasso and Random: Lasso ignores the simplex constraints during optimization and then projects the results onto the simplex, while Random is a naive baseline of random distributions. Figure 3(a) shows that IHT significantly outperforms the baselines. Additional details are provided in Appendix H due to limited space.

Dataset compression. We study representative prototype selection for the Digits data [31]. Prototypes are representative examples chosen from the data in order to achieve dataset compression. Our optimization objective is the Maximum Mean Discrepancy (MMD) between the discrete data distribution and the sparse data distribution representing the selected samples. We evaluate performance using the prototype nearest-neighbor classification error on a test dataset. We compare against two forward-selection greedy variants (Local Greedy and Global Greedy) proposed by [32] and the prototype selection algorithm (labeled PS) proposed by [33], both state of the art. The results are presented in Figure 3(b), showing that IHT outperforms all baselines. Additional experimental details are provided in Appendix H due to limited space.

Figure 3: (a) Compression / compressed sensing: test error at varying sparsity $k$, comparing IHT, Lasso, and Random. (b) Dataset compression: test classification error of the prototype nearest-neighbor classifier at varying numbers of selected prototypes, comparing IHT, PS, Global Greedy, and Local Greedy.
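To make the compressed-sensing formulation above concrete, here is a hedged sketch: the sensing matrix is random Gaussian and the probability table is a synthetic stand-in (not the Texas hospital data), and it reuses the distribution_iht and project_to_simplex sketches from Section 3. The objective $F[q] = \|Aq - Ap\|_2^2$ over the flattened table has variational derivative $2A^\top A(q - p)$; since sparsity here is over the states of $q$ itself (the vector-sparsity setting), the projection keeps the $k$ largest entries and renormalizes via a simplex projection, in the spirit of [29].

```python
import numpy as np

rng = np.random.default_rng(2)
num_states, num_moments, k = 512, 64, 32    # |X|, rows of A, target sparsity

# Synthetic stand-ins: a dense probability table p and a random sensing matrix A.
p = rng.random(num_states); p /= p.sum()
A = rng.standard_normal((num_moments, num_states))

# Expectation-preserving compression objective and its variational derivative.
F = lambda q: np.sum((A @ q - A @ p) ** 2)
grad_F = lambda q: 2.0 * A.T @ (A @ q - A @ p)

def project_topk_simplex(v, k):
    """Keep the k largest entries of v and project them onto the simplex."""
    idx = np.argsort(v)[-k:]
    q = np.zeros_like(v)
    q[idx] = project_to_simplex(v[idx])     # from the GSProj sketch in Section 3.3
    return q

beta = 2.0 * np.linalg.norm(A, 2) ** 2      # smoothness constant of F
q0 = project_topk_simplex(p, k)
q_hat = distribution_iht(grad_F, project_topk_simplex, q0,
                         k, step_size=1.0 / beta, num_iters=200)
print("moment error:", F(q_hat))
```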
5 Conclusion and Future Work

In this work, we proposed the use of IHT for learning discrete sparse distributions. We study several theoretical properties of the algorithm from an optimization viewpoint, and propose practical solutions to otherwise hard problems. There are several possible future directions of research. We have analyzed discrete distributions with sparsity constraints; the obvious extensions are to the space of continuous measures and to structured sparsity constraints. Is there a bigger class of constraints for which a tractable projection algorithm exists? Can we improve the sufficient conditions under which projections are provably close to the optimum projection? Finally, more in-depth empirical studies against other state-of-the-art algorithms should be very interesting and useful to the community.

References

[1] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023-1032, 1988.
[2] Hemant Ishwaran and J. Sunil Rao. Spike and slab variable selection: Frequentist and Bayesian strategies. Annals of Statistics, 33(2):730-773, 2005.
[3] Edward I. George and Robert E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881-889, 1993.
[4] Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681-686, 2008.
[5] Oluwasanmi O. Koyejo, Rajiv Khanna, Joydeep Ghosh, and Russell Poldrack. On prior distributions and approximate inference for structured variables. In Advances in Neural Information Processing Systems, pages 676-684, 2014.
[6] Rajiv Khanna, Joydeep Ghosh, Russell Poldrack, and Oluwasanmi Koyejo. Sparse submodular probabilistic PCA. In Artificial Intelligence and Statistics, pages 453-461, 2015.
[7] Rajiv Khanna, Joydeep Ghosh, Russell Poldrack, and Oluwasanmi Koyejo. Information projection and approximate inference for structured sparse variables. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1358-1366. PMLR, 2017.
[8] Carlos M. Carvalho, Nicholas G. Polson, and James G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465-480, 2010.
[9] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Proceedings of the 31st Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 2093-3027. PMLR, 2018.
[10] Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, and Gunnar Rätsch. Boosting variational inference: an optimization perspective. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 464-472. PMLR, 2018.
[11] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.
[12] Ferenc Huszár and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, pages 377-386, 2012.
[13] Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, and Gunnar Rätsch. Boosting black box variational inference. In Advances in Neural Information Processing Systems 31, pages 3401-3411. Curran Associates, Inc., 2018.
[14] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. CoRR, 2008.
[15] Anastasios Kyrillidis and Volkan Cevher. Recipes on hard thresholding methods. In 2011 4th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 353-356. IEEE, 2011.
[16] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301-321, 2009.
[17] Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230-2249, 2009.
[18] Joel A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231-2242, 2004.
[19] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 685-693, 2014.
[20] Rajiv Khanna and Anastasios Kyrillidis. IHT dies hard: Provable accelerated iterative hard thresholding. In AISTATS, 2018.
[21] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.
[22] Robert Tibshirani, Martin Wainwright, and Trevor Hastie. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.
[23] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for Machine Learning. MIT Press, 2012.
[24] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. Journal of Machine Learning Research, 12:3371-3412, 2011.
[25] Eberhard Engel and Reiner M. Dreizler. Density Functional Theory. Springer, 2011.
[26] Jon Kleinberg and Éva Tardos. Algorithm Design. Pearson Education India, 2006.
[27] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272-279. ACM, 2008.
[28] Anastasios Kyrillidis and Volkan Cevher. Matrix recipes for hard thresholding methods. Journal of Mathematical Imaging and Vision, 48(2):235-265, 2014.
[29] Anastasios Kyrillidis, Stephen Becker, Volkan Cevher, and Christoph Koch. Sparse projections onto the simplex. In International Conference on Machine Learning, pages 235-243, 2013.
[30] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37-45, 2010.
[31] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554, 1994.
[32] Been Kim, Rajiv Khanna, and Oluwasanmi O. Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems 29, pages 2280-2288. Curran Associates, Inc., 2016.
[33] Jacob Bien and Robert Tibshirani. Prototype selection for interpretable classification. Annals of Applied Statistics, 5(4):2403-2424, 2011.