# learning_small_predictors__134a2d6f.pdf

Learning SMa LL Predictors

Vikas K. Garg CSAIL, MIT vgarg@csail.mit.edu

Ofer Dekel Microsoft Research oferd@microsoft.com

Lin Xiao Microsoft Research lin.xiao@microsoft.com

We introduce a new framework for learning in severely resource-constrained settings. Our technique delicately amalgamates the representational richness of multiple linear predictors with the sparsity of Boolean relaxations, and thereby yields classiﬁers that are compact, interpretable, and accurate. We provide a rigorous formalism of the learning problem, and establish fast convergence of the ensuing algorithm via relaxation to a minimax saddle point objective. We supplement the theoretical foundations of our work with an extensive empirical evaluation.

1 Introduction

Modern advances in machine learning have produced models that achieve unprecedented accuracy on standard prediction tasks. However, this remarkable progress in model accuracy has come at a signiﬁcant cost. Many state-of-the-art models have ballooned in size and applying them to a new point can require tens of GFLOPs, which renders these methods ineffectual on resource-constrained platforms like smart phones and wearables [1, 2]. Indeed, in these settings, inference with a compact learner that can ﬁt on the small device becomes an overarching determinant even if it comes at the expense of slightly worse accuracy. Moreover, large models are often difﬁcult to interpret, simply because humans are not good at reasoning about large, complex objects. Modern machine learning models are also more costly to train, but we sidestep that problem in this paper by assuming that we can train our models on powerful servers in the cloud.

In our pursuit of compact and interpretable models, we take inspiration from the classic problem of learning disjunctive normal forms (DNFs) [3]. Speciﬁcally, a p-term k-DNF is a DNF with p terms, where each term contains exactly k Boolean variables. Small DNFs are a natural starting point for our research, because they pack a powerful nonlinear descriptive capacity in a succinct form. The DNF structure is also known to be intuitive and interpretable by humans [4, 5]. However, with the exception of a few practical heuristics [4, 5, 6], an overwhelming body of work [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17] theoretically characterizes the difﬁculty of learning a k-DNF under various restricted models of learning. Our method, Sparse Multiprototype Linear Learner (SMa LL), bypasses this issue by crafting a continuous relaxation that amounts to a form of improper learning of the k-DNFs in the sense that the hypothesis space subsumes p-term k-DNF classiﬁers, and thus is at least as powerful as the original k-DNF family. Armed with our technical paraphernalia, we design a practical algorithm that yields small and interpretable models.

Our work may also be viewed as a delicate fusion of multiple prototypes [1, 18, 19, 20, 21, 22] with Boolean relaxations [23]. The richness of models with multiple prototypes overcomes the representational limitations of sparse linear models like Lasso and Elastic-Net [24, 25, 26] that are typically not expressive enough to achieve state-of-the-art accuracy. Boolean relaxations afford us the ability to control the degree of sparsity explicitly in our predictors akin to exploiting an 0 regularization, unlike the 1 based methods that may require extensive tuning. Thus, our approach harnesses the best of both worlds. Moreover, folding sparsity in the training objective obviates the costs that would otherwise be incurred in compressing a large model via methods like pruning [27, 28, 29], low-rank approximation [30, 31], hashing [32], or parameter quantization [27, 33].

32nd Conference on Neural Information Processing Systems (Neur IPS 2018), Montréal, Canada.

Additionally, we overcome some signiﬁcant limitations of other methods that use a small number of prototypes, such as [1, 19, 34]. These techniques invariably require solving highly non-convex or combinatorially hard mixed integer optimization problems, which makes it difﬁcult to guarantee their convergence and optimality. We derive a minimax saddle-point relaxation that provably admits O.1=t/ convergence via our customized Mirror-Prox algorithm. We provide detailed empirical results that demonstrate the beneﬁts of our approach on a large variety of Open ML datasets. Speciﬁcally, on many of these datasets, our algorithm either surpasses the accuracy of the state-of-the-art baselines, or provides more compact models while being competent in terms of accuracy.

In Section 2, we formulate the problem of learning a k-sparse p-prototype linear predictor as a mixed integer nonlinear optimization problem. Then, in Section 3, we relax this optimization problem to a saddle-point problem, which we solve using a Mirror-Prox algorithm. Finally, we present empirical results in Section 4. All the proofs are provided in the Supplementary to keep the exposition focused.

2 Problem Formulation

We ﬁrst derive a convex loss function for multiprototype binary classiﬁcation. Let f.xi; yi/gm i D1 be a training set of instance-label pairs, where each xi 2 Rn and each yi 2 f 1; 1g. Let W R 7! R be a convex surrogate for the error indicator function 1f .xi/ yi D 1 if yif .xi/ < 0 and 0 otherwise. We also assume that upper bounds the error indicator function and is monotonically non-increasing. In particular, the popular hinge-loss and log-loss functions satisfy these properties.

Let fwj gp j D1 be a set of linear prototypes. We consider a binary classiﬁer of the form

f .x/ D sign max j 2Œp wj x :

Our decision rule is motivated by the following result.

Proposition 1. Consider the class Ck D f.w1; w2; : : : ; wp/j8j 2 Œp ; wj 2 Rn; jjwj jj0 D kg of p prototypes, where each prototype is k-sparse for k 0. For any x 2 Rn, let the predictors f D .w1; w2; : : : ; wp/ 2 Ck take the following form:

f .x/ D 1 if max j2Œp wj x k; and 1 otherwise:

Learning Ck amounts to improper learning of p-term k-DNF Boolean formulae.

Thus, our search space contains the family of k-DNF classiﬁers, though owing to the hardness of learning k-DNF, we may not always ﬁnd a k-DNF classiﬁer. Nonetheless, due to improper learning, the value of the objective returned will be a lower bound on the cost objective achieved by the space of p-term k-DNF classiﬁers (much like the relation between an integer program and its relaxation).

We handle the negative and positive examples separately. For each negative training example .xi; 1/, the classiﬁer makes a correct prediction if and only if maxj 2Œp wj xi < 0. Under our assumptions on , the error indicator function can be upper bounded as

1f .xi/ 1 max j 2Œp wj xi D max j 2Œp . wj xi/;

where the equality holds because we assume that is monotonically non-increasing. We note that the upper bound maxj 2Œp . wj xi/ is jointly convex in fwj gp j D1 [35, Section 3.2.3].

For each positive example .xi; C1/, the classiﬁer makes a correct prediction if and only if maxj2Œp wj xi > 0. By our assumptions on , we have

1f .xi/ C1 max j 2Œp wj xi D min j 2Œp .wj xi/: (1)

Again, the equality above is due to the monotonic non-increasing property of . Here the right-hand side minj2Œp .wj xi/ is not convex in fwj gp j D1. We resolve this by designating a dedicated prototype wj.i/ for each positive training example .xi; C1/, and using the upper bound

1f .xi/ C1 .wj.i/ xi/:

In the extreme case, we can associate each positive example with a distinct prototype. Then there will be no loss of using .wj.i/ xi/ compared with the upper bound in (1) when we set j.i/ D arg maxj 2Œp wj xi. However, in this case, the number of prototypes p is equal to the number of positive examples, which can be excessively large for storage and computation as well as cause overﬁtting. In practice, we may cluster the positive examples into p groups, where p is much smaller than the number of positive examples, and assign all positive examples in each group with a common prototype. In other words, we have j.i/ D j.k/ if xi and xk belong to the same cluster. This clustering step helps us provide a fast parametric alternative to the essentially non-parametric setting that assumes one prototype per positive example.

Overall, we have the following convex surrogate for the total number of training errors:

h.w1; : : : ; wp/ D X

i2IC wj.i/ xi C X

i2I max j 2Œp . wj xi/; (2)

where IC D fi W yi D C1g and I D fi W yi D 1g. In the rest of this paper, we let W 2 Rp n be the matrix formed by stacking the vectors w T 1 ; : : : ; w T p vertically, and denote the above loss function by h.W /. In order to train a multi-prototype classiﬁer, we minimize the regularized surrogate loss:

min W 2Rp n 1 mh.W / C

2 k W k2 F ; (3)

where k k F denotes the Frobenius norm of a matrix.

2.1 Smoothing the Loss via Soft-Max

In this paper, we focus on the log-loss .z/ D log.1Cexp. z//. Although this is a smooth function, the overall loss h deﬁned in (2) is non-smooth, due to the max operator in the sum over the set I . In order to take advantage of fast algorithms for smooth convex optimization, we smooth the loss function using soft-max. More speciﬁcally, we replace the non-smooth terms maxj 2Œp .tj / in (2) with the soft-max operator over p items:

u.t/ log 1 C X

j2Œp exp. tj / ; (4)

where t D .t1; : : : ; tp/ 2 Rp. Then we obtain the smoothed loss function

i2IC wj.i/ xi C X

i2I u.W xi/; (5)

around which we will customize our algorithm design. Next, we incorporate sparsity constraints explicitly for the prototypes w1; : : : ; wp.

2.2 Incorporating Sparsity via Binary Variables

With some abuse of notation, we let kwj k0 denote the number of non-zero entries of the vector wj , and deﬁne k W k0;1 max j2Œp kwj k0:

The requirement that each prototype be k-sparse translates into the constraint k W k0;1 k. Therefore the problem of training a SMa LL model with budget k (for each prototype) can be formulated as

min W 2Rp n jj W jj0;1 k

1 m Qh.W / C

2 jj W jj2 F ; (6)

where Qh is deﬁned in (5). This is a very hard optimization problem due to the nonconvex sparsity constraint. In order to derive a convex relaxation, we follow the approach of [23] (cf. [36]) to introduce a binary matrix 2 f0; 1gp n and rewrite (6) as

min W 2Rp n 2f0;1gp n; k k1;1 k

1 m Qh.W ˇ / C

2 jj W ˇ jj2 F ;

where ˇ denotes the Hadamard (i.e. entry-wise) product of two matrices. Here we have

k k1;1 D max j2Œp k j k1 ;

where j is the j th row of . Since all entries of belong to f0; 1g, the constraint k k1;1 k is the same as k k0;1 k. Noting that we can take Wij D 0 when ij D 0 and vice-versa, this problem is equivalent to

min W 2Rp n 2f0;1gp n; jj jj1;1 k

1 m Qh.W ˇ / C

2 jj W jj2 F : (7)

Using (5), the objective function can be written as 1 m

i2IC .W ˇ /j.i/xi C X

i2I u .W ˇ /xi C

2 k W k2 F ;

where .W ˇ /j.i/ denotes the j.i/th row of W ˇ .

So far our transformations have not changed the nature of the optimization problem with sparsity constraints it is still a hard mixed-integer nonlinear optimization problem. However, as we will show in the next section, the introduction of the binary matrix allows us to derive a saddle-point formulation of problem (7), which in turn admits a convex-concave relaxation that can be solved efﬁciently by the Mirror-Prox algorithm [37, 38].

3 Saddle-Point Relaxation

We ﬁrst show that the problem in (7) is equivalent to the following minimax saddle-point problem:

min W 2Rp n 2f0;1gp n; jj jj1;1 k max SDŒs1 sm si2Si; i2Œm ˆ.W; ; S/; (8)

where S 2 Rp m, each of its column si belongs to a set Si Rp (which will be given in Proposition 2), and the function ˆ is deﬁned as

ˆ.W; ; S/D 1

yis T i .W ˇ /xi u?.si/ C

2 jj W jj2 F :

In the above deﬁnition, u? is the convex conjugate of u deﬁned in (4):

u?.si/ D sup t2Rp s T i t u.t/ (9)

D Pp j D1. si;j / log. si;j / C .1C1T si/ log.1C1T si/; if si;j 0 8j 2 Œp and 1T si 1; 1; otherwise :

The equivalence between (7) and (8) is a direct consequence of the following proposition. Proposition 2. Let .z/ D log.1 C exp. z// where z 2 R and u.t/ D log 1 C P j 2Œp exp. tj /

where t 2 Rp. Then for i 2 IC, we have

.W ˇ /j.i/xi D max si2Si

yis T i .W ˇ /xi u?.si/ ;

where yi D C1 and Si D si 2 Rp W si;j.i/ 2Œ 1; 0 ; si;j D0 8j j.i/ : For i 2 I , we have

u .W ˇ /xi D max si2Si

yis T i .W ˇ /xi u?.si/ ;

where yi D 1 and Si D si 2 Rp W 1T s 1; si;j 0 8j 2 Œp :

We can further eliminate the variable W in (8). This is facilitated by the following result. Proposition 3. For any given 2 f0; 1gp n and S 2 S1 Sm, the solution to

min W 2Rp n ˆ.W; ; S/

is unique and given by

W. ; S/ D 1

i2Œm yi six T i ˇ : (10)

Figure 1: Decision surfaces of different classiﬁer types on a run of the two-dimensional chscase funds toy dataset. Test classiﬁcation accuracy is shown at the bottom right of each plot.

Now we substitute W. ; S/ into (8) to obtain

min 2f0;1gp n k k1;1 k

max SDŒs1 sm si2Si; i2Œm . ; S/ ; (11)

i2Œm yi six T i ˇ

i2Œm u?.si/ :

We note that . ; S/ is concave in S (which is to be maximized), but not convex in (which is to be minimized). However, because 2 f0; 1gp n, we have ˇ D and thus X

i2Œm yi six T i ˇ

i2Œm yis T i

i2Œm yi six T i ˇ ˇ xi D X

i2Œm yis T i

i2Œm yi six T i ˇ xi:

Therefore the objective function in (11) can be written as

i2Œm yis T i

i2Œm yi six T i ˇ xi X

i2Œm u?.si/; (12)

which is concave in S and linear (thus convex) in .

Finally, we relax the integrality constraint on to its convex hull, i.e., 2 Œ0; 1 p n, and consider

min 2Œ0;1 p n k k1;1 k

max SDŒs1 sm si2Si; i2Œm . ; S/; (13)

where . ; S/ is given in (12). This is a convex-concave saddle-point problem, which can be solved efﬁciently, for example, by the Mirror-Prox algorithm [37, 38].

After ﬁnding a solution . ; S/ of the relaxed problem (13), we can round the entries of to f0; 1g, while respecting the constraint k k1;1 k (e.g., by rounding the largest k entries of each row to 1 and the rest entries to 0, or randomized rounding). Then we can recover the prototypes using (10).

3.1 The Mirror-Prox Algorithm

Algorithm 1 lists the Mirror-Prox algorithm customized for solving the convex-concave saddle-point problem (13), which enjoys a O.1=t/ convergence rate [37, 38].

Table 1: Comparison of test accuracy on low dimensional (n < 20) Open ML datasets. K, in SMa LL, was set to n for these datasets.

LSVM RF AB LR DT k NN RSVM GB GP SMa LL bankruptcy .84 .07 .83 .08 .82 .05 .90 .05 .80 .05 .78 .07 .89 .06 .81 .05 .90 .05 .92 .06

vineyard .79 .10 .72 .06 .68 .04 .82 .08 .69 .13 .70 .11 .82 .07 .68 .09 .71 .12 .83 .07 sleuth1714 .82 .03 .82 .04 .81 .14 .83 .04 .83 .06 .82 .04 .76 .03 .82 .06 .80 .03 .83 .05 sleuth1605 .66 .09 .70 .07 .64 .08 .70 .07 .63 .09 .66 .05 .65 .09 .65 .09 .72 .07 .72 .05 sleuth1201 .94 .05 .94 .03 .92 .05 .93 .03 .91 .05 .90 .04 .89 .09 .88 .06 .91 .08 .94 .05 rabe266 .93 .04 .90 .03 .91 .04 .92 .04 .91 .03 .92 .03 .93 .04 .90 .04 .95 .04 .94 .02 rabe148 .95 .04 .93 .04 .91 .08 .95 .04 .89 .07 .92 .05 .91 .06 .91 .08 .95 .02 .96 .04 vis_env .66 .04 .68 .05 .66 .03 .65 .08 .62 .04 .57 .03 .69 .06 .64 .03 .65 .09 .69 .03 hutsof99 .74 .07 .66 .04 .64 .09 .73 .07 .60 .10 .66 .11 .66 .14 .67 .05 .70 .05 .75 .04 human_dev .88 .03 .85 .04 .85 .03 .89 .04 .85 .03 .87 .03 .88 .03 .86 .03 .88 .02 .89 .04

c0_100_10 .77 .04 .74 .03 .76 .03 .77 .03 .64 .07 .71 .05 .79 .03 .71 .05 .78 .01 .77 .06 elusage .90 .05 .84 .06 .84 .06 .89 .04 .84 .06 .87 .05 .89 .04 .84 .06 .89 .04 .92 .04 diggle_table .65 .14 .61 .07 .57 .08 .65 .11 .60 .09 .58 .07 .57 .13 .57 .06 .60 .13 .68 .07

baskball .70 .02 .68 .04 .68 .02 .71 .03 .71 .03 .63 .02 .66 .05 .69 .04 .68 .02 .72 .06 michiganacc .72 .06 .67 .06 .71 .05 .71 .04 .67 .06 .66 .07 .71 .05 .69 .04 .71 .05 .73 .05 election2000 .92 .04 .90 .04 .91 .03 .92 .02 .91 .03 .92 .01 .90 .07 .92 .02 .92 .03 .94 .02

Algorithm 1 Customized Mirror-Prox algorithm for solving the saddle-point problem (13)

Initialize .0/ and S.0/ for t D 0; 1; : : : ; T do

Gradient step:

O .t/ D Proj E .t/ tr . .t/; S.t//

Os.t/ i D Proj Si s.t/ i C ˇtrsi . .t/; S.t//

for all i 2 Œm Extra-gradient step:

.t C1/ D Proj E .t/ tr .O .t/; OS.t//

s.t C1/ i D Proj Si s.t/ i C ˇtrsi .O .t/; OS.t//

for all i 2 Œm end for

O D PT t D1 t O .t/ı PT t D1 t OS D PT t D1 ˇt OS.t/ı PT t D1 ˇt Round O to f0; 1gp n

m P i2Œm yi.Osix T i / ˇ O

Algorithm 2 .Proj E/ Projection onto the set Ej j 2 Rn W ji 2 Œ0; 1 ; k j k1 k

Input: j 2 Rn and a small tolerance tol.

Clip j;i to Œ0; 1 for all i 2 Œn Return j if 1T j k

Binary search to ﬁnd tol-solution Set low D 1T j k =n Set high D maxi2Œn j;i k=n while low high do

Set D .low C high/=2 Compute O j W 8i 2 Œn ; O j;i D j;i Clip O j to Œ0; 1 n

if j1T O j kj < tol then

return O j else if 1T O j > k then

Set low D .low C high/=2 else

Set high D .low C high/=2 end if end while

In order to use Algorithm 1, we need to ﬁnd the partial gradients of . ; S/, which are given as

r . ; S/ D 1

i2Œm yi six T i ˇ X

i2Œm yi six T i ;

rsi . ; S/ D 1

i2Œm yi six T i ˇ xi; i 2 Œm :

There are two projection operators in Algorithm 1. The ﬁrst one projects some 2 Rp n onto

E f 2 Rp n W 2 Œ0; 1 p n; k k1;1 kg :

This can be done efﬁciently by Algorithm 2. Essentially, we perform p independent projections, each for one row of using a bi-section type of algorithm [39, 40, 41]. We have the following result.

Figure 2: SMa LL applied to the Breast Cancer dataset with k D 3 and p D 2. The blue and orange dots represent the test instances from the two classes. The plots show the kernel density estimates and the actual values of the non-zero features in each prototype, as well at the ﬁnal predictor result.

Proposition 4. Algorithm 2 computes, up to a speciﬁed tolerance tol, the projection of any 2 Rp n onto E in O .log2.1=tol// time, where tol is the input precision for bisection.

There are two cases for the projection of si 2 Rp onto the set Si. For i 2 IC, we only need to project si;j.i/ onto the interval Œ 1; 0 and set si;j D 0 for all j j.i/. For i 2 I , the projection algorithm is similar to Algorithm 2, and we omit the details here. The step sizes t and ˇt can be set according to the guidelines described in [37, 38], based on the smoothness properties of the function . ; S/. In practice, we follow the adaptive tuning procedure developed in [42].

4 Experiments

We demonstrate the merits of SMa LL via an extensive set of experiments. We start with an intuition into how the class of sparse multiprototype linear predictors differs from standard model classes.

Figure 1 is a visualization of the decision surface of different types of classiﬁers on the 2-dimensional chscase funds toy dataset, obtained from Open ML. The two classes are shown in red and blue, with training data in solid shade and test data in translucent shade. The color of each band indicates the gradation in the conﬁdence of prediction - each classiﬁer is more conﬁdent in the darker regions and less conﬁdent in the lighter regions. The 2-prototype linear predictor attains the best test accuracy on this toy problem (0:73). Note that some of the examples are highlighted by a black rectangle - the linear classiﬁers (logistic regression and linear SVM) could not distinguish between these examples, whereas the 2-prototype linear predictor was able to segregate and assign them to different bands.

4.1 Low-dimensional Datasets Without Sparsity

We now compare the accuracy of SMa LL with k D n (no sparsity) to the accuracy of other standard classiﬁcation algorithms, on several low-dimensional (n 20) binary classiﬁcation datasets from the Open ML repository. We experimented with Open ML data for two main reasons: (a) it contains many preprocessed binary datasets, and (b) the datasets come from diverse domains. The methods that we compare against are linear SVM (LSVM), SVM with non-linear kernels such as radial

Figure 3: Comparison on high dimensional (n >D 50) Open ML data from the Fri series. Each stacked bar shows average test accuracy on left, and the total number of selected features on right.

basis function, polynomial, and sigmoid (RSVM), Logistic Regression (LR), Decision Trees (DT), Random Forest (RF), k-Nearest Neighbor (k NN), Gaussian Process (GP), Gradient Boosting (GB), and Ada Boost (AB). All the datasets were normalized to make each feature have zero mean and unit variance. Since the datasets do not specify separate train, validation, and test sets, we measure test accuracy by averaging over ﬁve random train-test splits. Since we are interested in extreme sparsity, we pre-clustered the positive examples into p D 2 clusters, and initialized the prototypes with the cluster centers. We determined hyperparameters by 5-fold cross-validation. The coefﬁcient of the error term C in LSVM and 2-regularized LR was selected from f0:1; 1; 10; 100g. In the case of RSVM, we also added 0:01 to the search set for C, and chose the best kernel between a radial basis function (RBF), polynomials of degree 2 and 3, and sigmoid. For the ensemble methods (RF, AB, GB), the number of base predictors was selected from the set f10; 20; 50g. The maximum number of features for RF estimators was optimized over the square root and the log selection criteria. We also found best validation parameters for DT (gini or entropy for attribute selection), k NN (1, 3, 5 or 7 neighbors), and GP (RBF kernel scaled with scaled by a coefﬁcient in the set f0:1; 1:0; 5g and dot product kernel with inhomogeneity parameter set to 1). Finally, for our method SMa LL, we ﬁxed D 0:1 and t D 0:01, and searched over ˇt D ˇ 2 f0:01; 0:001g.

Table 1 shows the test accuracy for the different algorithms on different datasets. As seen from the table, SMa LL with k D n generally performed extremely well on most of these datasets. This substantiates the practicality of SMa LL in the low dimensional regime.

4.2 Higher-dimensional Datasets with Sparsity

We now describe results with higher dimensional data, where feature selection becomes especially critical. To substantiate our claim that SMa LL produces an interpretable model, we ran SMa LL on the Breast Cancer dataset with k D 3 and p D 2 (two prototypes, three non-zero elements in each). Figure 2 shows the kernel density estimates and the actual values of the selected features in each prototype, and the summary of our predictor. Note that the feature perimeter_worst appears in both prototypes. As the rightmost plot shows, the predictor output provides a good separation of the test data, and SMa LL registered a test accuracy of over 94%. It is straightforward to understand how the resulting classiﬁer reaches its decisions: which features it relies on and how those features interact.

Next, we compare SMa LL with 8 other methods. Six of these methods induce sparsity by minimizing an 1-regularized loss function. These methods minimize one of the three empirical loss functions (hinge loss, log loss, and the binary-classiﬁcation Huber loss), regularized by either an 1 or an elastic net penalty (i.e. 1 and 2). We refer to these as L1Hi ( 1, hinge), L1L ( 1, log), L1Hu ( 1, Huber), En Hi (elastic net, hinge), ENL (elastic net, log) and ENHu (elastic net, huber). We also compare

Figure 4: The big picture. The plot depicts the performance of SMa LL compared to both the standard classiﬁcation algorithms and the sparse baselines on the fri_c0_1000_50 dataset. The number atop each bar is the average number of features selected by that algorithm across 5 runs.

with two state-of-the-art methods for the scarce-resource setting. Proto NN [1] is a modern take on nearest neighbor classiﬁers, while Bonsai [2] is a sophisticated enhancement of a small decision tree. Note that while we can explicitly control the amount of sparsity in SMa LL, Proto NN, and Bonsai, the methods that use 1 or elastic net regularization do not have this ﬂexibility. Therefore, in order to get the different baselines on the same footing, we devised the following empirical methodology. We speciﬁed p k D 6 features as the desired sparsity, and modulated each linear baseline to yield nearly these many features. We trained each of the linear baselines by setting a high value of the 1 coefﬁcient and selected the features with the largest absolute values. Then, we retrained the classiﬁer using only the selected features, using the same loss (hinge, loss, or log) and an 2 regularization. Our procedure ensured that each baseline beneﬁted, in effect, from an elastic net-like regularization while having the most important features at its disposal. For the SMa LL classiﬁer, we ﬁxed k D 3 and p D 2. In practice, this setting will be application speciﬁc (e.g., it would likely depend on the budget). As before, since the original dataset did not specify a train-test split, our results were averaged over ﬁve random splits. The parameters for each method were tuned using 5-fold cross-validation. We ﬁxed D 0:1 and performed a joint search over t 2 f0:1; 1e 2; 1e 3g and ˇt 2 f1e 3; 1e 4g. For all the baselines, we optimized the cross validation error over the 1 regularization coefﬁcients in the set f1e 1; 1e 2; 1e 3; 1e 4g. Moreover, in case of elastic net, the ratio of the 1 coefﬁcient to the 2 coefﬁcient was set to 1. The depth of the estimators in Bonsai was selected from f2; 3; 4g. Finally, the dimensionality of projection in Proto NN was searched over f5; 10; 15; 20g.

Figure 3 provides strong empirical evidence that SMa LL compares favorably to the baselines on several high dimensional Open ML datasets belonging to the Fri series. Speciﬁcally, the ﬁrst number in each dataset name indicates the number of examples, and the second the dimensionality of the dataset. Note that in case of SMa LL, some features might be selected in more than one prototype. Therefore, to be fair to the other methods, we included the multiplicity while computing the total feature count. We observe that, on all but one of these datasets, SMa LL outperformed the Proto NN and Bonsai models at the same level of sparsity, and the gap between SMa LL and these methods generally turned out to be huge. Moreover, compared to the linear baselines, SMa LL achieved consistently better performance at much sparser levels. This shows the promise of SMa LL toward achieving succinct yet accurate predictors in the high dimensional regime. The merits of SMa LL are further reinforced in Fig. 4 that shows the accuracy-sparsity trade-offs. We observe that just with 6 features, SMa LL provides better test accuracy compared to all the baselines but GB and AB. This shows the potential of SMa LL as a practical algorithm for resource deﬁcient environments.

[1] C. Gupta, A. S. Suggala, A. Goyal, H. V. Simhadri, B. Paranjape, A. Kumar, S. Goyal, R. Udupa, M. Varma, and P. Jain. Proto NN: compressed and accurate k NN for resource-scarce devices. In ICML, pages 1331 1340, 2017.

[2] A. Kumar, S. Goyal, and M. Varma. Resource-efﬁcient machine learning in 2 kb ram for the internet of things. In ICML, pages 1935 1944, 2017.

[3] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134 1142, 1984.

[4] J. R. Hauser, O. Toubia, T. Evgeniou, R. Befurt, and D. Dzyabura. Disjunctions of conjunctions, cognitive simplicity, and consideration sets. Journal of Marketing Research, 47(3):485 496, 2010.

[5] T. Wang, C. Rudin, F. Doshi, Y. Liu, E. Klampﬂ, and P. Mac Neille. Bayesian rule sets for interpretable classiﬁcation, with application to context-aware recommender systems. JMLR, 18(70):1 37, 2017.

[6] O. Cord. Genetic fuzzy systems: evolutionary tuning and learning of fuzzy knowledge bases, volume 19. World Scientiﬁc, 2001.

[7] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, and S. Rudich. Weakly learning dnf and characterizing statistical query learning using fourier analysis. In Proceedings of the Twenty-sixth Annual ACM Symposium on Theory of Computing (STOC), pages 253 262, 1994.

[8] Y. Mansour. An o (nlog log n) learning algorithm for dnf under the uniform distribution. Journal of Computer and System Sciences, 50(3):543 550, 1995.

[9] J. C. Jackson. An efﬁcient membership-query algorithm for learning dnf with respect to the uniform distribution. Journal of Computer and System Sciences, 55(3):414 440, 1997.

[10] K. Verbeurgt. Learning suc-classes of monotone dnf on the uniform distribution. In Proceedings of the Ninth Conference on Algorithmic Learning Theory, pages 385 399, 1998.

[11] N. H. Bshouty, J. C. Jackson, and C. Tamon. More efﬁcient pac-learning of dnf with membership queries under the uniform distribution. In Computational Learning Theory (COLT), pages 286 295, 1999.

[12] Y. Sakai and A. Maruoka. Learning monotone log-term dnf formulas under the uniform distribution. Theory of Computing Systems, 33(1):17 33, 2000.

[13] R. A. Servedio. On learning monotone dnf under product distributions. Information and Computation, 193(1):57 74, 2004.

[14] N. H. Bshouty, E. Mossel, R. O Donnell, and R. A. Servedio. Learning dnf from random walks. Journal of Computer and System Sciences, 71(3):250 265, 2005.

[15] V. Feldman. Learning DNF expressions from fourier spectrum. In Conference on Learning Theory (COLT), pages 17.1 17.19, 2012.

[16] A. R. Klivans and R. A. Servedio. Learning dnf in time 2o (n1/3). Journal of Computer and System Sciences, 68(2):303 318, 2004.

[17] S. Khot and R. Saket. Hardness of minimizing and learning dnf expressions. In Foundations of Computer Science (FOCS), pages 231 240, 2008.

[18] F. Aiolli and A. Sperduti. Multiclass classiﬁcation with multi-prototype support vector machines. JMLR, 6:817 850, 2005.

[19] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37(5):1342 1372, 2008.

[20] O. Dekel and Y. Singer. Support vector machines on a budget. In NIPS, pages 345 352, 2007.

[21] M. Kusner, S. Tyree, K. Q. Weinberger, and K. Agrawal. Stochastic neighbor compression. In ICML, pages 622 630, 2014.

[22] K. Zhong, R. Guo, S. Kumar, B. Yan, D. Simcha, and I. Dhillon. Fast Classiﬁcation with Binary Prototypes. In AISTATS, pages 1255 1263, 2017.

[23] M. Pilanci and M. J. Wainwright. Sparse learning via Boolean relaxations. Mathematical Programming, 151:63 87, 2015.

[24] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (methodological), 58(1):267 288, 1996.

[25] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B (methodological), 67(2):301 320, 2005.

[26] V. K. Garg, L. Xiao, and O. Dekel. Sparse Multiprototype Classiﬁcation. In UAI, 2018.

[27] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In ICLR, 2016.

[28] F. Nan, J. Wang, and V. Saligrama. Pruning random forests for prediction on a budget. In NIPS, pages 2334 2342, 2016.

[29] J.-H. Luo, J. Wu, and W. Lin. Thinet: A ﬁlter level pruning method for deep neural network compression. In ICCV, pages 5068 5076, 2017.

[30] T. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In ICASSP, pages 6655 6659, 2013.

[31] P. Nakkiran, R. Alvarez, R. Prabhavalkar, and C. Parada. Compressing deep neural networks using a rank-constrained topology. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[32] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, pages 2285 2294, 2015.

[33] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In NIPS, pages 4107 4115, 2016.

[34] D. Bertsimas and R. Shioda. Classiﬁcation and regression via integer optimization. Operations Research, 55(2):252 271, 2007.

[35] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[36] M. Tan, I. W. Tsang, and Li Wang. Towards ultrahigh dimensional feature selection for big data. JMLR, 15(1):1371 1429, 2014.

[37] A. Nemirovski. Prox-method with rate of convergence O.1=t/ for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229 251, 2004.

[38] A. Juditsky and A. Nemirovski. First-order methods for nonsmooth convex large-scale optimization, II: Utilizing problems s structure. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, chapter 6, pages 149 184. The MIT Press, Cambridge, MA., 2011.

[39] P. Brucker. An O.n/ algorithm for quadratic Knapsack problems. Operations Research Letters, 3(3):163 166, 1984.

[40] P. M. Pardalos and N. Kovoor. An algorithm for a singly constrained class of quadratic programs subject to upper and lower bounds. Mathematical Programming, 46:321 328, 1990.

[41] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efﬁcient projection onto the 1-ball for learning in high dimensions. In ICML, pages 272 279, 2008.

[42] A. Jalali, M. Fazel, and L. Xiao. Variational Gram functions: Convex analysis and optimization. SIAM Journal on Optimization, 27(4):2634 2661, 2017.