Journal of Machine Learning Research 16 (2015) 1275-1304 Submitted 12/12; Revised 3/15; Published 7/15

Learning Sparse Low-Threshold Linear Classifiers

Sivan Sabato sabatos@cs.bgu.ac.il
Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel

Shai Shalev-Shwartz shais@cs.huji.ac.il
Benin School of Computer Science and Engineering, The Hebrew University, Givat Ram, Jerusalem 91904, Israel

Nathan Srebro nati@ttic.edu
Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, IL 60637

Daniel Hsu djhsu@cs.columbia.edu
Department of Computer Science, Columbia University, 1214 Amsterdam Avenue, #0401, New York, NY 10027

Tong Zhang tzhang@stat.rutgers.edu
Department of Statistics, Rutgers University, Piscataway, NJ 08854

Editor: Koby Crammer

Abstract

We consider the problem of learning a non-negative linear classifier with an $\ell_1$-norm of at most $k$, and a fixed threshold, under the hinge loss. This problem generalizes the problem of learning a $k$-monotone disjunction. We prove that we can learn efficiently in this setting, at a rate which is linear in both $k$ and the size of the threshold, and that this is the best possible rate. We provide an efficient online learning algorithm that achieves the optimal rate, and show that in the batch case, empirical risk minimization achieves this rate as well. The rates we show are tighter than the uniform convergence rate, which grows with $k^2$.

Keywords: linear classifiers, monotone disjunctions, online learning, empirical risk minimization, uniform convergence

1. Introduction

We consider the problem of learning non-negative, low-$\ell_1$-norm linear classifiers with a fixed (or bounded) threshold. That is, we consider hypothesis classes over instances $x \in [0,1]^d$ of the following form:

    $H_{k,\theta} = \left\{ x \mapsto \langle w, x \rangle - \theta \;\middle|\; w \in \mathbb{R}^d_+, \|w\|_1 \le k \right\}$,    (1)

© 2015 Sivan Sabato, Shai Shalev-Shwartz, Nathan Srebro, Daniel Hsu, and Tong Zhang.
where we associate each (real-valued) linear predictor in $H_{k,\theta}$ with a binary classifier:¹

    $x \mapsto \mathrm{sign}(\langle w, x \rangle - \theta) = \begin{cases} +1 & \text{if } \langle w, x \rangle > \theta \\ -1 & \text{if } \langle w, x \rangle < \theta. \end{cases}$    (2)

Note that the hypothesis class is specified by both the $\ell_1$-norm constraint $k$ and the fixed threshold $\theta$. In fact, the main challenge here is to understand how the complexity of learning $H_{k,\theta}$ changes with $\theta$.

The classes $H_{k,\theta}$ can be seen as a generalization and extension of the class of $k$-monotone disjunctions and $r$-of-$k$ formulas. Considering binary instances $x \in \{0,1\}^d$, the class of $k$-monotone disjunctions corresponds to linear classifiers with binary weights, $w \in \{0,1\}^d$, with $\|w\|_1 \le k$ and a fixed threshold of $\theta = 1/2$; that is, a restriction of $H_{k,1/2}$ to integer weights and integer instances. More generally, the class of $r$-of-$k$ formulas (i.e., formulas which are true if at least $r$ of a specified $k$ variables are true) corresponds to a similar restriction, but with a threshold of $\theta = r - 1/2$.

Studying $k$-disjunctions and $r$-of-$k$ formulas, Littlestone (1988) presented the efficient Winnow online learning rule, which admits an online mistake bound (in the separable case) of $O(k \log d)$ for $k$-disjunctions and $O(rk \log d)$ for $r$-of-$k$ formulas. In fact, in this analysis, Littlestone considered also the more general case of real-valued weights, corresponding to the class $H_{k,\theta}$ over binary instances $x \in \{0,1\}^d$ and for separable data, and showed that Winnow enjoys a mistake bound of $O(\theta k \log d)$ in this case as well. By applying a standard online-to-batch conversion (see, e.g., Shalev-Shwartz, 2012), one can also achieve a sample complexity upper bound of $O(\theta k \log(d)/\epsilon)$ for batch supervised learning of this class in the separable case. In this paper, we consider the more general case, where the instances $x$ can also be fractional, i.e., where $x \in [0,1]^d$, and the agnostic, non-separable, case. It should be noted that Littlestone (1989) also studied a limited version of the non-separable setting.
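For concreteness, the classical Winnow rule discussed above can be sketched as follows. This is a minimal sketch of Littlestone's promotion/demotion variant for monotone disjunctions over binary instances, not an algorithm from this paper; the function name, the update factor of 2, and the default threshold $d/2$ are the textbook choices, assumed here for illustration.

```python
import numpy as np

def winnow(stream, d, threshold=None):
    """Classical Winnow (multiplicative promotion/demotion) sketch.

    stream: iterable of (x, y) with x in {0,1}^d and y in {-1,+1}.
    Predict +1 iff <w, x> > threshold; on a mistake, double (y = +1)
    or halve (y = -1) the weights of the active coordinates.
    """
    theta = threshold if threshold is not None else d / 2.0
    w = np.ones(d)
    mistakes = 0
    for x, y in stream:
        yhat = 1 if w @ x > theta else -1
        if yhat != y:
            mistakes += 1
            w *= 2.0 ** (y * x)  # promote on false negatives, demote on false positives
    return w, mistakes
```

On separable data (e.g., a target disjunction), the number of mistakes stays bounded regardless of the stream length, which is the $O(k \log d)$ phenomenon the Winnow analysis captures.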
In order to move on to the fractional and agnostic analysis, we must clarify the loss function we will use, and the related issue of separation with a margin. When the instances $x$ and weight vectors $w$ are integer-valued, $\langle w, x \rangle$ is always an integer. Therefore, if positive and negative instances are at all separated by some predictor $w$ (i.e., $\mathrm{sign}(\langle w, x \rangle - \theta) = y$, where $y \in \{\pm 1\}$ denotes the target label), they are necessarily separated by a margin of half. That is, setting $\theta = r - 1/2$ for an integer $r$, we have $y(\langle w, x \rangle - \theta) \ge 1/2$. Moving to fractional instances and weight vectors, we need to require such a margin explicitly. And if considering the agnostic case, we must account not only for misclassified points, but also for margin violations. As is standard both in online learning (e.g., the agnostic Perceptron guarantee of Gentile 2003) and in statistical learning using convex optimization (e.g., support vector machines), we will rely on the hinge loss at margin half,² which is equal to $[1 - 2y(\langle w, x \rangle - \theta)]_+$. The hinge loss is a convex upper bound on the zero-one loss (that is, the misclassification rate), and so obtaining learning guarantees for it translates to guarantees on the misclassification error rate.

1. The value of the mapping when $\langle w, x \rangle = \theta$ can be arbitrary, as our results and our analysis do not depend on it.
2. Measuring the hinge loss at a margin of half rather than a margin of one is an arbitrary choice, corresponding to a scaling by a factor of two, which fits better with the integer case discussed above.

Phrasing the problem as hinge-loss minimization over the hypothesis class $H_{k,\theta}$, we can use Online Exponentiated Gradient (EG) (Kivinen and Warmuth, 1994) or Online Mirror Descent (MD) (e.g., Shalev-Shwartz, 2007; Srebro et al., 2011), which rely only on the $\ell_1$ bound and hold for any threshold.
In the statistical setting, we can use Empirical Risk Minimization (ERM), in this case minimizing the empirical hinge loss, and rely on uniform concentration for bounded-$\ell_1$ predictors (Schapire et al., 1997; Zhang, 2002; Kakade et al., 2009), again regardless of the threshold. However, these approaches yield mistake bounds or sample complexities that scale quadratically with the $\ell_1$ norm, that is, with $k^2$ rather than with $\theta k$. Since the relevant range of thresholds is $0 \le \theta \le k$, a scaling of $\theta k$ is always better than $k^2$. When $\theta$ is large, that is, roughly $k/2$, the Winnow bound agrees with the EG and MD bounds. But when we consider classification with a small threshold (for instance, $\theta = 1/2$ in the case of disjunctions), the Winnow analysis clarifies that this is a much simpler class, with a resulting smaller mistake bound and sample complexity, scaling with $k$ rather than with $k^2$. This distinction is lost in the EG and MD analyses, and in the ERM guarantee based on uniform convergence arguments. For small thresholds, where $\theta = O(1)$, the difference between these analyses and the Winnow guarantee is a factor of $k$.

Our starting point and our main motivation for this paper is to understand this gap between the EG, MD, and uniform concentration analyses on the one hand, and the Winnow analysis on the other. Is this gap an artifact of the integer domain or the separability assumption? Or can we obtain guarantees that scale as $\theta k$ rather than $k^2$ also in the non-integer, non-separable case? In the statistical setting, must we use an online algorithm (such as Winnow) and an online-to-batch conversion in order to ensure a sample complexity that scales with $\theta k$, or can we obtain the same sample complexity also with ERM? This is an important question, since the ERM algorithm is considered the canonical batch learning algorithm, and understanding its scope and limitations is of theoretical and practical interest.
A related question is whether it is possible to establish uniform convergence guarantees with a dependence on $\theta k$ rather than $k^2$, or whether the learning guarantees here arise from a more delicate argument. If an ERM algorithm obtains bounds similar to those of the online algorithm with online-to-batch conversion, then any algorithm that can minimize the risk on the sample can be used for learning in this setting. Moreover, this advances our theoretical understanding of the limitations and scope of the canonical ERM algorithm. The gap between the Winnow analysis and the more general $\ell_1$-norm-based analyses is particularly interesting since we know that, in a sense, online mirror descent always provides the best possible rates in the online setting (Srebro et al., 2011). It is thus desirable to understand whether mirror descent is required here to achieve the best rates, or whether it can be replaced by a simple regularized loss minimization.

Answering the above questions, our main contributions are:

- We provide a variant of online Exponentiated Gradient, for which we establish a regret bound of $O(\sqrt{\theta k \log(d) T})$ for $H_{k,\theta}$, improving on the $O(\sqrt{k^2 \log(d) T})$ regret guarantee ensured by the standard EG analysis. We do so using a more refined analysis based on local norms. Using a standard online-to-batch conversion, this yields a sample complexity of $O(\theta k \log(d)/\epsilon^2)$ in the statistical setting. This result is given in Corollary 5, Section 3.

- In the statistical agnostic PAC setting, we show that the rate of uniform convergence of the empirical hinge loss of predictors in $H_{k,\theta}$ is indeed $\Omega(\sqrt{k^2/m})$, where $m$ is the sample size, corresponding to a sample complexity of $\Omega(k^2/\epsilon^2)$, even when $\theta$ is small. We show this in Theorem 21 in Section 5.
- Nevertheless, we establish a learning guarantee for empirical risk minimization which matches the online-to-batch guarantee above (up to logarithmic factors), and ensures a sample complexity of $O(\theta k \log(d)/\epsilon^2)$ also when using ERM. This is obtained by a more delicate local analysis, focusing on predictors which might be chosen as empirical risk minimizers, rather than a uniform analysis over the entire class $H_{k,\theta}$. The result is given in Theorem 6, Section 4.

- We also establish a matching lower bound (up to logarithmic factors) of $\Omega(\theta k/\epsilon^2)$ on the required sample complexity for learning $H_{k,\theta}$ in the statistical setting. This shows that our ERM analysis is tight (up to logarithmic factors), and that, furthermore, the regret guarantee we obtain in the online setting is likewise tight up to logarithmic factors. This lower bound is provided in Theorem 17, Section 5.

1.1 Related Prior Work

We discussed Littlestone's work on Winnow at length above. In our notation, Littlestone (1988) established a mistake bound (that is, a regret guarantee in the separable case, where there exists a predictor with zero hinge loss) of $O(k\theta \log(d))$ for $H_{k,\theta}$ when the instances are integer, $x \in \{0,1\}^d$. Littlestone also established a lower bound of $k \log(d/k)$ on the VC-dimension of $k$-monotone disjunctions, corresponding to the case $\theta = 1/2$, thus implying an $\Omega(k \log(d/k)/\epsilon^2)$ lower bound on learning $H_{k,1/2}$. However, the question of obtaining a lower bound for other values of the threshold $\theta$ was left open by Littlestone.

In the agnostic case, Auer and Warmuth (1998) studied the discrete problem of $k$-monotone disjunctions, corresponding to $H_{k,1/2}$ with integer instances $x \in \{0,1\}^d$ and integer weights $w \in \{0,1\}^d$, under the attribute loss, defined as the number of variables in the assignment that need to be flipped in order to make the predicted label correct.
They provide an online algorithm with an expected mistake bound of $A + 2\sqrt{A\,k\ln(d/k)} + O(k\ln(d/k))$, where $A$ is the best possible attribute loss for the given online sequence. An online-to-batch conversion thus achieves here a zero-one loss which converges to the optimal attribute loss on this problem at the rate of $O(k\ln(d/k)/\epsilon^2)$. Since the attribute loss is upper-bounded by the hinge loss, a similar result, in which $A$ is replaced with the optimal hinge loss for the given sequence, also holds for the same algorithm. This establishes an agnostic guarantee of the desired form, for a threshold of $\theta = 1/2$, and when both the instances and weight vectors are integers.

2. Notations and Definitions

For a real number $q$, we denote its positive part by $[q]_+ := \max\{0, q\}$. We denote universal positive constants by $C$; the value of $C$ may differ between statements or even between lines of the same expression. We denote by $\mathbb{R}^d_+$ the non-negative orthant in $\mathbb{R}^d$. The all-zero vector in $\mathbb{R}^d$ is denoted by $\mathbf{0}$. For an integer $n$, we denote $[n] = \{1, \ldots, n\}$. For a vector $x \in \mathbb{R}^d$ and $i \in [d]$, $x[i]$ denotes the $i$-th coordinate of $x$.

We will slightly overload notation and use $H_{k,\theta}$ to denote both the set of linear predictors $x \mapsto \langle w, x \rangle - \theta$ and the set of vectors $w \in \mathbb{R}^d_+$ such that $\|w\|_1 \le k$. We will use $w$ to denote both the vector and the linear predictor associated with it. For convenience we will work with half the hinge loss at margin half, and denote this loss, for a predictor $w \in H_{k,\theta}$, for $\theta \in [0, k]$, by

    $\ell_\theta(x, y, w) := \left[\tfrac{1}{2} - y(\langle w, x \rangle - \theta)\right]_+$.

The subscript $\theta$ will sometimes be omitted when it is clear from context. We term $\ell_\theta$ the Winnow loss. Echoing the half-integer thresholds for $k$-monotone disjunctions, $r$-of-$k$ formulas, and the discrete case more generally, we will denote $r = \theta + \tfrac{1}{2}$, so that $\theta = r - \tfrac{1}{2}$. In the discrete case $r$ is an integer, but in this paper it can also be fractional. We will also sometimes refer to $\bar{r} = \tfrac{1}{2} - \theta$. Note that $\bar{r}$ can be negative.
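The Winnow loss and its two label-conditional forms can be sketched directly from the definitions above; the function name is my own, but the formula is exactly $\ell_\theta$ as defined, and the usage note checks that it equals $[r - \langle w,x\rangle]_+$ on positive labels and $[\bar{r} + \langle w,x\rangle]_+$ on negative labels.

```python
import numpy as np

def winnow_loss(x, y, w, theta):
    """Half the hinge loss at margin 1/2: [1/2 - y(<w,x> - theta)]_+."""
    return max(0.0, 0.5 - y * (np.dot(w, x) - theta))
```

With $r = \theta + 1/2$ and $\bar{r} = 1/2 - \theta$, the loss on a positive example is $[r - \langle w,x\rangle]_+$ and on a negative example is $[\bar{r} + \langle w,x\rangle]_+$, which is the form used repeatedly in the proofs below.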
In the statistical setting, we refer to some fixed and unknown distribution $D$ over instance-label pairs $(X, Y)$, where we assume access to a sample (training set) drawn i.i.d. from $D$, and the objective is to minimize the expected loss:

    $\ell_\theta(w, D) = \mathbb{E}_{(X,Y)\sim D}[\ell_\theta(X, Y, w)]$.    (3)

When the distribution $D$ is clear from context, we simply write $\ell_\theta(w)$, and we might also omit the subscript $\theta$. For fixed $D$ and $\theta$ we let $w^* \in \mathrm{argmin}_{w \in H_{k,\theta}} \mathbb{E}[\ell(X, Y, w)]$. This is the true minimizer of the loss on the distribution. For a set of predictors (hypothesis class) $H$, we denote $\ell^*_\theta(H, D) := \min_{w \in H} \ell_\theta(w, D)$. For a sample $S \in ([0,1]^d \times \{\pm 1\})^m$, we use the notation

    $\hat{\mathbb{E}}_S[f(X, Y)] = \frac{1}{m}\sum_{i=1}^m f(x_i, y_i)$    (4)

and again sometimes drop the subscript $S$ when it is clear from context. For a fixed sample $S$, and fixed $\theta$ and $D$, the empirical loss of a predictor $w$ on the sample is denoted $\hat{\ell}(w) = \hat{\mathbb{E}}_S[\ell_\theta(X, Y, w)]$.

2.1 Rademacher Complexity

The empirical Rademacher complexity of the Winnow loss for a class $W \subseteq \mathbb{R}^d$ with respect to a sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in ([0,1]^d \times \{\pm 1\})^m$ is

    $R(W, S) := \frac{2}{m}\,\mathbb{E}_\epsilon\left[\sup_{w \in W}\sum_{i=1}^m \epsilon_i \ell(x_i, y_i, w)\right]$,    (5)

where the expectation is over the Rademacher random variables $\epsilon_1, \ldots, \epsilon_m$. These are defined as independent random variables drawn uniformly from $\{\pm 1\}$. The average Rademacher complexity of the Winnow loss for a class $W \subseteq \mathbb{R}^d$ with respect to a distribution $D$ over $[0,1]^d \times \{\pm 1\}$ is denoted by

    $R_m(W, D) := \mathbb{E}_{S \sim D^m}[R(W, S)]$.    (6)

We also define the average Rademacher complexity of $W$ with respect to the linear loss by

    $R^L_m(W, D) := \frac{2}{m}\,\mathbb{E}\left[\sup_{w \in W}\sum_{i=1}^m \epsilon_i Y_i \langle w, X_i \rangle\right]$,    (7)

where the expectation is over $\epsilon_1, \ldots, \epsilon_m$ as above and $((X_1, Y_1), \ldots, (X_m, Y_m)) \sim D^m$.

2.2 Probability Tools

We use the following variation on Bernstein's inequality.

Proposition 1 Let $B > 0$. For a random variable $X \in [0, B]$, $\delta \in (0,1)$ and $n$ an integer, with probability at least $1-\delta$ over $n$ i.i.d. draws of $X$,

    $\hat{\mathbb{E}}[X] - \mathbb{E}[X] \le 2B\max\left\{\sqrt{\frac{\mathbb{E}[X]}{B}\cdot\frac{\ln(1/\delta)}{n}},\; \frac{\ln(1/\delta)}{n}\right\}$,

and the same bound holds for $\mathbb{E}[X] - \hat{\mathbb{E}}[X]$.

Proof By Bernstein's inequality (Bernstein, 1946), if $Z_1, \ldots, Z_n$ are i.i.d.
draws from a random variable $Z \in [-1, 1]$ such that $\mathbb{E}[Z] = 0$ and $\mathrm{Var}[Z] = \sigma^2$, then

    $\mathbb{P}[\hat{\mathbb{E}}[Z] \ge \epsilon] \le \exp\left(-\frac{n\epsilon^2}{2(\sigma^2 + \epsilon/3)}\right)$.    (8)

Fix $\delta \in (0,1)$ and an integer $n$. If $\ln(1/\delta)/n \le \sigma^2$ then let $\epsilon = 2\sqrt{\sigma^2\ln(1/\delta)/n} \le 2\sigma^2$. In this case

    $\frac{n\epsilon^2}{2\sigma^2 + 2\epsilon/3} \ge \frac{n\epsilon^2}{10\sigma^2/3} \ge \ln(1/\delta)$.

If $\ln(1/\delta)/n > \sigma^2$ then let $\epsilon = 2\ln(1/\delta)/n$; then $\sigma^2 < \ln(1/\delta)/n = \epsilon/2$. In this case

    $\frac{n\epsilon^2}{2\sigma^2 + 2\epsilon/3} \ge \frac{n\epsilon^2}{5\epsilon/3} \ge \frac{n\epsilon}{2} = \ln(1/\delta)$.

In both cases, the RHS of Eq. (8) is at most $\delta$. Therefore, with probability at least $1-\delta$,

    $\hat{\mathbb{E}}[Z] \le 2\max\left\{\sqrt{\frac{\sigma^2\ln(1/\delta)}{n}},\; \frac{\ln(1/\delta)}{n}\right\}$.

Now, for a random variable $X$ with range in $[0, B]$, let $Z = (X - \mathbb{E}[X])/B$. We have $\sigma^2 = \mathrm{Var}[Z] = \mathrm{Var}[X]/B^2 \le \mathbb{E}[X^2/B^2] \le \mathbb{E}[X/B]$, where the last inequality follows from the range of $X$. Therefore

    $\hat{\mathbb{E}}[X] - \mathbb{E}[X] \le 2B\max\left\{\sqrt{\frac{\mathbb{E}[X]}{B}\cdot\frac{\ln(1/\delta)}{n}},\; \frac{\ln(1/\delta)}{n}\right\}$.

The same bound on $\mathbb{E}[X] - \hat{\mathbb{E}}[X]$ can be derived similarly by considering $Z = (\mathbb{E}[X] - X)/B$. ∎

We further use the following fact, which bounds the ratio between the empirical fraction of positive or negative labels and their true probabilities. We will apply this fact to make sure that enough negative and positive labels can be found in a random sample.

Proposition 2 Let $B$ be a binomial random variable, $B \sim \mathrm{Binomial}(m, p)$. If $p \ge 8\ln(1/\delta)/m$ then with probability at least $1-\delta$, $B \ge mp/2$.

Proof This follows from a multiplicative Chernoff bound (Angluin and Valiant, 1979). ∎

3. Online Algorithm

Consider the following algorithm:

    Unnormalized Exponentiated Gradient (unnormalized-EG)
    parameters: $\eta, \lambda > 0$
    input: $z_1, \ldots, z_T \in \mathbb{R}^d$
    initialize: $w_1 = (\lambda, \ldots, \lambda) \in \mathbb{R}^d$
    update rule: $\forall i,\; w_{t+1}[i] = w_t[i]\,e^{-\eta z_t[i]}$

The following theorem provides a regret bound with local norms for the unnormalized EG algorithm (for a proof, see Theorem 2.23 of Shalev-Shwartz, 2012).

Theorem 3 Assume that the unnormalized EG algorithm is run on a sequence of vectors such that for all $t, i$ we have $\eta z_t[i] \ge -1$. Then, for all $u \in \mathbb{R}^d_+$,

    $\sum_{t=1}^T \langle w_t - u, z_t \rangle \le \frac{d\lambda + \sum_{i=1}^d u[i]\ln(u[i]/(e\lambda))}{\eta} + \eta\sum_{t=1}^T\sum_{i=1}^d w_t[i] z_t[i]^2$.
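The update rule in the box above can be sketched in a few lines; the generator interface and function name are my own, while the initialization and multiplicative update follow the algorithm box.

```python
import numpy as np

def unnormalized_eg(zs, d, eta, lam):
    """Unnormalized Exponentiated Gradient.

    zs: sequence of vectors z_1, ..., z_T in R^d (with eta * z_t[i] >= -1).
    Yields the iterates w_1, ..., w_T (each yielded before its update).
    """
    w = np.full(d, lam)                 # w_1 = (lam, ..., lam)
    for z in zs:
        yield w.copy()
        w = w * np.exp(-eta * z)        # w_{t+1}[i] = w_t[i] * exp(-eta * z_t[i])
```

Unlike normalized EG, the iterates are not rescaled to a fixed $\ell_1$ norm; the regret bound of Theorem 3 controls them through the local-norm term $\sum_i w_t[i] z_t[i]^2$ instead.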
Now, let us apply it to a case in which we have a sequence of convex functions $f_1, \ldots, f_T$, and $z_t$ is a sub-gradient of $f_t$ at $w_t$. Additionally, set $\lambda = k/d$ and consider $u$ such that $\|u\|_1 \le k$. We obtain the following.

Theorem 4 Assume that the unnormalized EG algorithm is run with $\lambda = k/d$. Assume that for all $t$ we have $z_t \in \partial f_t(w_t)$ for some convex function $f_t$. Further assume that for all $t, i$ we have $\eta z_t[i] \ge -1$, and that for some positive constants $\alpha, \beta$, it holds that $\eta = \sqrt{k\ln(d)/(\beta T)}$, $T \ge 4\alpha^2 k\ln(d)/\beta$, and

    $\sum_{i=1}^d w_t[i] z_t[i]^2 \le \alpha f_t(w_t) + \beta$.    (9)

Then, for all $u \in \mathbb{R}^d_+$ with $\|u\|_1 \le k$ we have

    $\sum_{t=1}^T f_t(w_t) \le \sum_{t=1}^T f_t(u) + 2\alpha\eta\sum_{t=1}^T f_t(u) + \sqrt{4\beta k\ln(d)T} + 4\alpha k\ln(d)$.

Proof Using the convexity of $f_t$ and the assumption that $z_t \in \partial f_t(w_t)$, we have that

    $\sum_{t=1}^T (f_t(w_t) - f_t(u)) \le \sum_{t=1}^T \langle w_t - u, z_t \rangle$.

Combining with Theorem 3 we obtain

    $\sum_{t=1}^T (f_t(w_t) - f_t(u)) \le \frac{d\lambda + \sum_{i=1}^d u[i]\ln(u[i]/(e\lambda))}{\eta} + \eta\sum_{t=1}^T\sum_{i=1}^d w_t[i] z_t[i]^2$.

Using the assumption in Eq. (9), the definition $\lambda = k/d$, and the assumptions on $u$, we obtain

    $\sum_{t=1}^T (f_t(w_t) - f_t(u)) \le \frac{k\ln(d)}{\eta} + \eta\beta T + \eta\alpha\sum_{t=1}^T f_t(w_t)$.

Rearranging the above, we conclude that

    $\sum_{t=1}^T f_t(w_t) \le \frac{1}{1-\alpha\eta}\left(\sum_{t=1}^T f_t(u) + \frac{k\ln(d)}{\eta} + \eta\beta T\right)$.

Now, since $1/(1-x) \le 1+2x$ for $x \in [0, 1/2]$ and $\alpha\eta \le \tfrac{1}{2}$, we conclude, by substituting the definition of $\eta$, that

    $\sum_{t=1}^T f_t(w_t) \le \sum_{t=1}^T f_t(u) + 2\alpha\eta\sum_{t=1}^T f_t(u) + 2\sqrt{\beta k\ln(d)T} + 4\alpha k\ln(d)$. ∎

We can now derive the desired regret bound for our algorithm. We also provide a bound for the statistical setting, using online-to-batch conversion.

Corollary 5 Let $\ell \equiv \ell_\theta$ for some $\theta \in [0, k]$. Fix any sequence $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T) \in [0,1]^d \times \{\pm 1\}$ and assume $T \ge 4k\ln(d)/r$. Suppose the unnormalized EG algorithm listed in Section 3 is run using $\eta := \sqrt{k\ln(d)/(rT)}$, $\lambda := k/d$, and any $z_t \in \partial_w \ell(x_t, y_t, w_t)$ for all $t$. Define $L_{UEG} := \sum_{t=1}^T \ell(x_t, y_t, w_t)$, let $L(u) := \sum_{t=1}^T \ell(x_t, y_t, u)$, and let $u^* \in \mathrm{argmin}_{u \in H_{k,\theta}} L(u)$. Then the following regret bound holds:

    $L_{UEG} \le L(u^*) + \sqrt{16\,r k\ln(d)T} + 4k\ln(d)$.    (10)

Moreover, for $m \ge 1$, assume that a random sample $S = ((x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m))$ is drawn i.i.d.
from an unknown distribution $D$ over $[0,1]^d \times \{\pm 1\}$. Then there exists an online-to-batch conversion of the UEG algorithm that takes $S$ as input and outputs $\tilde{w}$, such that

    $\mathbb{E}[\ell(\tilde{w}, D)] \le \ell(w^*, D) + \sqrt{\frac{16\,r k\ln(d)}{m}} + \frac{4k\ln(d)}{m}$,    (11)

where the expectation is over the random draw of $S$.

Proof Every sub-gradient $z_t \in \partial_w \ell(x_t, y_t, w_t)$ is of the form $z_t = a_t x_t$ for some $a_t \in \{-1, 0, +1\}$. Since $0 \le x_t[i] \le 1$ and $w_t[i] \ge 0$ for all $i$, it follows that $\sum_{i=1}^d w_t[i] z_t[i]^2 = |a_t|\sum_{i=1}^d w_t[i] x_t[i]^2 \le |a_t|\langle w_t, x_t \rangle$. Now consider three disjoint cases.

Case 1: $\langle w_t, x_t \rangle \le r$. Then $\sum_{i=1}^d w_t[i] z_t[i]^2 \le \langle w_t, x_t \rangle \le r$.

Case 2: $\langle w_t, x_t \rangle > r$ and $y_t = +1$. Then $a_t = 0$ and $\sum_{i=1}^d w_t[i] z_t[i]^2 = 0$.

Case 3: $\langle w_t, x_t \rangle > r$ and $y_t = -1$. Then $\sum_{i=1}^d w_t[i] z_t[i]^2 \le \langle w_t, x_t \rangle = [\bar{r} + \langle w_t, x_t \rangle]_+ - \bar{r} \le [\bar{r} + \langle w_t, x_t \rangle]_+ + r$.

In all three cases, the final upper bound on $\sum_{i=1}^d w_t[i] z_t[i]^2$ is at most $\ell(x_t, y_t, w_t) + r$. Therefore, Eq. (9) from Theorem 4 is satisfied with $f_t(w) := \ell(x_t, y_t, w)$, $\alpha := 1$, and $\beta := r$. From Theorem 4 with this choice of $f_t$ and the given settings of $\eta$, $\lambda$, and $z_t$, we get that for any $u$ such that $\|u\|_1 \le k$,

    $L_{UEG} \le L(u) + 2\sqrt{\frac{k\ln(d)}{rT}}\,L(u) + \sqrt{4\,rk\ln(d)T} + 4k\ln(d)$.    (12)

Observing that $L(u^*) \le L(\mathbf{0}) \le rT$, we conclude the regret bound in Eq. (10).

For the statistical setting, a simple approach for online-to-batch conversion is to run the UEG algorithm as detailed in Corollary 5, with $T = m$, and to return the average predictor $\tilde{w} = \frac{1}{m}\sum_{i \in [m]} w_i$. By standard analysis (e.g., Shalev-Shwartz, 2012, Theorem 5.1), $\mathbb{E}[\ell_\theta(\tilde{w}, D)] \le \frac{1}{m}\mathbb{E}[L_{UEG}]$, where the expectation is over the random draw of $S$. Setting $u = w^*$, Eq. (12) gives

    $\mathbb{E}[\ell_\theta(\tilde{w}, D)] \le \mathbb{E}[\hat{\ell}(w^*)] + 2\sqrt{\frac{k\ln(d)}{rm}}\,\mathbb{E}[\hat{\ell}(w^*)] + \sqrt{\frac{4\,rk\ln(d)}{m}} + \frac{4k\ln(d)}{m}$.

Since $\mathbb{E}[\hat{\ell}(w^*)] = \ell(w^*)$ and $\ell(w^*) \le r$, Eq. (11) follows. ∎

In the online setting, a simple version of the canonical mirror descent algorithm thus achieves the postulated regret bound of $O(\sqrt{rk\log(d)T}) = O(\sqrt{\theta k\log(d)T})$. For the statistical setting, an online-to-batch conversion provides the desired rate of $O(rk\log(d)/\epsilon^2) = O(\theta k\log(d)/\epsilon^2)$.
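Corollary 5's online-to-batch conversion can be sketched as follows. The step size, initialization, subgradient form $z_t = a_t x_t$ with $a_t \in \{-1, 0, +1\}$, and iterate averaging follow the corollary; the function name and the toy data in the test are my own assumptions.

```python
import numpy as np

def ueg_online_to_batch(samples, d, k, theta):
    """Run unnormalized EG on Winnow-loss subgradients and average the iterates.

    samples: list of (x, y) with x in [0,1]^d and y in {-1,+1}.
    Uses eta = sqrt(k ln(d) / (r T)) and lam = k / d, with r = theta + 1/2,
    as in Corollary 5. Returns the averaged predictor w_tilde.
    """
    T = len(samples)
    r = theta + 0.5
    eta = np.sqrt(k * np.log(d) / (r * T))
    w = np.full(d, k / d)
    avg = np.zeros(d)
    for x, y in samples:
        avg += w / T
        margin = 0.5 - y * (w @ x - theta)
        a = -y if margin > 0 else 0.0      # subgradient of the loss is a * x
        w = w * np.exp(-eta * a * x)
    return avg
```

Note that, as the text observes, the averaged output need not lie in $H_{k,\theta}$: the unnormalized iterates can exceed the $\ell_1$-norm bound, which is why this conversion is an improper learner.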
Is this online-to-batch approach necessary, or is a similar rate for the statistical setting achievable also using standard ERM? Moreover, this online-to-batch approach leads to an improper algorithm; that is, the output $\tilde{w}$ might not be in $H_{k,\theta}$, since it might not satisfy the norm bound. In the next section we show that standard, proper, ERM leads to the same learning rate.

4. ERM Upper Bound

We now proceed to analyze the performance of empirical risk minimization in the statistical batch setting. As above, assume a random sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ of pairs drawn i.i.d. according to a distribution $D$ over $[0,1]^d \times \{\pm 1\}$. An empirical risk minimizer on the sample is denoted $\hat{w} \in \mathrm{argmin}_{w \in H_{k,\theta}}\frac{1}{m}\sum_{i \in [m]}\ell(x_i, y_i, w)$. We wish to show an upper bound on $\ell(\hat{w}) - \ell(w^*)$. We will prove the following theorem:

Theorem 6 For $k \ge r > 0$ and $m \ge k$, with probability $1-\delta$ over the random draw of $S$,

    $\ell(\hat{w}) \le \ell(w^*) + O\left(\sqrt{\frac{rk(\ln(d)\ln^3(3m) + \ln(1/\delta))}{m}}\right) + O\left(\frac{r\ln(1/\delta)}{m}\right)$.

The proof strategy is based on considering the loss on negative examples and the loss on positive examples separately. Denote $\ell^-(w, D) = \mathbb{E}_{(X,Y)\sim D}[\ell(X, Y, w) \mid Y = -1]$ and $\ell^+(w, D) = \mathbb{E}_{(X,Y)\sim D}[\ell(X, Y, w) \mid Y = +1]$. For a given sample, denote $\hat{\ell}^-(w) = \hat{\mathbb{E}}[\ell(X, Y, w) \mid Y = -1]$, and similarly for $\hat{\ell}^+(w)$. Denote $p_+ = \mathbb{P}_{(X,Y)\sim D}[Y = +1]$ and $\hat{p}_+ = \hat{\mathbb{E}}[\mathbf{1}[Y = +1]]$, and similarly for $p_-$ and $\hat{p}_-$.

As Theorem 21 in Section 5 below shows, the rate of uniform convergence of $\hat{\ell}^-(w)$ to $\ell^-(w)$ for all $w \in H_{k,\theta}$ is $\Omega(\sqrt{k^2/m})$, which is slower than the desired $O(\sqrt{\theta k/m})$. Therefore, a uniform convergence analysis for $H_{k,\theta}$ cannot provide a tight result. Instead, we define a subset $U_b \subseteq H_{k,\theta}$, such that with probability at least $1-\delta$, the empirical risk minimizer of a random sample is in $U_b$. We show that a uniform convergence rate of $O(\sqrt{\theta k/m})$ does in fact hold for all $w \in U_b$. The analysis of uniform convergence of the negative loss is carried out in Section 4.1.
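As a concrete (hypothetical) instantiation of the ERM rule analyzed in this section, the empirical Winnow loss can be approximately minimized over $\{w \ge 0, \|w\|_1 \le k\}$ by projected subgradient descent. The theorem above concerns the exact minimizer $\hat{w}$, not this particular solver; the solver, the standard sorting-based simplex projection it uses, and all names below are my own illustrative choices.

```python
import numpy as np

def project_l1_nonneg(v, k):
    """Euclidean projection onto {w >= 0, ||w||_1 <= k}."""
    w = np.maximum(v, 0.0)
    if w.sum() <= k:
        return w
    # project onto the scaled simplex {w >= 0, sum(w) = k}
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - k))[0][-1]
    tau = (css[rho] - k) / (rho + 1.0)
    return np.maximum(w - tau, 0.0)

def erm_winnow_loss(X, y, k, theta, steps=500, lr=0.05):
    """Projected-subgradient sketch of ERM for the empirical Winnow loss."""
    m, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = 0.5 - y * (X @ w - theta)
        g = -(y * (margins > 0)) @ X / m   # subgradient of the average loss
        w = project_l1_nonneg(w - lr * g, k)
    return w
```

Any procedure returning an (approximate) empirical minimizer in $H_{k,\theta}$ fits the analysis; this sketch merely shows that the constraint set is easy to project onto, so proper ERM is computationally straightforward here.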
For positive labels, uniform convergence rates over $H_{k,\theta}$ in fact suffice to provide the desired guarantee. This analysis is provided in Section 4.2. The analysis uses the results in Section 3 for the online algorithm to construct a small cover of the relevant function class. This then bounds the Rademacher complexity of the class and leads to a uniform convergence guarantee. In Section 4.3, the two convergence results are combined, while taking into account the mixture of positive and negative labels in $D$.

4.1 Convergence on Negative Labels

We now commence the analysis for negative labels. Denote by $D^-$ the distribution of $(X, Y) \sim D$ conditioned on $Y = -1$, so that $\mathbb{P}_{(X,Y)\sim D^-}[Y = -1] = 1$ and $\mathbb{P}_{(X,Y)\sim D^-}[X = x] = \mathbb{P}_{(X,Y)\sim D}[X = x \mid Y = -1]$. For $b \ge 0$ define

    $U_b(D) = \{ w \in \mathbb{R}^d_+ \mid \|w\|_1 \le k,\; \mathbb{E}_D[\langle w, X \rangle \mid Y = -1] \le b \}$.

Note that $U_b(D) \subseteq H_{k,\theta}$. We now bound the rate of convergence of $\hat{\ell}^-$ to $\ell^-$ for all $w \in U_b(D)$. We will then show that $b$ can be set so that with high probability $\hat{w} \in U_b(D)$. Our technique is related to local Rademacher analysis (Bartlett et al., 2005), in that the latter also proposes to bound the Rademacher complexity of subsets of a function class, and uses these bounds to provide tighter convergence rates. Our analysis is better tailored to the Winnow loss, by taking into account the different effects of the negative and positive labels.

The convergence rate for $U_b(D)$ is bounded by first bounding $R^L_m(U_b(D), D^-)$, the Rademacher complexity of the linear loss for the distribution over the examples with negative labels, and then concluding a similar bound on $R_m(U_b(D), D)$. We start with a more general bound on $R^L_m$.

Lemma 7 For a fixed distribution $D$ over $[0,1]^d \times \{\pm 1\}$, let $\alpha_j = \mathbb{E}_{(X,Y)\sim D}[X[j]]$, and let $\mu \in \mathbb{R}^d_+$. Define $U_\mu = \{ w \in \mathbb{R}^d_+ \mid \langle w, \mu \rangle \le 1 \}$.
Then, if $dm \ge 3$,

    $R^L_m(U_\mu, D) \le C\max_{j:\alpha_j>0}\frac{1}{\mu[j]}\sqrt{\frac{\ln(dm)}{m}\max\left\{\alpha_j,\; \frac{\ln(dm)}{m}\right\}}$.

Proof Assume w.l.o.g. that $\alpha_j > 0$ for all $j$ (if this is not the case, dimensions with $\alpha_j = 0$ can be removed, because $\alpha_j = 0$ implies that $X[j] = 0$ with probability 1). We have

    $\frac{m}{2}R^L_m(U_\mu, S) = \mathbb{E}_\sigma\left[\sup_{w:\langle w,\mu\rangle\le 1}\sum_{i=1}^m \sigma_i\langle w, x_i\rangle\right] = \mathbb{E}_\sigma\left[\sup_{w:\langle w,\mu\rangle\le 1}\left\langle w, \sum_{i=1}^m \sigma_i x_i\right\rangle\right]$,

and the supremum of a linear functional over $U_\mu$ is attained at $\mathbf{0}$ or at a vertex $e_j/\mu[j]$. Therefore, using Massart's lemma (Massart, 2000, Lemma 5.2), denoting $\hat{\alpha}_j = \frac{1}{m}\sum_{i\in[m]} x_i[j]$, and noting that $\sqrt{\sum_i x_i[j]^2} \le \sqrt{\sum_i x_i[j]} = \sqrt{m\hat{\alpha}_j}$ (since $x_i[j] \in [0,1]$), we have

    $R^L_m(U_\mu, S) \le C\sqrt{\frac{\ln(d)}{m}\max_j\frac{\hat{\alpha}_j}{\mu[j]^2}}$.

Taking expectation over $S$ and using Jensen's inequality we obtain

    $R^L_m(U_\mu, D) = \mathbb{E}_S[R^L_m(U_\mu, S)] \le C\sqrt{\frac{\ln(d)}{m}\,\mathbb{E}_S\left[\max_j\frac{\hat{\alpha}_j}{\mu[j]^2}\right]}$.

By Bernstein's inequality (Proposition 1), with probability $1-\delta$ over the choice of $\{x_i\}$, for all $j \in [d]$,

    $\hat{\alpha}_j \le \alpha_j + 2\max\left\{\sqrt{\alpha_j\frac{\ln(d/\delta)}{m}},\; \frac{\ln(d/\delta)}{m}\right\} \le 3\alpha_j + \frac{4\ln(d/\delta)}{m}$.

And, in any case, $\hat{\alpha}_j \le 1$. Therefore, choosing $\delta = 1/m$,

    $\mathbb{E}_S\left[\max_j\frac{\hat{\alpha}_j}{\mu[j]^2}\right] \le \max_j\frac{1}{\mu[j]^2}\left(\delta + 3\alpha_j + \frac{4\ln(dm)}{m}\right)$.

Let $j^*$ be a maximizer of the above. Consider two cases. If $\alpha_{j^*} < \ln(dm)/m$, then

    $\mathbb{E}_S\left[\max_j\frac{\hat{\alpha}_j}{\mu[j]^2}\right] \le \max_j\frac{1}{\mu[j]^2}\cdot\frac{8\ln(dm)}{m}$.

Otherwise, $\delta = 1/m \le \alpha_{j^*}$ and $\ln(dm)/m \le \alpha_{j^*}$, so

    $\mathbb{E}_S\left[\max_j\frac{\hat{\alpha}_j}{\mu[j]^2}\right] \le \max_j\frac{8\alpha_j}{\mu[j]^2}$.

All in all, we have shown

    $R^L_m(U_\mu, D) \le C\max_j\frac{1}{\mu[j]}\sqrt{\frac{\ln(dm)}{m}\max\left\{\alpha_j,\; \frac{\ln(dm)}{m}\right\}}$. ∎

The lemma above can now be used to bound the Rademacher complexity of the linear loss for $D^-$.

Lemma 8 For any distribution $D$ over $(X, Y) \in [0,1]^d \times \{\pm 1\}$, if $dm \ge 3$,

    $R^L_m(U_b(D), D^-) \le C\sqrt{\frac{k\ln(dm)}{m}\max\left\{b,\; \frac{k\ln(dm)}{m}\right\}}$.

Proof Let $\alpha_j = \mathbb{E}_{(X,Y)\sim D^-}[X[j]]$. Let $J = \{ j \in [d] \mid \alpha_j \ge b/k \}$ and $\bar{J} = \{ j \in [d] \mid \alpha_j < b/k \}$. For a vector $v \in \mathbb{R}^d$ and a set $I \subseteq [d]$, denote by $v[I]$ the vector obtained from $v$ by setting the coordinates not in $I$ to zero. Let $((X_1, Y_1), \ldots, (X_m, Y_m)) \sim (D^-)^m$. By the definition of $R^L_m$, with Rademacher random variables $\epsilon_1, \ldots, \epsilon_m$ (see Eq.
(7)), we have

    $\frac{m}{2}R^L_m(U_b(D), D^-) = \mathbb{E}\left[\sup_{w \in U_b(D)}\sum_{i=1}^m \epsilon_i Y_i\langle w, X_i\rangle\right] \le \mathbb{E}\left[\sup_{w \in U_b(D)}\sum_{i=1}^m \epsilon_i Y_i\langle w[J], X_i[J]\rangle\right] + \mathbb{E}\left[\sup_{w \in U_b(D)}\sum_{i=1}^m \epsilon_i Y_i\langle w[\bar{J}], X_i[\bar{J}]\rangle\right]$,

hence

    $R^L_m(U_b(D), D^-) \le R^L_m(U_b(D), D_1) + R^L_m(U_b(D), D_2)$,    (14)

where $D_1$ is the distribution of $(X[J], Y)$ for $(X, Y) \sim D^-$, and $D_2$ is the distribution of $(X[\bar{J}], Y)$. We now bound the two Rademacher complexities on the right-hand side using Lemma 7.

To bound $R^L_m(U_b(D), D_1)$, define $U_\mu$ as in Lemma 7 for $\mu \in \mathbb{R}^d_+$, and define $\mu_1 \in \mathbb{R}^d_+$ by $\mu_1[j] = \alpha_j/b$. It is easy to see that $U_b(D) \subseteq U_{\mu_1}$, therefore $R^L_m(U_b(D), D_1) \le R^L_m(U_{\mu_1}, D_1)$. By Lemma 7 and the definition of $\mu_1$,

    $R^L_m(U_{\mu_1}, D_1) \le C\max_{j \in J}\frac{b}{\alpha_j}\sqrt{\frac{\ln(dm)}{m}\max\left\{\alpha_j,\; \frac{\ln(dm)}{m}\right\}}$.

By the definition of $J$, for all $j \in J$ we have $b/\alpha_j \le k$. It follows that

    $R^L_m(U_{\mu_1}, D_1) \le C\sqrt{\frac{k\ln(dm)}{m}\max\left\{b,\; \frac{k\ln(dm)}{m}\right\}}$.    (15)

To bound $R^L_m(U_b(D), D_2)$, define $\mu_2 \in \mathbb{R}^d_+$ by $\mu_2[j] = \frac{1}{k}$. Note that $U_{\mu_2} = H_{k,\theta}$ and $U_b(D) \subseteq H_{k,\theta}$, hence $R^L_m(U_b(D), D_2) \le R^L_m(U_{\mu_2}, D_2)$. By Lemma 7 and the definition of $\mu_2$,

    $R^L_m(U_{\mu_2}, D_2) \le C\max_{j \in \bar{J}}k\sqrt{\frac{\ln(dm)}{m}\max\left\{\alpha_j,\; \frac{\ln(dm)}{m}\right\}} = C\max_{j \in \bar{J}}\sqrt{\frac{k\ln(dm)}{m}\max\left\{k\alpha_j,\; \frac{k\ln(dm)}{m}\right\}}$.

By the definition of $\bar{J}$, for all $j \in \bar{J}$ we have $k\alpha_j \le b$. Therefore

    $R^L_m(U_{\mu_2}, D_2) \le C\sqrt{\frac{k\ln(dm)}{m}\max\left\{b,\; \frac{k\ln(dm)}{m}\right\}}$.    (16)

Combining Eq. (14), Eq. (15) and Eq. (16), we get the statement of the lemma. ∎

Finally, the bound on $R^L_m(U_b(D), D^-)$ is used in the following theorem to obtain a uniform convergence result for the negative loss for predictors in $U_b(D)$.

Theorem 9 Let $b \ge 0$. There exists a universal constant $C$ such that for any distribution $D$ over $[0,1]^d \times \{\pm 1\}$, with probability $1-\delta$ over samples of size $m$, for any $w \in U_b(D)$,

    $\ell^-(w) \le \hat{\ell}^-(w) + C\left(\sqrt{\frac{k(b + |\bar{r}|)\ln(d/\delta)}{m\hat{p}_-}} + \frac{k\ln(dm\hat{p}_-/\delta)}{m\hat{p}_-}\right)$.    (17)

Proof Define $\varphi:\mathbb{R}\to\mathbb{R}$ by $\varphi(z) = [\bar{r} - z]_+$. Since $\mathbb{P}_{(X,Y)\sim D^-}[Y = -1] = 1$, the Winnow loss on pairs $(X, Y)$ drawn from $D^-$ is exactly $\varphi(Y\langle w, X\rangle)$. Note that $\varphi$ is an application of a 1-Lipschitz function to a translation of the linear loss.
Thus, by the properties of the Rademacher complexity (Bartlett and Mendelson, 2002) and by Lemma 8, we have, for $dm \ge 3$,

    $R_m(U_b(D), D^-) \le R^L_m(U_b(D), D^-) \le C\sqrt{\frac{k\ln(dm)}{m}\max\left\{b,\; \frac{k\ln(dm)}{m}\right\}}$.    (18)

Assume that $\bar{r} \le 0$. By Talagrand's inequality (see, e.g., Boucheron et al., 2005, Theorem 5.4), with probability $1-\delta$ over samples of size $m$ drawn from $D^-$, for all $w \in U_b(D)$,

    $\ell(w) \le \hat{\ell}(w) + 2R_m(U_b(D), D^-) + \sqrt{\frac{2\sup_{w \in U_b(D)}\mathrm{Var}_{D^-}[\ell(X, Y, w)]\ln(1/\delta)}{m}} + \frac{4k\ln(1/\delta)}{m}$.    (19)

To bound $\mathrm{Var}_{D^-}[\ell(X, Y, w)]$, note that $\ell(X, Y, w) \in [0, k]$. In addition, $\mathbb{P}_{D^-}[Y = -1] = 1$, thus with probability 1, $\ell(X, Y, w) = [\bar{r} + \langle w, X\rangle]_+ \le \langle w, X\rangle$, where the last inequality follows from the assumption $\bar{r} \le 0$. Therefore, for any $w \in U_b(D)$,

    $\mathrm{Var}_{D^-}[\ell(X, Y, w)] \le \mathbb{E}[\ell^2(X, Y, w)] \le \mathbb{E}_{D^-}[k\,\ell(X, Y, w)] \le k\,\mathbb{E}_{D^-}[\langle w, X\rangle] \le kb$.    (20)

Combining Eq. (18), Eq. (19) and Eq. (20), we conclude that there exists a universal constant $C$ such that for any $w \in U_b(D)$, if a sample of size $m$ is drawn i.i.d. from $D^-$, then

    $\ell(w) \le \hat{\ell}(w) + C\left(\sqrt{\frac{kb\ln(d/\delta)}{m}} + \frac{k\ln(dm/\delta)}{m}\right)$.

If $\bar{r} > 0$, subtracting the constant $\bar{r}$ from the loss changes neither $\ell(w) - \hat{\ell}(w)$ nor the argument above, up to the additional $k|\bar{r}|$ variance term; thus the same result holds. To get Eq. (17), consider a sample of size $m$ drawn from $D$ instead of $D^-$. In this case, $\ell(w, D^-) = \ell^-(w, D)$, $\hat{\ell}(w, D^-) = \hat{\ell}^-(w, D)$, and the effective sample size for $D^-$ is $m\hat{p}_-$. ∎

We now show that with an appropriate setting of $b$, $\hat{w} \in U_b(D)$ with high probability over the draw of a sample from $D$. First, the following lemma provides a sample-dependent guarantee for $\hat{w}$.

Lemma 10 Let $\hat{w}$ and $\hat{p}_-$ be defined as above, and let $\hat{\mathbb{E}} := \hat{\mathbb{E}}_S$ for the fixed sample $S$ defined above. Then

    $\hat{\mathbb{E}}[\langle\hat{w}, X\rangle \mid Y = -1] \le \frac{r}{\hat{p}_-}$.

Proof Let $m_+ = |\{i \mid y_i = +1\}|$ and $m_- = |\{i \mid y_i = -1\}|$. By the definition of the hinge function and the fact that $\langle x_i, \hat{w}\rangle \ge 0$ for all $i$, we have that

    $\sum_{y_i=-1}\langle x_i, \hat{w}\rangle = \sum_{y_i=-1}(\bar{r} + \langle x_i, \hat{w}\rangle) - m_-\bar{r} \le \sum_{y_i=+1}[r - \langle x_i, \hat{w}\rangle]_+ + \sum_{y_i=-1}[\bar{r} + \langle x_i, \hat{w}\rangle]_+ - m_-\bar{r} = \sum_{i\in[m]}\ell(x_i, y_i, \hat{w}) - m_-\bar{r}$.

By the optimality of $\hat{w}$, $\sum_{i\in[m]}\ell(x_i, y_i, \hat{w}) \le \sum_{i\in[m]}\ell(x_i, y_i, \mathbf{0}) = m_+ r + m_-[\bar{r}]_+$.
Therefore

    $\sum_{y_i=-1}\langle x_i, \hat{w}\rangle \le m_+ r + m_-([\bar{r}]_+ - \bar{r}) = m_+ r + m_-[-\bar{r}]_+ \le (m_+ + m_-)r = mr$,

where we have used the definitions of $r$ and $\bar{r}$ to conclude that $[-\bar{r}]_+ \le r$. Dividing both sides by $m_-$ we conclude our proof. ∎

The following lemma allows converting the sample-dependent restriction on $\hat{w}$ given in Lemma 10 to one that holds with high probability over samples.

Lemma 11 For any distribution over $[0,1]^d$, with probability $1-\delta$ over samples of size $n$, for any $w \in H_{k,\theta}$,

    $\mathbb{E}[\langle w, X\rangle] \le 2\hat{\mathbb{E}}[\langle w, X\rangle] + \frac{16k\ln(d/\delta)}{n}$.

Proof For every $j \in [d]$, denote $\alpha_j = \mathbb{E}[X[j]]$ and $\hat{\alpha}_j = \hat{\mathbb{E}}[X[j]]$. By Bernstein's inequality (Proposition 1), with probability $1-\delta$,

    $\alpha_j \le \hat{\alpha}_j + 2\max\left\{\sqrt{\alpha_j\frac{\ln(1/\delta)}{n}},\; \frac{\ln(1/\delta)}{n}\right\} \le \hat{\alpha}_j + \max\left\{\frac{\alpha_j}{2},\; \frac{8\ln(1/\delta)}{n}\right\}$,

where the last inequality can be verified by considering the cases $\alpha_j \ge \frac{16\ln(1/\delta)}{n}$ and $\alpha_j \le \frac{16\ln(1/\delta)}{n}$. Applying the union bound over $j \in [d]$, we obtain that with probability $1-\delta$ over samples of size $n$, for any $w \in H_{k,\theta}$,

    $\mathbb{E}[\langle w, X\rangle] = \langle w, \alpha\rangle \le \langle w, \hat{\alpha}\rangle + \frac{\langle w, \alpha\rangle}{2} + \|w\|_1\frac{8\ln(d/\delta)}{n} \le \hat{\mathbb{E}}[\langle w, X\rangle] + \frac{1}{2}\mathbb{E}[\langle w, X\rangle] + \frac{8k\ln(d/\delta)}{n}$.

Thus $\mathbb{E}[\langle w, X\rangle] \le 2\hat{\mathbb{E}}[\langle w, X\rangle] + \frac{16k\ln(d/\delta)}{n}$. ∎

Combining the two lemmas above, we conclude that with high probability, $\hat{w} \in U_b(D)$ for an appropriate setting of $b$.

Lemma 12 If $p_- \ge \frac{8\ln(1/\delta)}{m}$, then with probability $1-\delta$ over samples of size $m$, $\hat{w} \in U_b(D)$, where

    $b := \frac{4r}{p_-} + \frac{32k\ln(2d/\delta)}{mp_-}$.    (21)

Proof Apply Lemma 11 to $D^-$. With probability $1-\delta$ over samples of size $n$ drawn from $D^-$,

    $\mathbb{E}_{D^-}[\langle w, X\rangle] \le 2\hat{\mathbb{E}}_{D^-}[\langle w, X\rangle] + \frac{16k\ln(d/\delta)}{n}$.

Now, consider a sample of size $m$ drawn according to $D$. Then $\mathbb{E}_{D^-}[\cdot] = \mathbb{E}_D[\cdot \mid Y = -1]$, and $n = m\hat{p}_-$. Therefore, with probability $1-2\delta$,

    $\mathbb{E}[\langle\hat{w}, X\rangle \mid Y = -1] \le 2\hat{\mathbb{E}}[\langle\hat{w}, X\rangle \mid Y = -1] + \frac{16k\ln(d/\delta)}{m\hat{p}_-} \le \frac{2r}{\hat{p}_-} + \frac{16k\ln(d/\delta)}{m\hat{p}_-} \le \frac{4r}{p_-} + \frac{32k\ln(d/\delta)}{mp_-}$,

where the second inequality follows from Lemma 10, and the last inequality follows from the assumption on $p_-$ and Proposition 2. ∎

This lemma shows that to bound the sample complexity of an ERM algorithm for the Winnow loss, it suffices to bound the convergence rates of the empirical loss for $w \in U_b(D)$, with $b$ defined as in Eq. (21).
Thus, we will be able to use Theorem 9 to bound the convergence of the loss on negative examples.

4.2 Convergence on Positive Labels

For positive labels, we show a uniform convergence result that holds for the entire class Hk,θ. The idea of the proof is as follows. First, following a technique in the spirit of the one given by Zhang (2002), we show that the regret bound for the online learning algorithm presented in Section 3 can be used to construct a small cover of the set of loss functions parameterized by Hk,θ. Second, we convert the bound on the size of the cover to a bound on the Rademacher complexity, thus showing a uniform convergence result. This argument relies on a refinement of Dudley's entropy bound (Dudley, 1967), which is stated in explicit terms by Srebro et al. (2010, Lemma A.3).

We first observe that by Theorem 4, if the conditions of the theorem hold and there is a u such that f_t(u) = 0 for all t, then

  Σ_{t=1}^T f_t(w_t) ≤ 4√( αβ k ln(d) T ).   (23)

Let k ≥ r ≥ 0 be two real numbers and let W ⊆ R+^d. Let φ_w denote the function defined by φ_w(x, y) = ℓ(x, y, w), and consider the class of functions Φ_W = {φ_w | w ∈ W}. Given S = ((x₁, y₁), ..., (x_m, y_m)), where x_i ∈ [0, 1]^d and y_i ∈ {±1}, we say that (Φ_W, S) is (∞, ϵ)-properly-covered by a set V ⊆ Φ_W if for any f ∈ Φ_W there is a g ∈ V such that

  ∥(f(x₁, y₁), ..., f(x_m, y_m)) − (g(x₁, y₁), ..., g(x_m, y_m))∥_∞ ≤ ϵ.

We denote by N_∞(W, S, ϵ) the minimal integer N such that there exists a V ⊆ Φ_W of size N that (∞, ϵ)-properly-covers (Φ_W, S). The following lemma bounds this covering number for Φ_{Hk,θ}, for sets S with all-positive labels y_i.

Lemma 13 Let S = ((x₁, 1), ..., (x_m, 1)), where x_i ∈ [0, 1]^d. Then ln N_∞(Hk,θ, S, ϵ) ≤ 16 rk ln(d) ln(3m)/ϵ².

Proof We use a technique in the spirit of the one given by Zhang (2002). Fix some u with u ≥ 0 and ∥u∥₁ ≤ k. For each i let

  g_i^u(w) = |⟨w, x_i⟩ − ⟨u, x_i⟩|  if ⟨u, x_i⟩ ≤ r,  and  g_i^u(w) = [r − ⟨w, x_i⟩]+  otherwise,

and define the function G^u(w) = max_i g_i^u(w).
It is easy to verify that for any w,

  ∥(φ_w(x₁, 1), ..., φ_w(x_m, 1)) − (φ_u(x₁, 1), ..., φ_u(x_m, 1))∥_∞ ≤ G^u(w).

Now, clearly, G^u(u) = 0. In addition, for any w ≥ 0, a sub-gradient of G^u at w is obtained by choosing an i that maximizes g_i^u(w) and then taking a sub-gradient of g_i^u, which is of the form z = αx_i where α ∈ {−1, 0, 1}. If α ∈ {−1, 1}, it is easy to verify that

  Σ_j w[j] z[j]² ≤ ⟨w, x_i⟩ ≤ g_i^u(w) + r = G^u(w) + r.

If α = 0 then clearly Σ_j w[j] z[j]² ≤ G^u(w) + r as well. We can now use Eq. (23) by setting f_t = G^u for all t, setting α = 1 and β = r in Eq. (9), and noting that since x_i ∈ [0, 1]^d, we have z_t ∈ [−1, 1]^d for all t. If η ≤ 1 we have η|z_t[i]| ≤ 1 for all t, i as needed. Since η = √( k ln(d)/(rT) ), this holds for all T ≥ k ln(d)/r. We conclude that if we run the unnormalized EG algorithm with T ≥ k ln(d)/r and η and λ as required, we get

  Σ_{t=1}^T G^u(w_t) ≤ 4√( rk ln(d) T ).

Denote w̄^u = (1/T) Σ_t w_t. Dividing by T and using Jensen's inequality we conclude that

  G^u(w̄^u) ≤ 4√( rk ln(d)/T ).

Setting ϵ = 4√( rk ln(d)/T ), it follows that the following set is an (∞, ϵ)-proper-cover for (Φ_{Hk,θ}, S): V = {φ_{w̄^u} | u ∈ Hk,θ}. It now only remains to bound the size of V. Consider again the unnormalized EG algorithm. Since z_t = αx_i for some α ∈ {−1, 0, +1} and i ∈ {1, ..., m}, at each round of the algorithm there are only two choices to be made: the value of i and the value of α. Therefore, the number of different vectors produced by running unnormalized EG for T iterations on G^u for different values of u is at most (3m)^T. Thus |V| ≤ (3m)^T. By our definition of ϵ, ln |V| ≤ T ln(3m) = 16 rk ln(d) ln(3m)/ϵ². This concludes our proof.

Using this result we can bound from above the covering number defined using the Euclidean norm: We say that (Φ_W, S) is (2, ϵ)-properly-covered by a set V ⊆ Φ_W if for any f ∈ Φ_W there is a g ∈ V such that

  (1/√m) ∥(f(x₁, y₁), ..., f(x_m, y_m)) − (g(x₁, y₁), ..., g(x_m, y_m))∥₂ ≤ ϵ.
We denote by N₂(W, S, ϵ) the minimal integer N such that there exists a V ⊆ Φ_W of size N that (2, ϵ)-properly-covers (Φ_W, S). It is easy to see that for any two vectors u, v ∈ R^m, (1/√m)∥u − v∥₂ ≤ ∥u − v∥_∞. It follows that for any W and S, we have N₂(W, S, ϵ) ≤ N_∞(W, S, ϵ). The N₂ covering number can be used to bound the Rademacher complexity of (Φ_W, S) using a refinement of Dudley's entropy bound (Dudley, 1967), which is stated explicitly by Srebro et al. (2010, Lemma A.3). The lemma states that for any ϵ ≥ 0,

  R(W, S) ≤ 4ϵ + (10/√m) ∫_ϵ^B √( ln N₂(W, S, γ) ) dγ,

where B is an upper bound on the possible values of f ∈ Φ_W on members of S. For S with all-positive labels we clearly have B ≤ r. Combining this with Lemma 13, we get

  R(Hk,θ, S) ≤ C( ϵ + (1/√m) ∫_ϵ^r √( 16 rk ln(d) ln(3m) ) / γ dγ ) = C( ϵ + √( rk ln(d) ln(3m)/m ) · ln(r/ϵ) ).

Setting ϵ = √(rk/m) we get

  R(Hk,θ, S) ≤ C′ √( rk ln(d) ln³(3m)/m ).

Thus, for any distribution D over [0, 1]^d × {±1} that draws only positive labels, we have

  Rm(Hk,θ, D) ≤ C′ √( rk ln(d) ln³(3m)/m ).

By Rademacher sample complexity bounds (Bartlett and Mendelson, 2002), and since ℓ for positive labels is bounded by r, we can immediately conclude the following:

Theorem 14 Let k ≥ r ≥ 0. For any distribution D over [0, 1]^d × {±1} that draws only positive labels, with probability 1 − δ over samples of size m, for any w ∈ Hk,θ,

  ℓ+(w) ≤ ℓ̂+(w) + C √( rk(ln(d) ln³(3m) + ln(1/δ)) / m ).

4.3 Combining Negative and Positive Losses

We have shown separate convergence rate results for the loss on positive labels and for the loss on negative labels. We now combine these results to obtain a convergence rate upper bound for the full Winnow loss. To do this, the convergence results given above must be adapted to take into account the fraction of positive and negative labels in the true distribution as well as in the sample. The following theorems accomplish this for the negative and the positive cases. First, a bound is provided for the positive part of the loss.
Theorem 15 There exists a universal constant C such that for any distribution D over [0, 1]^d × {±1}, with probability 1 − δ over samples of size m,

  p+ ℓ+(ŵ) ≤ p̂+ ℓ̂+(ŵ) + C √( rk(ln(d) ln³(3m) + ln(3/δ)) / m ).

Proof First, if p+ ≤ 8 ln(1/δ)/m then the theorem trivially holds. Therefore we assume that p+ ≥ 8 ln(1/δ)/m. We have

  p+ ℓ+(ŵ) = p̂+ ℓ̂+(ŵ) + (p+ − p̂+) ℓ̂+(ŵ) + p+ (ℓ+(ŵ) − ℓ̂+(ŵ)).   (24)

To prove the theorem, we will bound the two rightmost terms. First, to bound (p+ − p̂+) ℓ̂+(ŵ), note that by the definition of the loss function for positive labels we have ℓ̂+(ŵ) ∈ [0, r]. Therefore, Bernstein's inequality (Proposition 1) implies that with probability 1 − δ/3,

  (p+ − p̂+) ℓ̂+(ŵ) ≤ 2r √( max{ p+, ln(3/δ)/m } · ln(3/δ)/m ).   (25)

Second, to bound p+ (ℓ+(ŵ) − ℓ̂+(ŵ)), we apply Theorem 14 to the conditional distribution induced by D on X given Y = 1, to get that with probability 1 − δ/3,

  p+ (ℓ+(ŵ) − ℓ̂+(ŵ)) ≤ p+ C √( rk(ln(d) ln³(3m) + ln(3/δ)) / (m p̂+) ).

Using our assumption on p+ we obtain from Proposition 2 that with probability 1 − δ/3, p+/p̂+ ≤ 2. Therefore, p+/√p̂+ ≤ √(2p+) ≤ √2. Thus, with probability 1 − 2δ/3,

  p+ (ℓ+(ŵ) − ℓ̂+(ŵ)) ≤ C′ √( rk(ln(d) ln³(3m) + ln(3/δ)) / m ).   (26)

Combining Eq. (24), Eq. (25) and Eq. (26) and applying the union bound, we get the theorem.

Second, a bound is provided for the negative part of the loss.

Theorem 16 There exists a universal constant C such that for any distribution D over [0, 1]^d × {±1}, with probability 1 − δ over samples of size m,

  p− ℓ−(ŵ) ≤ p̂− ℓ̂−(ŵ) + C( √( rk ln(dm/δ)/m ) + k ln(dm/δ)/m ).

Proof First, if p− ≤ 8 ln(1/δ)/m then the theorem trivially holds (since ℓ−(ŵ) ∈ [0, r + k]). Therefore we assume that p− ≥ 8 ln(1/δ)/m. Thus, by Proposition 2, p̂− ≥ p−/2. We have

  p− ℓ−(ŵ) = p̂− ℓ̂−(ŵ) + (p− − p̂−) ℓ̂−(ŵ) + p− (ℓ−(ŵ) − ℓ̂−(ŵ)).   (28)

To prove the theorem, we will bound the two rightmost terms.
First, to bound (p− − p̂−) ℓ̂−(ŵ), note that by Bernstein's inequality (Proposition 1) and our assumption on p−, with probability 1 − δ,

  p− − p̂− ≤ 2√( max{ p−, ln(1/δ)/m } · ln(1/δ)/m ) = 2√( p− ln(1/δ)/m ).

By Lemma 10 and Proposition 2, ℓ̂−(ŵ) ≤ 2r/p−. In addition, by definition ℓ̂−(ŵ) ≤ r + k ≤ 2k. Therefore,

  (p− − p̂−) ℓ̂−(ŵ) ≤ 4 min{ r/p−, k } √( p− ln(1/δ)/m ).   (29)

Now, if k > 2r/p−, then the right-hand side of the above becomes

  4 (r/√p−) √( ln(1/δ)/m ) ≤ 4 √( kr ln(1/δ)/(2m) ).

Otherwise, k ≤ 2r/p−, so that p− ≤ 2r/k, and the right-hand side of Eq. (29) becomes

  4k √( p− ln(1/δ)/m ) ≤ 4 √( 2rk ln(1/δ)/m ).

All in all, we have shown that

  (p− − p̂−) ℓ̂−(ŵ) ≤ 8 √( rk ln(1/δ)/m ).   (30)

Second, to bound p− (ℓ−(ŵ) − ℓ̂−(ŵ)), recall that by Lemma 12, we have ŵ ∈ Ub(D), where

  b = 4r/p− + 32k ln(d/δ)/(m p−) ≤ (32/p−)( r + k ln(d/δ)/m ).

Thus, by Theorem 9, with probability 1 − δ,

  ℓ−(ŵ) ≤ ℓ̂−(ŵ) + C( √( bk ln(dm/δ)/(m p̂−) ) + k ln(dm/δ)/(m p̂−) ).

Since p̂− ≥ p−/2,

  ℓ−(ŵ) ≤ ℓ̂−(ŵ) + C′( √( bk ln(dm/δ)/(m p−) ) + k ln(dm/δ)/(m p−) )

for some other constant C′. Therefore, substituting b for its upper bound, we get

  p− (ℓ−(ŵ) − ℓ̂−(ŵ)) ≤ C″( √( rk ln(dm/δ)/m ) + k ln(dm/δ)/m ).   (31)

Combining Eq. (28), Eq. (30) and Eq. (31) we get the statement of the theorem.

Finally, we prove our main result for the sample complexity of ERM algorithms for Winnow.

Proof (of Theorem 6) From Theorem 15 and Theorem 16 we conclude that with probability 1 − δ,

  ℓ(ŵ) = p− ℓ−(ŵ) + p+ ℓ+(ŵ) ≤ p̂− ℓ̂−(ŵ) + p̂+ ℓ̂+(ŵ) + O( √( rk(ln(d) ln³(3m) + ln(1/δ)) / m ) ).   (32)

Now,

  p̂− ℓ̂−(ŵ) + p̂+ ℓ̂+(ŵ) = ℓ̂(ŵ) ≤ ℓ̂(w*).   (33)

We have E[ℓ(X, Y, w*)] = ℓ(w*) ≤ ℓ(0) ≤ r. By Bernstein's inequality (Proposition 1), with probability 1 − δ,

  ℓ̂(w*) = Ê[ℓ(X, Y, w*)] ≤ ℓ(w*) + 2r √( max{ ℓ(w*)/r, ln(1/δ)/m } · ln(1/δ)/m ) ≤ ℓ(w*) + 2√( r ℓ(w*) ln(1/δ)/m ) + 2r ln(1/δ)/m.

Combining this with Eq. (33), we get that with probability 1 − δ,

  p̂− ℓ̂−(ŵ) + p̂+ ℓ̂+(ŵ) ≤ ℓ(w*) + 2√( r ℓ(w*) ln(1/δ)/m ) + 2r ln(1/δ)/m.

In light of Eq. (32), we conclude Eq. (13).

Theorem 6 shows that using empirical risk minimization, the loss of the obtained predictor converges to the loss of the optimal predictor at a rate of the order of √(rk/m). Up to logarithmic factors, this is the best possible rate for learning in the generalized Winnow setting.
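To make the gap between the two rates concrete, here is a small numeric illustration of our own (not from the paper), dropping constants and logarithmic factors: the sample size needed for excess loss ϵ scales as θk/ϵ² under the ERM analysis, versus k²/ϵ² under a uniform-convergence analysis, a factor of k/θ apart.

```python
def sample_sizes(k, theta, eps):
    """Orders of magnitude (constants and log factors dropped) of the sample
    size needed for excess loss eps under each analysis."""
    m_erm = theta * k / eps**2   # ERM rate: sqrt(theta*k/m) <= eps
    m_uniform = k**2 / eps**2    # uniform convergence rate: sqrt(k^2/m) <= eps
    return m_erm, m_uniform

m_erm, m_uniform = sample_sizes(k=100, theta=1.0, eps=0.1)
# uniform convergence is worse by a factor of k/theta = 100
assert round(m_uniform / m_erm) == 100
```

For a disjunction-like setting with small threshold θ and large k, the difference between the two analyses is thus a full factor of k.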
This is shown in the next section, in Theorem 17. We also show, in Theorem 21, that this rate cannot be obtained via a standard uniform convergence analysis.

5. Lower Bounds

In this section we provide lower bounds for the learning rate and for the uniform convergence rate of the Winnow loss ℓθ.

5.1 Learning Rate Lower Bound

Fix a threshold θ. The best Winnow loss for a distribution D over [0, 1]^d × {±1} using a hyperplane from a set W ⊆ R+^d is denoted by ℓ*θ(W) = min_{w∈W} ℓθ(w). The following result shows that even if the data domain is restricted to the discrete domain {0, 1}^d, the number of samples required for learning with the Winnow loss grows at least linearly in θk. This resolves an open question posed by Littlestone (1988).

Theorem 17 Let k ≥ 1 and let θ ∈ [1, k/2]. The sample complexity of learning Hk,θ with respect to the loss ℓθ is Ω(θk/ϵ²). That is, for all ϵ ∈ (0, 1/2), if the training set size is m = o(θk/ϵ²), then for any learning algorithm, there exists a distribution such that the classifier h : {0, 1}^d → R+ that the algorithm outputs upon receiving m i.i.d. examples satisfies ℓθ(h) − ℓ*θ(Hk,θ) > ϵ with probability at least 1/4.

The construction which shows the lower bound proceeds in several stages. First, we prove that there exists a set of size k² in {±1}^{k²} which is shattered with respect to the linear loss by predictors with an ℓ1-norm bounded by k. Then, we apply a transformation to this construction to obtain a set in {0, 1}^{2k²+1} which is shattered with respect to the linear loss with a threshold of k/2. In the next step, we adapt the construction to hold for any value of the threshold. Finally, we use the resulting construction to prove Theorem 17.

The construction uses the notion of a Hadamard matrix. A Hadamard matrix of order n is an n × n matrix H_n with entries in {±1} such that H_n H_nᵀ = n·I_n. In other words, all rows of the matrix are orthogonal to each other.
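The Sylvester doubling construction produces a Hadamard matrix of every power-of-2 order. The sketch below is our own illustration (numpy assumed): it builds H_n this way, checks the orthogonality property, and, anticipating the shattering argument, verifies that the weight vector w = (1/d) Σ_j y[j] x_j built from the rows x₁, ..., x_d attains y[i]⟨w, x_i⟩ = 1 for an arbitrary sign pattern y, with ∥w∥₁ ≤ k.

```python
import numpy as np

def sylvester_hadamard(n):
    """Hadamard matrix of order n (n a power of 2), by Sylvester doubling:
    H_{2n} = [[H_n, H_n], [H_n, -H_n]]."""
    H = np.array([[1]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

k = 4                       # a power of 2
d = k * k
H = sylvester_hadamard(d)   # rows x_1, ..., x_d in {+-1}^d
assert np.array_equal(H @ H.T, d * np.eye(d))  # rows mutually orthogonal

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=d)       # an arbitrary labeling
w = (y[:, None] * H).mean(axis=0)     # w = (1/d) * sum_j y[j] * x_j
assert np.allclose(y * (H @ w), 1.0)  # every point classified with margin 1
assert np.abs(w).sum() <= k + 1e-9    # ||w||_1 <= sqrt(d) = k
```

Since ⟨x_i, x_j⟩ = 0 for i ≠ j, only the j = i term survives in ⟨w, x_i⟩, which is exactly the computation carried out in the proof below.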
Hadamard matrices exist at least for every n which is a power of 2 (Sylvester, 1867). The first lemma constructs a shattered set for the linear loss on {±1}^{k²}.

Lemma 18 Assume k is a power of 2, and let d = k². Let x₁, ..., x_d ∈ {±1}^d be the rows of the Hadamard matrix of order d. For every y ∈ {±1}^d, there exists a w ∈ W = {w ∈ [−1, 1]^d | ∥w∥₁ ≤ k} such that for all i ∈ [d], y[i]⟨w, x_i⟩ = 1.

Proof By the definition of a Hadamard matrix, for all i ≠ j, ⟨x_i, x_j⟩ = 0. Given y ∈ {±1}^d, set w = (1/d) Σ_{j∈[d]} y[j] x_j. Then for each i,

  y[i]⟨w, x_i⟩ = y[i] (1/d) Σ_{j∈[d]} y[j] ⟨x_i, x_j⟩ = (1/d) y[i]² ⟨x_i, x_i⟩ = (1/d) ∥x_i∥₂² = 1.

It is left to show that w ∈ W. First, for all i ∈ [d], we have |w[i]| = |(1/d) Σ_{j∈[d]} y[j] x_j[i]| ≤ (1/d) Σ_{j∈[d]} |x_j[i]| = 1, which yields w ∈ [−1, 1]^d. Second, using ∥w∥₁ ≤ √d ∥w∥₂ and

  ∥w∥₂² = ⟨w, w⟩ = (1/d²) Σ_{i,j∈[d]} y[i] y[j] ⟨x_i, x_j⟩ = (1/d²) Σ_{i∈[d]} y[i]² ⟨x_i, x_i⟩ = (1/d²) Σ_{i∈[d]} d = 1,

we obtain that ∥w∥₁ ≤ √d = k.

The next lemma transforms the construction from Lemma 18 to a linear loss with a threshold of k/2.

Lemma 19 Let k be a power of 2 and let d = 2k² + 1. There is a set {x̃₁, ..., x̃_{k²}} ⊆ {0, 1}^d such that for every y ∈ {±1}^{k²}, there exists a w̃ ∈ Hk,θ such that for all i ∈ [k²], y[i](⟨w̃, x̃_i⟩ − k/2) = 1/2.

Proof From Lemma 18 we have that there is a set X = {x₁, ..., x_{k²}} ⊆ {±1}^{k²} such that for each labeling y ∈ {±1}^{k²}, there exists a w_y ∈ [−1, 1]^{k²} with ∥w_y∥₁ ≤ k such that for all i ∈ [k²], y[i]⟨w_y, x_i⟩ = 1. We now define a new set X̃ = {x̃₁, ..., x̃_{k²}} ⊆ {0, 1}^d based on X that satisfies the requirements of the lemma. For each i ∈ [k²] let

  x̃_i = [ (1 + x_i)/2, (1 − x_i)/2, 1 ],

where [·, ·, ·] denotes a concatenation of vectors. In words, each of the first k² coordinates of x̃_i is 1 if the corresponding coordinate of x_i is 1, and zero otherwise; each of the next k² coordinates of x̃_i is 1 if the corresponding coordinate of x_i is −1, and zero otherwise; and the last coordinate of x̃_i is always 1. Now, let y ∈ {±1}^{k²} be a desired labeling.
We define w̃_y based on w_y as follows:

  w̃_y = [ [w_y]+, [−w_y]+, (k − ∥w_y∥₁)/2 ],

where by z = [v]+ we mean that z[j] = max{v[j], 0}. In words, the first k² coordinates of w̃_y are copies of the positive coordinates of w_y, with zero in the negative coordinates; the next k² coordinates of w̃_y are the absolute values of the negative coordinates of w_y, with zero in the positive coordinates; and the last coordinate is a scaling term. We now show that w̃_y has the desired property on X̃. For each i ∈ [k²],

  ⟨w̃_y, x̃_i⟩ = (⟨w_y, x_i⟩ + ∥w_y∥₁)/2 + (k − ∥w_y∥₁)/2 = ⟨w_y, x_i⟩/2 + k/2.

It follows that y[i](⟨w̃_y, x̃_i⟩ − k/2) = y[i]²/2 = 1/2. Now, clearly w̃_y ∈ R+^d. In addition,

  ∥w̃_y∥₁ = ∥w_y∥₁ + (k − ∥w_y∥₁)/2 = (∥w_y∥₁ + k)/2 ≤ k.

Hence w̃_y ∈ Hk,θ as desired.

The last lemma adapts the previous construction to hold for any threshold.

Lemma 20 Let z be a power of 2 and let k be such that z divides k. Let d = 2kz + k/z. There is a set {x̃₁, ..., x̃_{zk}} ⊆ {0, 1}^d such that for every y ∈ {±1}^{zk}, there exists a w ∈ Hk,θ such that for all i ∈ [zk], y[i](⟨w, x̃_i⟩ − z/2) = 1/2.

Proof By Lemma 19 there is a set X = {x₁, ..., x_{z²}} ⊆ {0, 1}^{2z²+1} such that for all y ∈ {±1}^{z²}, there exists a w_y ∈ R+^{2z²+1} with ∥w_y∥₁ ≤ z such that for all i ∈ [z²], y[i](⟨w_y, x_i⟩ − z/2) = 1/2. We now construct a new set X̃ = {x̃₁, ..., x̃_{zk}} ⊆ {0, 1}^{2kz+k/z} as follows: For i ∈ [zk], let n = ⌊i/z²⌋ and m = i mod z², so that i = nz² + m. The vector x̃_i is the concatenation of kz/z² = k/z vectors, each of dimension 2z² + 1, where all the vectors are the all-zeros vector except the (n+1)-th vector, which equals x_{m+1}. That is,

  x̃_i = [ 0, ..., 0, x_{m+1}, 0, ..., 0 ] ∈ {0, 1}^{(k/z)(2z²+1)},

where x_{m+1} occupies block n + 1. Given y ∈ {±1}^{kz}, let us rewrite it as a concatenation of k/z vectors, each in {±1}^{z²}, namely y = [y(1), ..., y(k/z)]. Define w̃_y as the concatenation of the k/z corresponding vectors given by Lemma 19:

  w̃_y = [ w_{y(1)}, ..., w_{y(k/z)} ] ∈ R+^{(k/z)(2z²+1)}.
For each i such that n = ⌊i/z²⌋ and m = i mod z², we have

  ⟨w̃_y, x̃_i⟩ − z/2 = ⟨w_{y(n+1)}, x_{m+1}⟩ − z/2 = (1/2) y(n+1)[m+1].

Now y(n+1)[m+1] = y[i], thus we get y[i](⟨w̃_y, x̃_i⟩ − z/2) = 1/2 as desired. Finally, we observe that ∥w̃_y∥₁ = Σ_{n∈[k/z]} ∥w_{y(n)}∥₁ ≤ (k/z)·z = k, hence w̃_y ∈ Hk,θ.

Finally, the construction above is used to prove the convergence rate lower bound.

Proof (of Theorem 17) Let k ≥ 1 and θ ∈ [1, k/2]. Define z̄ = 2θ. Let n = max{n | 2ⁿ ≤ z̄} and let m = max{m | m2ⁿ ≤ k}. Define z̃ = 2ⁿ and k̃ = m2ⁿ. We have that z̃ is a power of 2 and z̃ divides k̃. Let d̃ = 2k̃z̃ + k̃/z̃. By Lemma 20, there is a set X = {x₁, ..., x_{z̃k̃}} ⊆ {0, 1}^{d̃} such that for every y ∈ {±1}^{|X|}, there exists a w̃_y ∈ Hk,θ such that for all i ∈ [z̃k̃], y[i](⟨w̃_y, x_i⟩ − z̃/2) = 1/2. Now, let d = d̃ + 1, and define w̄_y = [w̃_y, (z̄ − z̃)/2] and x̄_i = [x_i, 1]. It follows that

  y[i](⟨w̄_y, x̄_i⟩ − θ) = y[i](⟨w̃_y, x_i⟩ + (z̄ − z̃)/2 − z̄/2) = y[i](⟨w̃_y, x_i⟩ − z̃/2) = 1/2.

We conclude that for all i ∈ [z̃k̃], ℓθ(x̄_i, y[i], w̄_y) = 0 and ℓθ(x̄_i, −y[i], w̄_y) = 1. Moreover, sign(⟨w̄_y, x̄_i⟩ − θ) = y[i]. Now, for a given w define h_w(x) = sign(⟨w, x⟩ − θ), and consider the binary hypothesis class H = {h_w | w ∈ Hk,θ} over the domain X̄ = {x̄₁, ..., x̄_{z̃k̃}}. Our construction of w̄_y shows that the set X̄ is shattered by this hypothesis class, thus its VC dimension is at least |X̄|. By VC-dimension lower bounds (e.g., Anthony and Bartlett, 1999, Theorem 5.2), it follows that for any learning algorithm for H, if the training set size is o(|X̄|/ϵ²), then there exists a distribution over X̄ × {±1} so that with probability greater than 1/64, the output ĥ of the algorithm satisfies

  P[ĥ(X) ≠ Y] > min_{w∈Hk,θ} P[h_w(X) ≠ Y] + ϵ.   (34)

Next, we show that the existence of a learning algorithm for Hk,θ with respect to ℓθ whose sample complexity is o(|X̄|/ϵ²) would contradict the above statement. Indeed, let w* be a minimizer of the right-hand side of Eq. (34), and let y* be the vector of predictions of w* on X̄.
As our construction of w̄_y shows, we have ℓθ(w̄_{y*}) = P[h_{w*}(X) ≠ Y]. Now, suppose that some algorithm learns a ŵ ∈ Hk,θ so that ℓθ(ŵ) ≤ ℓ*θ(Hk,θ) + ϵ. This implies that

  ℓθ(ŵ) ≤ ℓθ(w̄_{y*}) + ϵ = P[h_{w*}(X) ≠ Y] + ϵ.

In addition, define a (probabilistic) classifier ĥ that outputs the label +1 with probability p(ŵ, x), where p(ŵ, x) = min{1, max{0, 1/2 + (⟨ŵ, x⟩ − θ)}}. Then it is easy to verify that

  P[ĥ(x) ≠ y] ≤ ℓθ(x, y, ŵ).

Therefore P[ĥ(X) ≠ Y] ≤ ℓθ(ŵ), and we obtain that

  P[ĥ(X) ≠ Y] ≤ P[h_{w*}(X) ≠ Y] + ϵ,

which leads to the desired contradiction.

We next show that the uniform convergence rate for our problem is in fact slower than the achievable learning rate.

5.2 Uniform Convergence Lower Bound

The next theorem shows that the rate of uniform convergence for our problem is asymptotically slower than the rate of convergence of the empirical loss minimizer given in Theorem 6, even if the drawn label in a random pair is negative with probability 1. This indicates that a more subtle argument than uniform convergence is indeed needed to show that ERM learns at a rate of O(√(θk/m)), as done in Section 4.

Theorem 21 Let k ≥ 1, and assume θ ≤ k/2. There exists a distribution D over {0, 1}^{2k²+1} × Y such that for all x, P[Y = −1 | X = x] = 1 and ℓ*(Hk,θ, D) = [r−]+, and such that with probability at least 1/2 over samples S ∼ Dᵐ,

  ∃w ∈ Hk,θ, |ℓ(w, S) − ℓ(w, D)| ≥ Ω(√(k²/m)).   (35)

This claim may seem similar to well-known uniform convergence lower bounds for classes with a bounded VC dimension (see, e.g., Anthony and Bartlett, 1999, Chapter 5). However, these standard results rely on constructions with non-realizable distributions, while Theorem 21 asserts the existence of a realizable distribution which exhibits this lower bound. To prove this theorem we first show two useful lemmas. The first lemma shows that a lower bound on the uniform convergence of a function class can be derived from a lower bound on the Rademacher complexity of a related function class.
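Before stating the lemmas, the phenomenon they capture can be checked empirically. For the class of all binary functions on [n] under the uniform distribution, the worst-case empirical deviation has a closed form: sup_h |E_S[h] − E_D[h]| = max( Σ_j [p̂_j − 1/n]+, Σ_j [1/n − p̂_j]+ ), where p̂_j is the empirical frequency of element j, since the supremum is attained by the indicator of the over-represented (or under-represented) elements. The simulation below is our own sketch (numpy assumed); it confirms that this deviation stays above the √(n/(512m)) level of Lemma 23.

```python
import numpy as np

def sup_deviation(n, m, rng):
    """sup over all binary h: [n] -> {0,1} of |E_S[h] - E_D[h]|,
    for a sample S of size m drawn uniformly from [n]."""
    counts = np.bincount(rng.integers(0, n, size=m), minlength=n)
    p_hat = counts / m
    excess = p_hat - 1.0 / n
    # attained by h = indicator of {j : excess_j > 0}, or its complement
    return max(excess[excess > 0].sum(), -excess[excess < 0].sum())

rng = np.random.default_rng(0)
n, m = 100, 3200
dev = sup_deviation(n, m, rng)
assert dev >= (n / (512 * m)) ** 0.5  # consistent with the Omega(sqrt(n/m)) bound
```

With n = 100 and m = 3200 the threshold √(n/(512m)) is below 0.01, while the typical worst-case deviation is an order of magnitude larger, so the assertion is comfortably satisfied.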
Lemma 22 Let Z be a set, and consider a function class F ⊆ [0, 1]^Z. Let D be a distribution over Z. Let F̃ = {(x₁, x₂) ↦ f(x₁) − f(x₂) | f ∈ F}. With probability at least 1 − δ over samples S ∼ Dᵐ,

  ∃f ∈ F, |E_{X∼S}[f(X)] − E_{X∼D}[f(X)]| ≥ (1/4) Rm(F̃, D × D) − √( ln(1/δ)/(2m) ).   (36)

Proof Denote E[f, S] = E_{X∼S}[f(X)] and E[f, D] = E_{X∼D}[f(X)]. Consider two independent samples S = (X₁, ..., X_m) and S′ = (X′₁, ..., X′_m) drawn from Dᵐ, and let σ = (σ₁, ..., σ_m) be Rademacher random variables. We have

  2 E_S sup_{f∈F} |E[f, S] − E[f, D]|
    = E_S sup_{f∈F} |E[f, S] − E[f, D]| + E_{S′} sup_{f∈F} |E[f, S′] − E[f, D]|
    ≥ E_{S,S′} sup_{f∈F} ( |E[f, S] − E[f, D]| + |E[f, S′] − E[f, D]| )
    ≥ E_{S,S′} sup_{f∈F} |E[f, S] − E[f, S′]|
    = E_{S,S′,σ} sup_{f∈F} | (1/m) Σ_{i∈[m]} σ_i (f(X_i) − f(X′_i)) |
    = Rm(F̃, D × D)/2.

We are left to show a lower bound that holds with high probability. Define g(S) = sup_{f∈F} |E[f, S] − E[f, D]|. Any change of one element of S can change g(S) by at most 1/m. Therefore, by McDiarmid's inequality, P[g(S) ≤ E[g(S)] − t] ≤ exp(−2mt²). Eq. (36) thus holds with probability 1 − δ.

The next lemma provides a uniform convergence lower bound for a universal class of binary functions.

Lemma 23 Let H = {0, 1}^{[n]} be the set of all binary functions on [n], and let D be the uniform distribution over [n]. For any n ≥ 45 and m ≥ 32n, with probability at least 1/2 over i.i.d. samples of size m drawn from D,

  ∃h ∈ H, |E_{X∼S}[h(X)] − E_{X∼D}[h(X)]| ≥ √( n/(512m) ).

Proof Let n ≥ 45 and m ≥ 32n. By Lemma 22, it suffices to provide a lower bound for Rm(H̃, D × D). Fix a sample S = ((x₁, x′₁), ..., (x_m, x′_m)) ∈ (D × D)ᵐ. We have

  2 R(H̃, S) = E_σ sup_{h∈H} (1/m) Σ_{i=1}^m σ_i (h(x_i) − h(x′_i)),

where σ = (σ₁, ..., σ_m) are Rademacher random variables. For a given σ ∈ {±1}ᵐ, define h_σ ∈ H such that h_σ(j) = sign( Σ_{i: x_i=j} σ_i − Σ_{i: x′_i=j} σ_i ). Then

  2 R(H̃, S) ≥ E_σ (1/m) Σ_{i∈[m]} σ_i (h_σ(x_i) − h_σ(x′_i)) = (1/m) E_σ Σ_{j∈[n]} | Σ_{i: x_i=j} σ_i − Σ_{i: x′_i=j} σ_i |.

Let c_j(S) be the number of indices i such that exactly one of x_i = j and x′_i = j holds.
Then E_σ[ | Σ_{i: x_i=j} σ_i − Σ_{i: x′_i=j} σ_i | ] is the expected distance from the origin of a random walk of length c_j(S), which can be bounded from below by √(c_j(S)/2) (Szarek, 1976). Therefore,

  R(H̃, S) ≥ (1/(2m)) Σ_{j∈[n]} √( c_j(S)/2 ).

Taking expectation over samples, we get

  R(H̃, D × D) = E_{S∼(D×D)ᵐ}[R(H̃, S)] ≥ (1/(2m)) Σ_{j∈[n]} E_S[ √(c_j(S)/2) ].   (37)

Our final step is to bound E_S[√(c_j(S))] from below. We have E_S[c_j(S)] = 2m(1/n)(1 − 1/n) ≥ m/n and Var_S[c_j(S)] ≤ E_S[c_j(S)]. Thus, by Chebyshev's inequality,

  P[ c_j(S) ≤ E_S[c_j(S)] − t ] ≤ E_S[c_j(S)]/t².

Setting t = m/(4n), and since m/n ≥ 32, we get E_S[√(c_j(S))] ≥ √( m/(16n) ). Plugging this into Eq. (37), we get that R(H̃, D × D) ≥ √( n/(8m) ). By Lemma 22, it follows that with probability at least 1 − δ over samples,

  ∃f ∈ F, |E_{X∼S}[f(X)] − E_{X∼D}[f(X)]| ≥ √( n/(128m) ) − √( ln(1/δ)/(2m) ).

Fixing δ = 1/2, we get that since n ≥ 64 ln(2), the right-hand side is at least √( n/(512m) ).

Using the two lemmas above, we are now ready to prove our uniform convergence lower bound. This is done by mapping a subset of Hk,θ to a universal class of binary functions over Θ(k²) elements of our domain. Note that for this lower bound it suffices to consider the more restricted domain of binary vectors.

Proof (of Theorem 21) Let q be the largest power of 2 such that q ≤ k. By Lemma 19, there exists a set of vectors Z = {z₁, ..., z_{q²}} ⊆ {0, 1}^{2q²+1} such that for every t ∈ {±1}^{q²} there exists a w_t ∈ Hk,θ such that for all i, t[i](⟨w_t, z_i⟩ − q/2) = 1/2. Denote U = {w_t | t ∈ {±1}^{q²}}. It suffices to prove a lower bound on the uniform convergence of U, since this implies the same lower bound for Hk,θ. Define the distribution D over Z × {±1} such that for (X, Y) ∼ D, X is drawn uniformly from z₁, ..., z_{q²} and Y = −1 with probability 1. Consider the set of functions H = {0, 1}^Z, and for h ∈ H define t_h ∈ {±1}^{q²} such that for all i ∈ [q²], t_h[i] = 2h(z_i) − 1. For any i ∈ [q²], we have

  ℓ(z_i, −1, w_{t_h}) = [r− + ⟨w_{t_h}, z_i⟩]+ = [r− + (t_h[i] + q)/2]+ = [r− + (q − 1)/2 + h(z_i)]+ = r− + (q − 1)/2 + h(z_i).

The last equality follows since r− ≥ (1 − q)/2. It follows that for any h ∈ H and any sample S drawn from D,

  |ℓ(w_{t_h}, S) − ℓ(w_{t_h}, D)| = |E_{X∼S}[h(X)] − E_{X∼D}[h(X)]|.
By Lemma 23, with probability at least 1/2 over the sample S ∼ Dᵐ,

  ∃h ∈ H, |E_{X∼S}[h(X)] − E_{X∼D}[h(X)]| ≥ Ω(√(q²/m)) = Ω(√(k²/m)).

Thus, with probability at least 1/2,

  ∃w ∈ Hk,θ, |ℓ(w, S) − ℓ(w, D)| ≥ Ω(√(k²/m)).

Acknowledgements

Tong Zhang is supported by the following grants: NSF IIS1407939, NSF IIS1250985, and NIH R01AI116744.

References

D. Angluin and L. G. Valiant. Fast probabilistic algorithms for Hamiltonian circuits and matchings. Journal of Computer and System Sciences, 18(2):155-193, April 1979.

M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

P. Auer and M. K. Warmuth. Tracking the best disjunction. Machine Learning, 32(2):127-150, 1998.

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005.

S. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, Moscow, 1946.

S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.

R. M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290-330, 1967.

C. Gentile. The robustness of the p-norm algorithms. Machine Learning, 53:265-299, 2003.

S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of NIPS, 2009.

J. Kivinen and M. Warmuth. Additive versus exponentiated gradient updates for learning linear functions. Technical Report UCSC-CRL-94-16, University of California Santa Cruz, Computer Research Laboratory, 1994.

N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm.
Machine Learning, 2:285-318, 1988.

N. Littlestone. Mistake Bounds and Logarithmic Linear-Threshold Learning Algorithms. PhD thesis, U. C. Santa Cruz, March 1989.

P. Massart. Some applications of concentration inequalities to statistics. In Annales de la Faculté des Sciences de Toulouse, volume 9(2), pages 245-303. Université Paul Sabatier, 2000.

R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference, pages 322-330, 1997. To appear, The Annals of Statistics.

S. Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University, 2007.

S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2012.

N. Srebro, K. Sridharan, and A. Tewari. Smoothness, low-noise and fast rates. CoRR, abs/1009.3896, 2010.

N. Srebro, K. Sridharan, and A. Tewari. On the universality of online mirror descent. In Advances in Neural Information Processing Systems (NIPS), 2011.

J. J. Sylvester. Thoughts on inverse orthogonal matrices, simultaneous sign-successions, and tessellated pavements in two or more colours, with applications to Newton's rule, ornamental tile-work, and the theory of numbers. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 34(232):461-475, 1867.

S. J. Szarek. On the best constants in the Khinchin inequality. Studia Math, 58(2), 1976.

T. Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.