# On Regularizing Rademacher Observation Losses

Richard Nock
Data61, The Australian National University & The University of Sydney
richard.nock@data61.csiro.au

It has recently been shown that supervised learning of linear classifiers with two of the most popular losses, the logistic and square losses, is equivalent to optimizing an equivalent loss over sufficient statistics about the class: Rademacher observations (rados). It has also been shown that learning over rados brings solutions to two prominent problems for which the state of the art of learning from examples can be comparatively inferior, and in fact less convenient: (i) protecting and learning from private examples, and (ii) learning from distributed datasets without entity resolution. Bis repetita placent: the two proofs of equivalence are different and rely on specific properties of the corresponding losses, so the question of whether these can be unified and generalized inevitably comes to mind. This is our first contribution: we show how they fit into the same theory for the equivalence between example and rado losses. As a second contribution, we show that the generalization unveils a surprising new connection to regularized learning, and in particular a sufficient condition under which regularizing the loss over examples is equivalent to regularizing the rados (i.e. the data) in the equivalent rado loss, in such a way that an efficient algorithm for one regularized rado loss may remain as efficient when changing the regularizer. This is our third contribution: we give a formal boosting algorithm for the regularized exponential rado-loss which boosts with any of the ridge, lasso, SLOPE, $\ell_\infty$, or elastic net regularizers, using the same master routine for all. Because the regularized exponential rado-loss is the equivalent of the regularized logistic loss over examples, we obtain the first efficient proxy to the minimization of the regularized logistic loss over examples using such a wide spectrum of regularizers. Experiments with readily available code display that regularization significantly improves rado-based learning, which compares favourably with example-based learning.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1 Introduction

What kind of data should we use to train a supervised learner? A recent result has shown that minimising the popular logistic loss over examples with linear classifiers (in supervised learning) is equivalent to minimising the exponential loss over sufficient statistics about the class known as Rademacher observations (rados, [Nock et al., 2015]), for the same classifier. In short, we fit a classifier over data that is different from examples, and the same classifier generalizes well to new observations. It has been shown that rados offer solutions to two problems for which the state of the art involving examples can be comparatively significantly inferior: protection of the examples' privacy from various algebraic, geometric, statistical and computational standpoints, and learning from private data [Nock et al., 2015]; and learning from a large number of distributed datasets without having to perform entity resolution between datasets [Patrini et al., 2016]. Quite remarkably, the training time of the algorithms involved can be smaller than it would be on examples, by orders of magnitude [Patrini et al., 2016]. Two key problems remain, however. First, the
accuracy of learning from rados can compete experimentally with that of learning from examples, yet there is a gap to reduce for rados to become not just good material to learn from in a privacy/distributed setting, but also a serious alternative to learning from examples at large, yielding new avenues for supervised learning. Second, theoretically speaking, it is now known that two widely popular losses over examples admit an equivalent loss in the rado world: the logistic loss and the square loss [Nock et al., 2015, Patrini et al., 2016]. This inevitably suggests that the property may hold for more losses, yet the existing proofs display barely any pattern of generalizability.

Our contributions: in this paper, we provide answers to these two questions, with three main contributions. Our first contribution is to show that this generalization indeed holds: other example losses admit equivalent losses in the rado world, meaning in particular that their minimising classifier is the same, regardless of the dataset of examples. The technique we use exploits a two-player zero-sum game representation of convex losses that has been very useful to analyse boosting algorithms [Schapire, 2003, Telgarsky, 2012], with one key difference: payoffs are non-linear convex, and possibly non-differentiable. These also resemble the entropic dual losses [Reid et al., 2015], with the difference that we do not enforce conjugacy over the simplex. The conditions of the game are slightly different for examples and rados. We provide necessary and sufficient conditions for the resulting losses over examples and rados to be equivalent. Informally, equivalence happens iff the convex functions of the games satisfy a symmetry relationship and the weights satisfy a linear system of equations. Some popular losses fit in the equivalence [Nair and Hinton, 2010, Gentile and Warmuth, 1998, Nock and Nielsen, 2008, Telgarsky, 2012, Vapnik, 1998, van Rooyen et al., 2015].

Our second contribution came unexpectedly through this equivalence. Regularizing a loss is standard in machine learning [Bach et al., 2011]. We show a sufficient condition for the equivalence under which regularizing the example loss is equivalent to regularizing the rados in the equivalent rado loss, i.e. making a Minkowski sum of the rado set with a classifier-based set. This property is independent of the regularizer, and incidentally happens to hold for all our cases of equivalence (cf. the first contribution). A regularizer added to a loss over examples thus transfers to the data in the rado world, in essentially the same way for all regularizers, and if one can solve the non-trivial computational and optimization problem that this data modification poses for one regularized rado loss, then, basically, a good optimization algorithm for this regularized rado loss may fit other regularizers as well.

Our third contribution exemplifies this. We propose an iterative boosting algorithm, Ω-R.ADABOOST, that learns a classifier from rados using the exponential regularized rado loss, with the regularization choice belonging to the ridge, lasso, $\ell_\infty$, or the recently coined SLOPE [Bogdan et al., 2015]. Since rado regularization would in theory require modifying the data at each iteration, such schemes are computationally non-trivial. We show that this modification can in fact be bypassed for the exponential rado loss, so that Ω-R.ADABOOST is as fast as ADABOOST.
Ω-R.ADABOOST has, however, a key advantage over ADABOOST that to our knowledge is new in the boosting world: for any of these four regularizers, Ω-R.ADABOOST is a boosting algorithm; thus, because of the equivalence between the minimization of the logistic loss over examples and the minimization of the exponential rado loss, Ω-R.ADABOOST is in fact an efficient proxy to boost the regularized logistic loss over examples using whichever of the four regularizers, and, by extension, linear combinations of them (e.g., elastic net regularization [Zou and Hastie, 2005]). We are not aware of any formal boosting algorithm for the regularized logistic loss with such a wide spectrum of regularizers. Extensive experiments validate this property: Ω-R.ADABOOST is all the better vs ADABOOST (unregularized or regularized) as the domain gets larger, and is able to rapidly learn both accurate and sparse classifiers, making it an especially good contender for supervised learning at large on big domains.

The rest of this paper is as follows. Sections 2, 3 and 4 respectively present the equivalence between example and rado losses, its extension to regularized learning, and Ω-R.ADABOOST. Sections 5 and 6 respectively present experiments and conclude. In order not to overload the paper's body, a Supplementary Material (SM) contains the proofs and additional theoretical and experimental results.

2 Games and equivalent example/rado losses

To avoid notational load, we briefly present our learning setting so as to single out the key quantity in our formulation of the general two-player game. Let $[m] \doteq \{1, 2, \ldots, m\}$ and $\Sigma_m \doteq \{-1, 1\}^m$, for $m > 0$. The classical (batch) supervised learner is example-based: it is given a set of examples $\mathcal{S} = \{(\mathbf{x}_i, y_i), i \in [m]\}$ where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \Sigma_1$, $\forall i \in [m]$. It returns a classifier $h : \mathbb{R}^d \to \mathbb{R}$ from a predefined set $\mathcal{H}$. Let $z_i(h) \doteq y_i h(\mathbf{x}_i)$, and abbreviate $\mathbf{z}(h)$ by $\mathbf{z}$ for short. The learner fits $h$ to the minimization of a loss. Table 1, column $\ell_e$, presents some losses that can be used; we remark that $h$ appears only through $\mathbf{z}$, so let us consider in this section that the learner rather fits a vector $\mathbf{z} \in \mathbb{R}^m$. We can now define our two-player game setting. Let $\varphi_e : \mathbb{R} \to \mathbb{R}$ and $\varphi_r : \mathbb{R} \to \mathbb{R}$ be two convex and lower-semicontinuous generators. We define the functions $L_e : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ and $L_r : \mathbb{R}^{2^m} \times \mathbb{R}^m \to \mathbb{R}$:

$$L_e(\mathbf{p}, \mathbf{z}) \doteq \sum_{i \in [m]} p_i z_i + \mu_e \sum_{i \in [m]} \varphi_e(p_i), \qquad (1)$$

$$L_r(\mathbf{q}, \mathbf{z}) \doteq \sum_{I \subseteq [m]} q_I \sum_{i \in I} z_i + \mu_r \sum_{I \subseteq [m]} \varphi_r(q_I), \qquad (2)$$

where $\mu_e, \mu_r > 0$ do not depend on $\mathbf{z}$. For the notation to be meaningful, the coordinates of $\mathbf{q}$ are assumed (wlog) to be in bijection with $2^{[m]}$. The dependence of both problems on their respective generators is implicit and shall be clear from context. The adversary's goal is to fit

$$\mathbf{p}^\star(\mathbf{z}) \doteq \arg\min_{\mathbf{p} \in \mathbb{R}^m} L_e(\mathbf{p}, \mathbf{z}), \qquad (3)$$

$$\mathbf{q}^\star(\mathbf{z}) \doteq \arg\min_{\mathbf{q} \in \mathcal{H}_{2^m}} L_r(\mathbf{q}, \mathbf{z}), \qquad (4)$$

with $\mathcal{H}_{2^m} \doteq \{\mathbf{q} \in \mathbb{R}^{2^m} : \mathbf{1}^\top \mathbf{q} = 1\}$, so as to attain

$$L_e(\mathbf{z}) \doteq L_e(\mathbf{p}^\star(\mathbf{z}), \mathbf{z}), \qquad (5)$$

$$L_r(\mathbf{z}) \doteq L_r(\mathbf{q}^\star(\mathbf{z}), \mathbf{z}), \qquad (6)$$

and let $\partial L_e(\mathbf{z})$ and $\partial L_r(\mathbf{z})$ denote their subdifferentials. We view the learner's task as the problem of maximising the attained values in eq. (5) (with examples; this was already sketched above) or eq. (6) (with what we shall call Rademacher observations, or rados), or equivalently minimising the negative of the corresponding function, which we then relate to a loss function. The question of when these two problems are equivalent from the learner's standpoint motivates the following definition.

Definition 1 Two generators $\varphi_e, \varphi_r$ are said to be proportionate iff, for all $m > 0$, there exists $(\mu_e, \mu_r)$ such that

$$L_e(\mathbf{z}) = L_r(\mathbf{z}) + b, \quad \forall \mathbf{z} \in \mathbb{R}^m \qquad (7)$$

($b$ does not depend on $\mathbf{z}$).
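As a minimal numerical sanity check of Definition 1 (ours, not from the paper), take the generators that Table 1 below associates with the logistic/exponential pair, assuming $\varphi_e(p) = p\log p + (1-p)\log(1-p)$, $\varphi_r(q) = q\log q - q$ and $\mu_e = \mu_r = 1$: brute-force minimization of eqs. (1)-(4) shows that $L_e(\mathbf{z}) - L_r(\mathbf{z})$ stays constant across draws of $\mathbf{z}$, as eq. (7) requires.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed generators (consistent with row I of Table 1 and Lemma 4), mu_e = mu_r = 1.
m = 3

def L_e(z):
    # Eq. (3): separable minimization of p.z + sum_i phi_e(p_i) over p in (0,1)^m.
    f = lambda p: p @ z + np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    return minimize(f, np.full(m, 0.5), bounds=[(1e-9, 1 - 1e-9)] * m).fun

def L_r(z):
    # Eq. (4): subsets I of [m] encoded as bitmasks; q constrained to sum to 1.
    zI = np.array([sum(z[i] for i in range(m) if I >> i & 1) for I in range(2 ** m)])
    f = lambda q: q @ zI + np.sum(q * np.log(q) - q)
    return minimize(f, np.full(2 ** m, 2.0 ** -m), bounds=[(1e-9, 1)] * 2 ** m,
                    constraints={"type": "eq", "fun": lambda q: q.sum() - 1}).fun

rng = np.random.default_rng(0)
for _ in range(3):
    z = rng.normal(size=m)
    print(L_e(z) - L_r(z))   # same constant b (close to 1 here) for every z
```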
$\forall m \in \mathbb{N}_+$, let

$$G_m \doteq \left[\begin{array}{cc} \mathbf{0}_{2^{m-1}}^\top & \mathbf{1}_{2^{m-1}}^\top \\ G_{m-1} & G_{m-1} \end{array}\right] \ \left(\in \{0,1\}^{m \times 2^m}\right) \qquad (8)$$

if $m > 1$, and $G_1 \doteq [0\ \ 1]$ otherwise (notation $\mathbf{z}_d$ indicates a vector in $\mathbb{R}^d$).

Theorem 2 $\varphi_e, \varphi_r$ are proportionate iff the optima $\mathbf{p}^\star(\mathbf{z})$ and $\mathbf{q}^\star(\mathbf{z})$ of eqs. (3) and (4) satisfy:

$$\mathbf{p}^\star(\mathbf{z}) \in \partial L_r(\mathbf{z}), \qquad (9)$$
$$G_m \mathbf{q}^\star(\mathbf{z}) \in \partial L_e(\mathbf{z}). \qquad (10)$$

If $\varphi_e, \varphi_r$ are differentiable and strictly convex, they are proportionate iff $\mathbf{p}^\star(\mathbf{z}) = G_m \mathbf{q}^\star(\mathbf{z})$.

We can relax the requirement that convexity be strict, which results in a set-valued identity for $\varphi_e, \varphi_r$ to be proportionate. This gives a necessary and sufficient condition for two generators to be proportionate. It does not say how to construct one from the other, if possible. We now show that it is indeed possible, and prune the search space: if $\varphi_e$ is proportionate to some $\varphi_r$, then it has to be a symmetrized version of $\varphi_r$, according to the following definition.

Definition 3 Let $\varphi_r$ be such that $\mathrm{dom}(\varphi_r) \supseteq (0, 1)$. The symmetrisation of $\varphi_r$ is $\varphi_{s(r)}(z) \doteq \varphi_r(z) + \varphi_r(1 - z)$.

Lemma 4 If $\varphi_e$ and $\varphi_r$ are proportionate, then $\varphi_e(z) = (\mu_r/\mu_e) \cdot \varphi_{s(r)}(z) + (b/\mu_e)$ ($b$ as in eq. (7)).

| # | $\ell_e(\mathbf{z}, \mu_e)$ | $\ell_r(\mathbf{z}, \mu_r)$ | $\varphi_r(z)$ | $\mu_e$ and $\mu_r$ | $a_e$ |
|---|---|---|---|---|---|
| I | $\sum_{i \in [m]} \log(1 + \exp(\tilde{z}^e_i))$ | $\sum_{I \subseteq [m]} \exp(\tilde{z}^r_I)$ | $z \log z - z$ | $\mu_e = \mu_r$ | $\mu_e$ |
| II | $\sum_{i \in [m]} (1 + \tilde{z}^e_i)^2$ | $-(\mathbb{E}_I[-\tilde{z}^r_I] - \mu_r \mathbb{V}_I[-\tilde{z}^r_I])$ | $(1/2) z^2$ | $\mu_e = \mu_r$ | $\mu_e/4$ |
| III | $\sum_{i \in [m]} \max\{0, \tilde{z}^e_i\}$ | $\max\{0, \max_{I \subseteq [m]} \tilde{z}^r_I\}$ | $\chi_{[0,1]}(z)$ | any $\mu_e, \mu_r$ | $\mu_e$ |
| IV | $\sum_i \tilde{z}^e_i$ | $\mathbb{E}_I[\tilde{z}^r_I]$ | $\chi_{[2^{-m}, 1/2]}(z)$ | any $\mu_e, \mu_r$ | $\mu_e$ |

Table 1: Examples of equivalent example and rado losses. The rado losses $\ell_r(\mathbf{z}, \mu_r)$ are respectively the Exponential (I), Mean-variance (II), ReLU (III) and Unhinged (IV) rado losses. We use the shorthands $\tilde{z}^e_i \doteq -(1/\mu_e) z_i$ and $\tilde{z}^r_I \doteq -(1/\mu_r) \sum_{i \in I} z_i$. Parameter $a_e$ appears in eq. (14). Column "$\mu_e$ and $\mu_r$" gives the constraints for the equivalence to hold. $\mathbb{E}_I$ and $\mathbb{V}_I$ denote expectation and variance over uniform sampling of $I \subseteq [m]$ (see text for details).

To summarize, $\varphi_e$ and $\varphi_r$ are proportionate iff (i) they meet the structural property that $\varphi_e$ is (proportional to) the symmetrized version of $\varphi_r$ (according to Definition 3), and (ii) the optimal solutions $\mathbf{p}^\star(\mathbf{z})$ and $\mathbf{q}^\star(\mathbf{z})$ of problems (1) and (2) satisfy the conditions of Theorem 2. Depending on the direction, we have two cases for crafting proportionate generators. First, if we have $\varphi_r$, then necessarily $\varphi_e \propto \varphi_{s(r)}$, so we merely have to check Theorem 2. Second, if we have $\varphi_e$, then it must match Definition 3¹. In this case, we have to find $\varphi_r = f + g$ where $g(z) = g(1 - z)$ and $\varphi_e(z) = f(z) + f(1 - z)$.

We now come back to $L_e(\mathbf{z})$ and $L_r(\mathbf{z})$ (Definition 1), and make the connection with example and rado losses. In the next definition, an e-loss $\ell_e(\mathbf{z})$ is a function defined over the coordinates of $\mathbf{z}$, and an r-loss $\ell_r(\mathbf{z})$ is a function defined over the sums of coordinates over subsets. Both functions can depend on other parameters as well.

Definition 5 Suppose the e-loss $\ell_e(\mathbf{z})$ and the r-loss $\ell_r(\mathbf{z})$ are such that there exist $f_e : \mathbb{R} \to \mathbb{R}$ and $f_r : \mathbb{R} \to \mathbb{R}$, both strictly increasing, such that, $\forall \mathbf{z} \in \mathbb{R}^m$,

$$L_e(\mathbf{z}) = f_e(\ell_e(\mathbf{z})), \qquad (11)$$
$$L_r(\mathbf{z}) = f_r(\ell_r(\mathbf{z})), \qquad (12)$$

where $L_e(\mathbf{z})$ and $L_r(\mathbf{z})$ are defined via two proportionate generators $\varphi_e$ and $\varphi_r$ (Definition 1). Then the couple $(\ell_e, \ell_r)$ is called a couple of equivalent example-rado losses.

The following is the main Theorem of this Section; it summarizes all the cases of equivalence between example and rado losses, and shows that the theory developed on example/rado losses with proportionate generators encompasses the specific proofs and cases already known [Nock et al., 2015, Patrini et al., 2016]. Table 1 also displays the generator $\varphi_r$.

Theorem 6 In each row of Table 1, $\ell_e(\mathbf{z}, \mu_e)$ and $\ell_r(\mathbf{z}, \mu_r)$ are equivalent, for $\mu_e$ and $\mu_r$ as indicated.
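To illustrate Theorem 2 and eq. (8) concretely, the following sketch (our code) builds $G_m$ recursively and checks $\mathbf{p}^\star(\mathbf{z}) = G_m \mathbf{q}^\star(\mathbf{z})$ for the row-I generators under the same assumptions as above ($\mu_e = \mu_r = 1$), for which both optima are available in closed form: $p^\star_i = 1/(1 + e^{z_i})$ and $q^\star_I \propto \exp(-\sum_{i \in I} z_i)$.

```python
import numpy as np

def G(m):
    # Recursive construction of G_m from eq. (8); column j is the indicator
    # vector of a subset of [m], with G_1 = [0 1].
    if m == 1:
        return np.array([[0.0, 1.0]])
    Gp = G(m - 1)
    top = np.concatenate([np.zeros(2 ** (m - 1)), np.ones(2 ** (m - 1))])
    return np.vstack([top, np.hstack([Gp, Gp])])

rng = np.random.default_rng(0)
m = 4
z = rng.normal(size=m)
Gm = G(m)                                  # m x 2^m subset-indicator matrix

p_star = 1.0 / (1.0 + np.exp(z))           # closed-form optimum of eq. (3)
zI = Gm.T @ z                              # the 2^m subset sums of z
q_star = np.exp(-zI)
q_star /= q_star.sum()                     # closed-form optimum of eq. (4)

# Theorem 2 (differentiable, strictly convex case): p*(z) = G_m q*(z).
print(np.allclose(p_star, Gm @ q_star))    # True
```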
The proof (SM, Subsection 2.3) details, for each case, the proportionate generators $\varphi_e$ and $\varphi_r$.

3 Learning with (rado) regularized losses

We now detail the learning setting further. In the preceding Section, we defined $z_i(h) \doteq y_i h(\mathbf{x}_i)$, which we plug into the losses of Table 1 to obtain the corresponding example and rado losses. The losses simplify conveniently when $\mathcal{H}$ consists of linear classifiers, $h(\mathbf{x}) \doteq \boldsymbol{\theta}^\top \mathbf{x}$ for some $\boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^d$. In this case, the example loss can be described using the edge vectors $\mathcal{S}_e \doteq \{y_i \cdot \mathbf{x}_i, i = 1, 2, \ldots, m\}$, since $z_i = \boldsymbol{\theta}^\top (y_i \cdot \mathbf{x}_i)$, and the rado loss can be described using Rademacher observations [Nock et al., 2015], since $\sum_{i \in I} z_i = \boldsymbol{\theta}^\top \boldsymbol{\pi}_\sigma$ for $\sigma_i = y_i$ iff $i \in I$ (and $-y_i$ otherwise), with $\boldsymbol{\pi}_\sigma \doteq (1/2) \sum_i (\sigma_i + y_i) \mathbf{x}_i$. Let $\mathcal{S}^\star_r \doteq \{\boldsymbol{\pi}_\sigma, \sigma \in \Sigma_m\}$ denote the set of all Rademacher observations. We rewrite any couple of equivalent example and rado losses as $\ell_e(\mathcal{S}_e, \boldsymbol{\theta})$ and $\ell_r(\mathcal{S}^\star_r, \boldsymbol{\theta})$ respectively², omitting the parameters $\mu_e$ and $\mu_r$, assumed to be fixed beforehand for the equivalence to hold (see Table 1). Let us regularize the example loss, so that the learner's goal becomes to minimize

$$\ell_e(\mathcal{S}_e, \boldsymbol{\theta}, \Omega) \doteq \ell_e(\mathcal{S}_e, \boldsymbol{\theta}) + \Omega(\boldsymbol{\theta}), \qquad (13)$$

with $\Omega$ a regularizer [Bach et al., 2011]. The following shows that when $f_e$ in eq. (11) is linear, there is a rado loss equivalent to this regularized loss, regardless of $\Omega$.

¹ Alternatively, $\varphi_e$ is permissible [Kearns and Mansour, 1999].
² To prevent notational overload, we blend the notions of (pointwise) loss and (samplewise) risk, as just "losses".

Theorem 7 Suppose $\mathcal{H}$ contains linear classifiers. Let $(\ell_e(\mathcal{S}_e, \boldsymbol{\theta}), \ell_r(\mathcal{S}^\star_r, \boldsymbol{\theta}))$ be any couple of equivalent example-rado losses such that $f_e$ in eq. (11) is linear:

$$f_e(z) = a_e \cdot z + b_e, \qquad (14)$$

for some $a_e > 0$ and $b_e \in \mathbb{R}$. Then, for any regularizer $\Omega(\cdot)$ (assuming wlog $\Omega(\mathbf{0}) = 0$), the regularized example loss $\ell_e(\mathcal{S}_e, \boldsymbol{\theta}, \Omega)$ is equivalent to the rado loss $\ell_r(\mathcal{S}^{\star,\Omega,\boldsymbol{\theta}}_r, \boldsymbol{\theta})$ computed over regularized rados:

$$\mathcal{S}^{\star,\Omega,\boldsymbol{\theta}}_r \doteq \mathcal{S}^\star_r \oplus \{-\tilde{\Omega}(\boldsymbol{\theta}) \cdot \boldsymbol{\theta}\}. \qquad (15)$$

Here, $\oplus$ is the Minkowski sum and $\tilde{\Omega}(\boldsymbol{\theta}) \doteq a_e \Omega(\boldsymbol{\theta}) / \|\boldsymbol{\theta}\|_2^2$ if $\boldsymbol{\theta} \neq \mathbf{0}$ (and 0 otherwise).

Theorem 7 applies to all rado losses (I-IV) in Table 1. The effect of regularization on rados is intuitive from the margin standpoint: assume that a good classifier $\boldsymbol{\theta}$ is one that ensures lower-bounded inner products $\boldsymbol{\theta}^\top \mathbf{z} \geq \tau$ for some margin threshold $\tau$. Then any good classifier on a regularized rado $\boldsymbol{\pi}_\sigma$ shall actually meet, over examples, $\sum_{i : y_i = \sigma_i} \boldsymbol{\theta}^\top (y_i \cdot \mathbf{x}_i) \geq \tau + a_e \Omega(\boldsymbol{\theta})$. This inequality ties an "accuracy" of $\boldsymbol{\theta}$ (the edges, on the left-hand side) to its sparsity (on the right-hand side). Clearly, Theorem 7 has an unfamiliar shape, since regularisation modifies the data in the rado world: a different $\boldsymbol{\theta}$, or a different $\Omega$, yields a different $\mathcal{S}^{\star,\Omega,\boldsymbol{\theta}}_r$, and therefore it may seem very tricky to minimize such a regularized loss. Even more, iterative algorithms like boosting algorithms look at first glance a poor choice, since any update of $\boldsymbol{\theta}$ implies an update of the rados as well. What we show in the following Section is essentially the opposite for the exponential rado loss: a generalization of the RADOBOOST algorithm of Nock et al. [2015], which does not modify the rados, is a formal boosting algorithm for a broad set of regularizers. Also, remarkably, only the high-level code of the weak learner depends on the regularizer; that of the strong learner is not affected.
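Before presenting the algorithm, here is a small numeric sketch (ours) of Theorem 7 for the exponential rado loss (row I of Table 1, where we assume $a_e = \mu_e = 1$): it builds a few Rademacher observations from a toy sample, shifts them by $-\tilde{\Omega}(\boldsymbol{\theta}) \cdot \boldsymbol{\theta}$ as in eq. (15), and checks that the exponential rado loss over the shifted rados equals $\exp(\Omega(\boldsymbol{\theta}))$ times the loss over the original ones, i.e., that regularizing the data does regularize the loss.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 10, 5, 16
X = rng.normal(size=(m, d))                      # toy examples
y = rng.choice([-1.0, 1.0], size=m)              # toy labels
theta = rng.normal(size=d)

# Rademacher observations: pi_sigma = (1/2) sum_i (sigma_i + y_i) x_i, one per row.
sigma = rng.choice([-1.0, 1.0], size=(n, m))
rados = 0.5 * ((sigma + y) @ X)

omega = lambda t: np.abs(t).sum()                # any regularizer works; lasso here
shift = omega(theta) / (theta @ theta) * theta   # tilde_Omega(theta) * theta
reg_rados = rados - shift                        # eq. (15): Minkowski sum

exp_loss = lambda R: np.exp(-(R @ theta)).sum()  # exponential rado loss (row I)
# Shifting every rado multiplies the loss by exp(Omega(theta)):
print(np.isclose(exp_loss(reg_rados),
                 np.exp(omega(theta)) * exp_loss(rados)))   # True
```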
4 Boosting with (rado) regularized losses

Ω-R.ADABOOST (Algorithm 1) presents our approach to learning with rados regularized with a regularizer $\Omega$, so as to minimise the loss $\ell^{\exp}_r(\mathcal{S}_r, \boldsymbol{\theta}, \Omega)$ in eq. (45). Classifier $\boldsymbol{\theta}_t$ is defined as $\boldsymbol{\theta}_t \doteq \sum_{t'=1}^{t} \alpha_{\iota(t')} \cdot \mathbf{1}_{\iota(t')}$, where $\mathbf{1}_k$ is the $k$-th canonical basis vector.

Algorithm 1 Ω-R.ADABOOST
Input: set of rados $\mathcal{S}_r \doteq \{\boldsymbol{\pi}_1, \boldsymbol{\pi}_2, \ldots, \boldsymbol{\pi}_n\}$; $T \in \mathbb{N}_+$; parameters $\gamma \in (0, 1)$, $\omega \in \mathbb{R}_+$;
Step 1: let $\boldsymbol{\theta}_0 \leftarrow \mathbf{0}$, $\mathbf{w}_0 \leftarrow (1/n) \mathbf{1}$;
Step 2: for $t = 1, 2, \ldots, T$:
Step 2.1: call the weak learner: $(\iota(t), r_t) \leftarrow$ Ω-WL$(\mathcal{S}_r, \mathbf{w}_t, \gamma, \omega, \boldsymbol{\theta}_{t-1})$;
Step 2.2: compute the update parameters $\alpha_{\iota(t)}$ and $\delta_t$ (here, $\pi^\star_k \doteq \max_j |\pi_{jk}|$):
$$\alpha_{\iota(t)} \leftarrow \frac{1}{2 \pi^\star_{\iota(t)}} \log \frac{1 + r_t}{1 - r_t} \quad \text{and} \quad \delta_t \leftarrow \omega \cdot (\Omega(\boldsymbol{\theta}_t) - \Omega(\boldsymbol{\theta}_{t-1})); \qquad (16)$$
Step 2.3: update and normalize the weights: for $j = 1, 2, \ldots, n$,
$$w_{tj} \leftarrow w_{(t-1)j} \cdot \exp\left(-\alpha_{\iota(t)} \pi_{j\iota(t)} + \delta_t\right) / Z_t; \qquad (17)$$
Return $\boldsymbol{\theta}_T$;

The expected edge $r_t$ used to compute $\alpha_{\iota(t)}$ in eq. (16) is based on the following basis assignation:

$$r_{\iota(t)} \leftarrow \frac{1}{\pi^\star_{\iota(t)}} \sum_{j=1}^{n} w_{tj} \pi_{j\iota(t)} \quad (\in [-1, 1]). \qquad (19)$$

The computation of $r_t$ is possibly tweaked by the weak learner, as displayed in Algorithm 2 (Ω-WL).

Algorithm 2 Ω-WL, for $\Omega \in \{\|\cdot\|_1, \|\cdot\|^2_\Gamma, \|\cdot\|_\infty, \|\cdot\|_\Phi\}$
Input: set of rados $\mathcal{S}_r \doteq \{\boldsymbol{\pi}_1, \boldsymbol{\pi}_2, \ldots, \boldsymbol{\pi}_n\}$; weights $\mathbf{w} \in \triangle_n$; parameters $\gamma \in (0, 1)$, $\omega \in \mathbb{R}_+$; classifier $\boldsymbol{\theta} \in \mathbb{R}^d$;
Step 1: pick a weak feature $\iota^\star \in [d]$;
Optional: use the preference order $\iota \succeq \iota'$ iff $|r_\iota| - \delta_\iota \geq |r_{\iota'}| - \delta_{\iota'}$, where $\delta_\iota \doteq \omega \cdot (\Omega(\boldsymbol{\theta} + \alpha_\iota \mathbf{1}_\iota) - \Omega(\boldsymbol{\theta}))$, $r_\iota$ is given by eq. (19) and $\alpha_\iota$ by eq. (16);
Step 2: if $\Omega = \|\cdot\|^2_\Gamma$ then
$$r^\star \leftarrow \begin{cases} r_{\iota^\star} & \text{if } r_{\iota^\star} \in [-\gamma, \gamma] \\ \mathrm{sign}(r_{\iota^\star}) \cdot \gamma & \text{otherwise} \end{cases}; \qquad (18)$$
else $r^\star \leftarrow r_{\iota^\star}$;
Return $(\iota^\star, r^\star)$;

We investigate four choices for $\Omega$. For each of them, we prove the boosting ability of Ω-R.ADABOOST ($\Gamma$ is symmetric positive definite, $S_d$ is the symmetric group of order $d$, and $|\boldsymbol{\theta}|$ is the vector whose coordinates are the absolute values of the coordinates of $\boldsymbol{\theta}$):

$$\|\boldsymbol{\theta}\|_1 \doteq \mathbf{1}^\top |\boldsymbol{\theta}| \ (\text{lasso}), \quad \|\boldsymbol{\theta}\|^2_\Gamma \doteq \boldsymbol{\theta}^\top \Gamma \boldsymbol{\theta} \ (\text{ridge}), \quad \|\boldsymbol{\theta}\|_\infty \doteq \max_k |\theta_k| \ (\ell_\infty), \quad \|\boldsymbol{\theta}\|_\Phi \doteq \max_{M \in S_d} (M |\boldsymbol{\theta}|)^\top \boldsymbol{\xi} \ (\text{SLOPE}) \qquad (20)$$

[Bach et al., 2011, Bogdan et al., 2015, Duchi and Singer, 2009, Su and Candès, 2015]. The coordinates of $\boldsymbol{\xi}$ in SLOPE are $\xi_k \doteq \Phi^{-1}(1 - kq/(2d))$, where $\Phi^{-1}(\cdot)$ is the quantile function of the standard normal distribution and $q \in (0, 1)$; thus, the largest coordinates (in absolute value) of $\boldsymbol{\theta}$ are penalized the most. We now establish the boosting ability of Ω-R.ADABOOST. We give no direction for Step 1 in Ω-WL, which is consistent with the definition of a weak learner in boosting theory: all we require from the weak learner is an $|r_\cdot|$ no smaller than some weak learning threshold $\gamma_{WL} > 0$.

Definition 8 Fix any constant $\gamma_{WL} \in (0, 1)$. Ω-WL is said to be a $\gamma_{WL}$-Weak Learner iff the feature $\iota(t)$ it picks at iteration $t$ satisfies $|r_{\iota(t)}| \geq \gamma_{WL}$, for any $t = 1, 2, \ldots, T$.

We also provide an optional step for the weak learner in Ω-WL, which we exploit in the experiments, and which gives a total preference order on the features to optimise Ω-R.ADABOOST further.

Theorem 9 (boosting with ridge). Take $\Omega(\cdot) = \|\cdot\|^2_\Gamma$. Fix any $0 < a < 1/5$, and suppose that $\omega$ and the number of iterations $T$ of Ω-R.ADABOOST are chosen so that

$$\omega < \frac{2a \min_k \max_j \pi^2_{jk}}{T \lambda_\Gamma}, \qquad (21)$$

where $\lambda_\Gamma > 0$ is the largest eigenvalue of $\Gamma$. Then there exists some $\gamma > 0$ (depending on $a$, and given to Ω-WL) such that, for any fixed $0 < \gamma_{WL} < \gamma$, if Ω-WL is a $\gamma_{WL}$-Weak Learner, then Ω-R.ADABOOST returns at the end of the $T$ boosting iterations a classifier $\boldsymbol{\theta}_T$ which meets:

$$\ell^{\exp}_r(\mathcal{S}_r, \boldsymbol{\theta}_T, \|\cdot\|^2_\Gamma) \leq \exp(-a \gamma^2_{WL} T / 2). \qquad (22)$$

Furthermore, if we fix $a = 1/7$, then we can fix $\gamma = 0.98$, and if $a = 1/10$, then we can fix $\gamma = 0.999$.

Two remarks are in order. First, the cases $a = 1/7, 1/10$ show that Ω-WL can still obtain large edges in eq. (19), so even a strong weak learner might fit in for Ω-WL, without clamping edges. Second, the right-hand side of ineq. (21) may be very large if we consider that $\min_k \max_j \pi^2_{jk}$ may be proportional to $m^2$, so the constraint on $\omega$ is in fact loose.
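The following compact sketch (our code, not the paper's released implementation) instantiates Algorithms 1-2 with the optional preference order, no edge clamping, and pluggable regularizers; `slope_norm` computes $\|\cdot\|_\Phi$ via the closed form of the maximum over permutations (sort $|\boldsymbol{\theta}|$ decreasingly against the decreasing $\xi_k$). It assumes no rado feature column is identically zero.

```python
import numpy as np
from scipy.stats import norm

def lasso_norm(t):
    return np.abs(t).sum()

def slope_norm(t, q=0.05):
    # ||t||_Phi with xi_k = Phi^{-1}(1 - kq/(2d)); sorting |t| decreasingly
    # attains the max over the symmetric group S_d in eq. (20).
    d = len(t)
    xi = norm.ppf(1 - np.arange(1, d + 1) * q / (2 * d))
    return np.sort(np.abs(t))[::-1] @ xi

def omega_r_adaboost(rados, T=100, omega=1e-3, reg=lasso_norm):
    # rados: (n, d) array, one Rademacher observation per row.
    n, d = rados.shape
    pi_star = np.abs(rados).max(axis=0)           # pi*_k = max_j |pi_{jk}|
    theta, w = np.zeros(d), np.full(n, 1.0 / n)
    for _ in range(T):
        # Omega-WL with the preference order: score each feature by |r_k| - delta_k.
        r = np.clip(w @ rados / pi_star, -1 + 1e-12, 1 - 1e-12)   # eq. (19)
        alpha = np.log((1 + r) / (1 - r)) / (2 * pi_star)         # eq. (16)
        delta = np.array([omega * (reg(theta + alpha[k] * np.eye(d)[k]) - reg(theta))
                          for k in range(d)])
        k = int(np.argmax(np.abs(r) - delta))
        theta[k] += alpha[k]
        # eq. (17); the additive delta_t in the exponent is constant across j
        # and cancels in the normalization by Z_t.
        w = w * np.exp(-alpha[k] * rados[:, k])
        w /= w.sum()
    return theta
```

For instance, `omega_r_adaboost(rados, reg=slope_norm)` would run the SLOPE-regularized variant; swapping `reg` is the only change needed across regularizers, mirroring the fact that the strong learner of Algorithm 1 is regularizer-independent.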
Theorem 10 (boosting with lasso or $\ell_\infty$). Take $\Omega(\cdot) \in \{\|\cdot\|_1, \|\cdot\|_\infty\}$. Suppose Ω-WL is a $\gamma_{WL}$-Weak Learner for some $\gamma_{WL} > 0$, and suppose $0 < a < 3/11$ is such that $\omega$ satisfies:

$$\omega = a \gamma_{WL} \min_k \max_j |\pi_{jk}|. \qquad (23)$$

Then Ω-R.ADABOOST returns at the end of the $T$ boosting iterations a classifier $\boldsymbol{\theta}_T$ which meets:

$$\ell^{\exp}_r(\mathcal{S}_r, \boldsymbol{\theta}_T, \Omega) \leq \exp(-\tilde{T} \gamma^2_{WL} / 2), \qquad (24)$$

where $\tilde{T} = a \gamma_{WL} T$ if $\Omega = \|\cdot\|_1$, and $\tilde{T} = (T - T_\infty) + a \gamma_{WL} T_\infty$ if $\Omega = \|\cdot\|_\infty$; $T_\infty$ is the number of iterations at which the feature computing the $\ell_\infty$ norm was updated³.

We finally investigate the SLOPE choice. The Theorem is proven for $\omega = 1$ in Ω-R.ADABOOST, for two reasons: it matches the original definition [Bogdan et al., 2015], and it unveils an interesting connection between boosting and SLOPE properties.

Theorem 11 (boosting with SLOPE). Take $\Omega(\cdot) = \|\cdot\|_\Phi$. Let $a \doteq \min\{3\gamma_{WL}/11,\ \Phi^{-1}(1 - q/(2d)) / \min_k \max_j |\pi_{jk}|\}$. Suppose wlog $|\theta_{Tk}| \geq |\theta_{T(k+1)}|$, $\forall k$, and fix $\omega = 1$. Suppose (i) Ω-WL is a $\gamma_{WL}$-Weak Learner for some $\gamma_{WL} > 0$, and (ii) the $q$-value is chosen to meet:

$$\Phi^{-1}(1 - q/(2d)) \leq \frac{3 \gamma_{WL}}{11} \min_k \max_j |\pi_{jk}|.$$

Then the classifier $\boldsymbol{\theta}_T$ returned by Ω-R.ADABOOST at the end of the $T$ boosting iterations satisfies:

$$\ell^{\exp}_r(\mathcal{S}_r, \boldsymbol{\theta}_T, \|\cdot\|_\Phi) \leq \exp(-a \gamma^2_{WL} T / 2). \qquad (25)$$

Constraint (ii) on $q$ is interesting in the light of the properties of SLOPE [Su and Candès, 2015]. Modulo some assumptions, SLOPE yields a control of the false discovery rate (FDR), i.e., of the negligible coefficients in the "true" linear model $\boldsymbol{\theta}^\star$ that are found significant in the learned $\boldsymbol{\theta}$. Constraint (ii) links the small achievable FDR (upper-bounded by $q$) to the "boostability" of the data: the fact that each feature $k$ can be chosen by the weak learner for a large $\gamma_{WL}$, or has $\max_j |\pi_{jk}|$ large, precisely flags potentially significant features, thus reducing the risk of sparsity errors and allowing a small $q$, which is constraint (ii). Using the second-order approximation of normal quantiles [Su and Candès, 2015], a sufficient condition for (ii) is that, for some constant $K > 0$,

$$\gamma_{WL} \min_k \max_j |\pi_{jk}| \geq K \sqrt{\log d + \log q^{-1}}; \qquad (26)$$

but $\min_k \max_j |\pi_{jk}|$ is proportional to $m$, so ineq. (26), and thus (ii), may hold even for small samples and $q$-values. An additional Theorem, deferred to the SM for space considerations, shows that for any applicable choice of regularization (eq. (20)), the regularized log-loss of $\boldsymbol{\theta}_T$ over examples enjoys, with high probability, an upper bound monotonically decreasing in $T$, of the form $\ell^{\log}_e(\mathcal{S}_e, \boldsymbol{\theta}_T, \Omega) \leq \log 2 - \kappa T + \tau(m)$, where $\tau(m) \to 0$ as $m \to \infty$ ($\tau$ does not depend on $T$) and $\kappa > 0$ does not depend on $T$. Hence, Ω-R.ADABOOST is an efficient proxy to boost the regularized log-loss over examples using whichever of the ridge, lasso, $\ell_\infty$ or SLOPE regularization (establishing the first boosting algorithm for this choice), or linear combinations of the choices, e.g. for elastic nets. If we were to compare Theorems 9-11 (eqs. (22), (24), (25)), then the convergence looks best for ridge (the unsigned exponent is $O(\gamma^2_{WL})$), while it looks slightly worse for $\ell_\infty$ and SLOPE (the unsigned exponent is now $O(\gamma^3_{WL})$), the lasso being in between.

5 Experiments

We have implemented Ω-WL⁴ using the suggested preference order to retrieve the topmost feature in the order. Hence, the weak learner returns the feature maximising $|r_\iota| - \delta_\iota$. The rationale for this comes from the proofs of Theorems 9-11, which show that $\prod_t \exp(-(r^2_{\iota(t)}/2 - \delta_{\iota(t)}))$ is an upper bound on the exponential regularized rado-loss. We do not clamp the weak learner for $\Omega(\cdot) = \|\cdot\|^2_\Gamma$, so the weak learner is restricted to Step 1 in Ω-WL⁵. The objective of these experiments is to evaluate Ω-R.ADABOOST as a contender for supervised learning per se. We compared Ω-R.ADABOOST to ADABOOST and to $\ell_1$-regularized ADABOOST [Schapire and Singer, 1999, Xi et al., 2009].
All algorithms are run for a total of $T = 1000$ iterations, and at the end of the iterations, the classifier in the sequence that minimizes the empirical loss is kept. Notice therefore that rado-based classifiers are evaluated on the training set that computes the rados. To obtain very sparse solutions for regularized-ADABOOST, we pick its $\omega$ ($\beta$ in [Xi et al., 2009]) in $\{10^{-4}, 1, 10^4\}$. The complete results aggregate experiments on twenty (20) domains, all but one coming from the UCI repository [Bache and Lichman, 2013] (plus the Kaggle competition domain "Give me some credit"), with up to $d = 500+$ features and $m = 100\,000+$ examples. Two tables in the SM (Tables 1 and 2 in Section 3) report respectively the test errors and the sparsity of the classifiers; their summary is given here in Table 2. The experimental setup is a ten-fold stratified cross-validation, for all algorithms and each domain. ADABOOST/regularized-ADABOOST is trained using the complete training fold. When the domain size $m \leq 40\,000$, the set of rados used for Ω-R.ADABOOST is a random subset of rados of size equal to that of the training fold. When the domain size exceeds $40\,000$, a random set of $n = 10\,000$ rados is computed from the training fold. Thus, (i) there is no optimisation of the examples chosen to compute the rados, (ii) we always keep a very small number of rados compared to the maximum available, and (iii) when the domain size gets large, we keep a comparatively tiny number of rados. Hence, the performances of Ω-R.ADABOOST do not stem from any optimization in the choice or size of the rado sample.

³ If several features match this criterion, $T_\infty$ is the total number of iterations for all these features.
⁴ Code available at http://users.cecs.anu.edu.au/~rnock/.
⁵ The values of $\omega$ that we test, in $\{10^{-u}, u \in \{0, 1, 2, 3, 4, 5\}\}$, are small with respect to the upper bound in ineq. (21) given the number of boosting steps ($T = 1000$), and would yield on most domains a maximal $\gamma \approx 1$.

|  | Ada | ∅ | $\|\cdot\|^2_{\mathrm{Id}}$ | $\|\cdot\|_1$ | $\|\cdot\|_\infty$ | $\|\cdot\|_\Phi$ |
|---|---|---|---|---|---|---|
| Ada | - | 11 | 10 | 10 | 8 | 9 |
| ∅ | 9 | - | 3 | 3 | 2 | 1 |
| $\|\cdot\|^2_{\mathrm{Id}}$ | 10 | 17 | - | 11 | 9 | 7 |
| $\|\cdot\|_1$ | 10 | 17 | 7 | - | 7 | 4 |
| $\|\cdot\|_\infty$ | 11 | 18 | 9 | 9 | - | 8 |
| $\|\cdot\|_\Phi$ | 10 | 19 | 10 | 10 | 11 | - |

Table 2: Number of domains (out of 20) for which the algorithm in row beats the algorithm in column (Ada = best result of ADABOOST; ∅ = Ω-R.ADABOOST without regularization; see text).

The experiments support several key observations. First, regularization consistently reduces the test error of Ω-R.ADABOOST, by more than 15% on Magic and 20% on Kaggle. In Table 2, the unregularized Ω-R.ADABOOST (∅) is virtually always beaten by its SLOPE-regularized version. Second, Ω-R.ADABOOST is able to obtain both very sparse and accurate classifiers (Magic, Hardware, Marketing, Kaggle). Third, Ω-R.ADABOOST competes with or beats ADABOOST on all domains, and is all the better as the domain gets bigger. Even qualitatively, as seen in Table 2, the best result obtained by ADABOOST (regularized or not) does not manage to beat any of the regularized versions of Ω-R.ADABOOST on the majority of the domains. Fourth, it is important to have several choices of regularizers at hand: on domain Statlog, the difference in test error between the worst and the best regularization of Ω-R.ADABOOST exceeds 15%. Fifth, as already remarked [Nock et al., 2015], significantly subsampling the rados (e.g. Marketing, Kaggle) still yields very accurate classifiers. Sixth, regularization in Ω-R.ADABOOST successfully reduces sparsity to learn more accurate classifiers on several domains (Spectf, Transfusion, Hill-noise, Winered, Magic, Marketing), achieving efficient adaptive sparsity control.
Last, the comparatively extremely poor results of ADABOOST on the biggest domains seem to come from another advantage of rados that the theory developed so far does not take into account: on domains for which some features are significantly correlated with the class and for which we have a large number of examples, the concentration of the expected feature value in rados seems to provide leveraging coefficients that tend to have much larger (absolute) value than in ADABOOST, making the convergence of Ω-R.ADABOOST significantly faster than that of ADABOOST. For example, we have checked that it takes much more than the $T = 1000$ iterations for ADABOOST to start converging to the results of regularized Ω-R.ADABOOST on Hardware or Kaggle.

6 Conclusion

We have shown that the recent equivalences between two example and rado losses can be unified and generalized via a principled representation of a loss function in a two-player zero-sum game. Furthermore, we have shown that this equivalence extends to regularized losses, where the regularization in the rado loss is performed over the rados themselves, with Minkowski sums. Our theory and experiments on Ω-R.ADABOOST with prominent regularizers (including ridge, lasso, $\ell_\infty$ and SLOPE) indicate that when such a simple regularized form of the rado loss is available, it may help to devise accurate and efficient workarounds to boost a regularized loss over examples via the rado loss, even when the regularizer is significantly more involved, as e.g. for group norms [Bach et al., 2011].

Acknowledgments

Thanks are due to Stephen Hardy and Giorgio Patrini for stimulating discussions around this material.

References

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4:1-106, 2011.

K. Bache and M. Lichman. UCI machine learning repository, 2013.

M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E.-J. Candès. SLOPE: adaptive variable selection via convex optimization. Annals of Applied Statistics, 2015. Also arXiv:1310.1969v2.

J.-C. Duchi and Y. Singer. Efficient learning using forward-backward splitting. In NIPS*22, pages 495-503, 2009.

C. Gentile and M. Warmuth. Linear hinge loss and average margin. In NIPS*11, pages 225-231, 1998.

M.J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. J. Comp. Syst. Sc., 58:109-128, 1999.

V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. In 27th ICML, pages 807-814, 2010.

R. Nock and F. Nielsen. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pages 1201-1208, 2008.

R. Nock, G. Patrini, and A. Friedman. Rademacher observations, private data, and boosting. In 32nd ICML, pages 948-956, 2015.

G. Patrini, R. Nock, S. Hardy, and T. Caetano. Fast learning from distributed datasets without entity matching. In 26th IJCAI, 2016.

M.-D. Reid, R.-M. Frongillo, R.-C. Williamson, and N.-A. Mehta. Generalized mixability via entropic duality. In 28th COLT, pages 1501-1522, 2015.

R.-E. Schapire. The boosting approach to machine learning: An overview. In D.-D. Denison, M.-H. Hansen, C.-C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification, volume 171 of Lecture Notes in Statistics, pages 149-171. Springer Verlag, 2003.

R.-E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. MLJ, 37:297-336, 1999.

W. Su and E.-J. Candès. SLOPE is adaptive to unknown sparsity and asymptotically minimax. CoRR, abs/1503.08393, 2015.
M. Telgarsky. A primal-dual convergence analysis of boosting. JMLR, 13:561-606, 2012.

B. van Rooyen, A. Menon, and R.-C. Williamson. Learning with symmetric label noise: The importance of being unhinged. In NIPS*28, 2015.

V. Vapnik. Statistical Learning Theory. John Wiley, 1998.

Y.-T. Xi, Z.-J. Xiang, P.-J. Ramadge, and R.-E. Schapire. Speed and sparsity of regularized boosting. In 12th AISTATS, pages 615-622, 2009.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67:301-321, 2005.