# On Regularizing Rademacher Observation Losses

Richard Nock
Data61, The Australian National University & The University of Sydney
richard.nock@data61.csiro.au

It has recently been shown that supervised learning of linear classifiers with two of the most popular losses, the logistic and square losses, is equivalent to optimizing an equivalent loss over sufficient statistics about the class: Rademacher observations (rados). It has also been shown that learning over rados brings solutions to two prominent problems for which the state of the art of learning from examples can be comparatively inferior, and in fact less convenient: (i) protecting and learning from private examples, and (ii) learning from distributed datasets without entity resolution. Bis repetita placent: the two proofs of equivalence are different and rely on specific properties of the corresponding losses, so the question of whether these can be unified and generalized inevitably comes to mind. This is our first contribution: we show how they fit into the same theory for the equivalence between example and rado losses. As a second contribution, we show that the generalization unveils a surprising new connection to regularized learning, and in particular a sufficient condition under which regularizing the loss over examples is equivalent to regularizing the rados (i.e. the data) in the equivalent rado loss, in such a way that an efficient algorithm for one regularized rado loss may remain as efficient when changing the regularizer. This is our third contribution: we give a formal boosting algorithm for the regularized exponential rado-loss which boosts with any of the ridge, lasso, SLOPE, $\ell_\infty$, or elastic net regularizers, using the same master routine for all. Because the regularized exponential rado-loss is the equivalent of the regularized logistic loss over examples, we obtain the first efficient proxy to the minimization of the regularized logistic loss over examples using such a wide spectrum of regularizers. Experiments with readily available code display that regularization significantly improves rado-based learning, which compares favourably with example-based learning.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1 Introduction

What kind of data should we use to train a supervised learner? A recent result has shown that minimising the popular logistic loss over examples with linear classifiers (in supervised learning) is equivalent to minimising the exponential loss over sufficient statistics about the class known as Rademacher observations (rados, [Nock et al., 2015]), for the same classifier. In short, we fit a classifier over data that is different from examples, and the same classifier generalizes well to new observations. It has been shown that rados offer solutions to two problems for which the state of the art involving examples can be comparatively significantly inferior: protection of the examples' privacy from various algebraic, geometric, statistical and computational standpoints, and learning from private data [Nock et al., 2015]; and learning from a large number of distributed datasets without having to perform entity resolution between datasets [Patrini et al., 2016]. Quite remarkably, the training time of the algorithms involved can be smaller than it would be on examples, by orders of magnitude [Patrini et al., 2016]. Two key problems remain, however. First, the
accuracy of learning from rados can compete experimentally with that of learning from examples, yet there is a gap to reduce for rados to become not just good material to learn from in a privacy/distributed setting, but also a serious alternative to learning from examples at large, yielding new avenues for supervised learning. Second, theoretically speaking, it is now known that two widely popular losses over examples admit an equivalent loss in the rado world: the logistic loss and the square loss [Nock et al., 2015, Patrini et al., 2016]. This inevitably suggests that the property may hold for more losses, yet the existing proofs display barely any pattern of generalizability.

Our contributions: in this paper, we provide answers to these two questions, with three main contributions. Our first contribution is to show that this generalization indeed holds: other example losses admit equivalent losses in the rado world, meaning in particular that their minimising classifier is the same, regardless of the dataset of examples. The technique we use exploits a two-player zero-sum game representation of convex losses that has been very useful to analyse boosting algorithms [Schapire, 2003, Telgarsky, 2012], with one key difference: payoffs are non-linear convex, and possibly non-differentiable. These also resemble the entropic dual losses [Reid et al., 2015], with the difference that we do not enforce conjugacy over the simplex. The conditions of the game are slightly different for examples and rados. We provide necessary and sufficient conditions for the resulting losses over examples and rados to be equivalent. Informally, equivalence happens iff the convex functions of the games satisfy a symmetry relationship and the weights satisfy a linear system of equations. Some popular losses fit in the equivalence [Nair and Hinton, 2010, Gentile and Warmuth, 1998, Nock and Nielsen, 2008, Telgarsky, 2012, Vapnik, 1998, van Rooyen et al., 2015].

Our second contribution came unexpectedly through this equivalence. Regularizing a loss is standard in machine learning [Bach et al., 2011]. We show a sufficient condition for the equivalence under which regularizing the example loss is equivalent to regularizing the rados in the equivalent rado loss, i.e. making a Minkowski sum of the rado set with a classifier-based set. This property is independent of the regularizer, and incidentally happens to hold for all our cases of equivalence (cf. the first contribution). A regularizer added to a loss over examples thus transfers to the data in the rado world, in essentially the same way for all regularizers, and if one can solve the non-trivial computational and optimization problem that this data modification poses for one regularized rado loss, then, basically, a good optimization algorithm for this regularized rado loss may fit other regularizers as well.

Our third contribution exemplifies this. We propose an iterative boosting algorithm, Ω-R.ADABOOST, that learns a classifier from rados using the exponential regularized rado loss, with the regularization choice belonging to the ridge, lasso, $\ell_\infty$, or the recently coined SLOPE [Bogdan et al., 2015]. Since rado regularization would in theory require modifying the data at each iteration, such schemes are computationally non-trivial. We show that this modification can in fact be bypassed for the exponential rado loss, so that Ω-R.ADABOOST is as fast as ADABOOST.
Ω-R.ADABOOST has, however, a key advantage over ADABOOST that to our knowledge is new in the boosting world: for any of these four regularizers, Ω-R.ADABOOST is a boosting algorithm; thus, because of the equivalence between the minimization of the logistic loss over examples and the minimization of the exponential rado loss, Ω-R.ADABOOST is in fact an efficient proxy to boost the regularized logistic loss over examples using whichever of the four regularizers, and, by extension, linear combinations of them (e.g., elastic net regularization [Zou and Hastie, 2005]). We are not aware of any formal boosting algorithm for the regularized logistic loss with such a wide spectrum of regularizers. Extensive experiments validate this property: Ω-R.ADABOOST is all the better vs ADABOOST (unregularized or regularized) as the domain gets larger, and is able to rapidly learn both accurate and sparse classifiers, making it an especially good contender for supervised learning at large on big domains.

The rest of this paper is as follows. Sections 2, 3 and 4 respectively present the equivalence between example and rado losses, its extension to regularized learning, and Ω-R.ADABOOST. Sections 5 and 6 respectively present experiments and conclude. In order not to overload the paper's body, a Supplementary Material (SM) contains the proofs and additional theoretical and experimental results.

2 Games and equivalent example/rado losses

To avoid notational load, we briefly present our learning setting so as to single out the key quantity in our formulation of the general two-player game. Let $[m] \doteq \{1, 2, \ldots, m\}$ and $\Sigma_m \doteq \{-1, 1\}^m$, for $m > 0$. The classical (batch) supervised learner is example-based: it is given a set of examples $\mathcal{S} = \{(\mathbf{x}_i, y_i), i \in [m]\}$ where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \Sigma_1$, $\forall i \in [m]$. It returns a classifier $h : \mathbb{R}^d \to \mathbb{R}$ from a predefined set $\mathcal{H}$. Let $z_i(h) \doteq y_i h(\mathbf{x}_i)$, and abbreviate $\mathbf{z}(h)$ by $\mathbf{z}$ for short. The learner fits $h$ to the minimization of a loss. Table 1, column $\ell_e$, presents some losses that can be used; we remark that $h$ appears only through $\mathbf{z}$, so let us consider in this section that the learner rather fits a vector $\mathbf{z} \in \mathbb{R}^m$. We can now define our two-player game setting. Let $\varphi_e : \mathbb{R} \to \mathbb{R}$ and $\varphi_r : \mathbb{R} \to \mathbb{R}$ be two convex and lower-semicontinuous generators. We define the functions $L_e : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$ and $L_r : \mathbb{R}^{2^m} \times \mathbb{R}^m \to \mathbb{R}$:

$$L_e(\mathbf{p}, \mathbf{z}) \doteq \sum_{i \in [m]} p_i z_i + \mu_e \sum_{i \in [m]} \varphi_e(p_i), \qquad (1)$$

$$L_r(\mathbf{q}, \mathbf{z}) \doteq \sum_{I \subseteq [m]} q_I \sum_{i \in I} z_i + \mu_r \sum_{I \subseteq [m]} \varphi_r(q_I), \qquad (2)$$

where $\mu_e, \mu_r > 0$ do not depend on $\mathbf{z}$. For the notation to be meaningful, the coordinates of $\mathbf{q}$ are assumed (wlog) to be in bijection with $2^{[m]}$. The dependence of both problems on their respective generators is implicit and shall be clear from context. The adversary's goal is to fit

$$\mathbf{p}^\star(\mathbf{z}) \doteq \arg\min_{\mathbf{p} \in \mathbb{R}^m} L_e(\mathbf{p}, \mathbf{z}), \qquad (3)$$

$$\mathbf{q}^\star(\mathbf{z}) \doteq \arg\min_{\mathbf{q} \in \mathcal{H}_{2^m}} L_r(\mathbf{q}, \mathbf{z}), \qquad (4)$$

with $\mathcal{H}_{2^m} \doteq \{\mathbf{q} \in \mathbb{R}^{2^m} : \mathbf{1}^\top \mathbf{q} = 1\}$, so as to attain

$$L_e(\mathbf{z}) \doteq L_e(\mathbf{p}^\star(\mathbf{z}), \mathbf{z}), \qquad (5)$$

$$L_r(\mathbf{z}) \doteq L_r(\mathbf{q}^\star(\mathbf{z}), \mathbf{z}), \qquad (6)$$

and let $\partial L_e(\mathbf{z})$ and $\partial L_r(\mathbf{z})$ denote their subdifferentials. We view the learner's task as the problem of maximising the attained values in eq. (5) (with examples; this was already sketched above) or eq. (6) (with what we shall call Rademacher observations, or rados), or equivalently minimising the negative of the corresponding function, which we then relate to a loss function. The question of when these two problems are equivalent from the learner's standpoint motivates the following definition.

Definition 1 Two generators $\varphi_e, \varphi_r$ are said to be proportionate iff, for all $m > 0$, there exists $(\mu_e, \mu_r)$ such that

$$L_e(\mathbf{z}) = L_r(\mathbf{z}) + b, \quad \forall \mathbf{z} \in \mathbb{R}^m \qquad (7)$$

($b$ does not depend on $\mathbf{z}$).
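As a minimal numerical sanity check of Definition 1 (ours, not from the paper), take the generators that Table 1 below associates with the logistic/exponential pair, assuming $\varphi_e(p) = p\log p + (1-p)\log(1-p)$, $\varphi_r(q) = q\log q - q$ and $\mu_e = \mu_r = 1$: brute-force minimization of eqs. (1)-(4) shows that $L_e(\mathbf{z}) - L_r(\mathbf{z})$ stays constant across draws of $\mathbf{z}$, as eq. (7) requires.

```python
import numpy as np
from scipy.optimize import minimize

# Assumed generators (consistent with row I of Table 1 and Lemma 4), mu_e = mu_r = 1.
m = 3

def L_e(z):
    # Eq. (3): separable minimization of p.z + sum_i phi_e(p_i) over p in (0,1)^m.
    f = lambda p: p @ z + np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    return minimize(f, np.full(m, 0.5), bounds=[(1e-9, 1 - 1e-9)] * m).fun

def L_r(z):
    # Eq. (4): subsets I of [m] encoded as bitmasks; q constrained to sum to 1.
    zI = np.array([sum(z[i] for i in range(m) if I >> i & 1) for I in range(2 ** m)])
    f = lambda q: q @ zI + np.sum(q * np.log(q) - q)
    return minimize(f, np.full(2 ** m, 2.0 ** -m), bounds=[(1e-9, 1)] * 2 ** m,
                    constraints={"type": "eq", "fun": lambda q: q.sum() - 1}).fun

rng = np.random.default_rng(0)
for _ in range(3):
    z = rng.normal(size=m)
    print(L_e(z) - L_r(z))   # same constant b (close to 1 here) for every z
```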
$\forall m \in \mathbb{N}_+$, let

$$G_m \doteq \left[\begin{array}{cc} \mathbf{0}_{2^{m-1}}^\top & \mathbf{1}_{2^{m-1}}^\top \\ G_{m-1} & G_{m-1} \end{array}\right] \ \left(\in \{0,1\}^{m \times 2^m}\right) \qquad (8)$$

if $m > 1$, and $G_1 \doteq [0\ \ 1]$ otherwise (notation $\mathbf{z}_d$ indicates a vector in $\mathbb{R}^d$).

Theorem 2 $\varphi_e, \varphi_r$ are proportionate iff the optima $\mathbf{p}^\star(\mathbf{z})$ and $\mathbf{q}^\star(\mathbf{z})$ of eqs. (3) and (4) satisfy:

$$\mathbf{p}^\star(\mathbf{z}) \in \partial L_r(\mathbf{z}), \qquad (9)$$
$$G_m \mathbf{q}^\star(\mathbf{z}) \in \partial L_e(\mathbf{z}). \qquad (10)$$

If $\varphi_e, \varphi_r$ are differentiable and strictly convex, they are proportionate iff $\mathbf{p}^\star(\mathbf{z}) = G_m \mathbf{q}^\star(\mathbf{z})$.

We can relax the requirement that convexity be strict, which results in a set-valued identity for $\varphi_e, \varphi_r$ to be proportionate. This gives a necessary and sufficient condition for two generators to be proportionate. It does not say how to construct one from the other, if possible. We now show that it is indeed possible, and prune the search space: if $\varphi_e$ is proportionate to some $\varphi_r$, then it has to be a symmetrized version of $\varphi_r$, according to the following definition.

Definition 3 Let $\varphi_r$ be such that $\mathrm{dom}(\varphi_r) \supseteq (0, 1)$. The symmetrisation of $\varphi_r$ is $\varphi_{s(r)}(z) \doteq \varphi_r(z) + \varphi_r(1 - z)$.

Lemma 4 If $\varphi_e$ and $\varphi_r$ are proportionate, then $\varphi_e(z) = (\mu_r/\mu_e) \cdot \varphi_{s(r)}(z) + (b/\mu_e)$ ($b$ as in eq. (7)).

| # | $\ell_e(\mathbf{z}, \mu_e)$ | $\ell_r(\mathbf{z}, \mu_r)$ | $\varphi_r(z)$ | $\mu_e$ and $\mu_r$ | $a_e$ |
|---|---|---|---|---|---|
| I | $\sum_{i \in [m]} \log(1 + \exp(\tilde{z}^e_i))$ | $\sum_{I \subseteq [m]} \exp(\tilde{z}^r_I)$ | $z \log z - z$ | $\mu_e = \mu_r$ | $\mu_e$ |
| II | $\sum_{i \in [m]} (1 + \tilde{z}^e_i)^2$ | $-(\mathbb{E}_I[-\tilde{z}^r_I] - \mu_r \mathbb{V}_I[-\tilde{z}^r_I])$ | $(1/2) z^2$ | $\mu_e = \mu_r$ | $\mu_e/4$ |
| III | $\sum_{i \in [m]} \max\{0, \tilde{z}^e_i\}$ | $\max\{0, \max_{I \subseteq [m]} \tilde{z}^r_I\}$ | $\chi_{[0,1]}(z)$ | any $\mu_e, \mu_r$ | $\mu_e$ |
| IV | $\sum_i \tilde{z}^e_i$ | $\mathbb{E}_I[\tilde{z}^r_I]$ | $\chi_{[2^{-m}, 1/2]}(z)$ | any $\mu_e, \mu_r$ | $\mu_e$ |

Table 1: Examples of equivalent example and rado losses. The rado losses $\ell_r(\mathbf{z}, \mu_r)$ are respectively the Exponential (I), Mean-variance (II), ReLU (III) and Unhinged (IV) rado losses. We use the shorthands $\tilde{z}^e_i \doteq -(1/\mu_e) z_i$ and $\tilde{z}^r_I \doteq -(1/\mu_r) \sum_{i \in I} z_i$. Parameter $a_e$ appears in eq. (14). Column "$\mu_e$ and $\mu_r$" gives the constraints for the equivalence to hold. $\mathbb{E}_I$ and $\mathbb{V}_I$ denote expectation and variance over uniform sampling of $I \subseteq [m]$ (see text for details).

To summarize, $\varphi_e$ and $\varphi_r$ are proportionate iff (i) they meet the structural property that $\varphi_e$ is (proportional to) the symmetrized version of $\varphi_r$ (according to Definition 3), and (ii) the optimal solutions $\mathbf{p}^\star(\mathbf{z})$ and $\mathbf{q}^\star(\mathbf{z})$ of problems (1) and (2) satisfy the conditions of Theorem 2. Depending on the direction, we have two cases for crafting proportionate generators. First, if we have $\varphi_r$, then necessarily $\varphi_e \propto \varphi_{s(r)}$, so we merely have to check Theorem 2. Second, if we have $\varphi_e$, then it must match Definition 3¹. In this case, we have to find $\varphi_r = f + g$ where $g(z) = g(1 - z)$ and $\varphi_e(z) = f(z) + f(1 - z)$.

We now come back to $L_e(\mathbf{z})$ and $L_r(\mathbf{z})$ (Definition 1), and make the connection with example and rado losses. In the next definition, an e-loss $\ell_e(\mathbf{z})$ is a function defined over the coordinates of $\mathbf{z}$, and an r-loss $\ell_r(\mathbf{z})$ is a function defined over the sums of coordinates over subsets. Both functions can depend on other parameters as well.

Definition 5 Suppose the e-loss $\ell_e(\mathbf{z})$ and the r-loss $\ell_r(\mathbf{z})$ are such that there exist $f_e : \mathbb{R} \to \mathbb{R}$ and $f_r : \mathbb{R} \to \mathbb{R}$, both strictly increasing, such that, $\forall \mathbf{z} \in \mathbb{R}^m$,

$$L_e(\mathbf{z}) = f_e(\ell_e(\mathbf{z})), \qquad (11)$$
$$L_r(\mathbf{z}) = f_r(\ell_r(\mathbf{z})), \qquad (12)$$

where $L_e(\mathbf{z})$ and $L_r(\mathbf{z})$ are defined via two proportionate generators $\varphi_e$ and $\varphi_r$ (Definition 1). Then the couple $(\ell_e, \ell_r)$ is called a couple of equivalent example-rado losses.

The following is the main Theorem of this Section; it summarizes all the cases of equivalence between example and rado losses, and shows that the theory developed on example/rado losses with proportionate generators encompasses the specific proofs and cases already known [Nock et al., 2015, Patrini et al., 2016]. Table 1 also displays the generator $\varphi_r$.

Theorem 6 In each row of Table 1, $\ell_e(\mathbf{z}, \mu_e)$ and $\ell_r(\mathbf{z}, \mu_r)$ are equivalent, for $\mu_e$ and $\mu_r$ as indicated.
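To illustrate Theorem 2 and eq. (8) concretely, the following sketch (our code) builds $G_m$ recursively and checks $\mathbf{p}^\star(\mathbf{z}) = G_m \mathbf{q}^\star(\mathbf{z})$ for the row-I generators under the same assumptions as above ($\mu_e = \mu_r = 1$), for which both optima are available in closed form: $p^\star_i = 1/(1 + e^{z_i})$ and $q^\star_I \propto \exp(-\sum_{i \in I} z_i)$.

```python
import numpy as np

def G(m):
    # Recursive construction of G_m from eq. (8); column j is the indicator
    # vector of a subset of [m], with G_1 = [0 1].
    if m == 1:
        return np.array([[0.0, 1.0]])
    Gp = G(m - 1)
    top = np.concatenate([np.zeros(2 ** (m - 1)), np.ones(2 ** (m - 1))])
    return np.vstack([top, np.hstack([Gp, Gp])])

rng = np.random.default_rng(0)
m = 4
z = rng.normal(size=m)
Gm = G(m)                                  # m x 2^m subset-indicator matrix

p_star = 1.0 / (1.0 + np.exp(z))           # closed-form optimum of eq. (3)
zI = Gm.T @ z                              # the 2^m subset sums of z
q_star = np.exp(-zI)
q_star /= q_star.sum()                     # closed-form optimum of eq. (4)

# Theorem 2 (differentiable, strictly convex case): p*(z) = G_m q*(z).
print(np.allclose(p_star, Gm @ q_star))    # True
```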
The proof (SM, Subsection 2.3) details, for each case, the proportionate generators $\varphi_e$ and $\varphi_r$.

3 Learning with (rado) regularized losses

We now detail the learning setting further. In the preceding Section, we defined $z_i(h) \doteq y_i h(\mathbf{x}_i)$, which we plug into the losses of Table 1 to obtain the corresponding example and rado losses. The losses simplify conveniently when $\mathcal{H}$ consists of linear classifiers, $h(\mathbf{x}) \doteq \boldsymbol{\theta}^\top \mathbf{x}$ for some $\boldsymbol{\theta} \in \Theta \subseteq \mathbb{R}^d$. In this case, the example loss can be described using the edge vectors $\mathcal{S}_e \doteq \{y_i \cdot \mathbf{x}_i, i = 1, 2, \ldots, m\}$, since $z_i = \boldsymbol{\theta}^\top (y_i \cdot \mathbf{x}_i)$, and the rado loss can be described using Rademacher observations [Nock et al., 2015], since $\sum_{i \in I} z_i = \boldsymbol{\theta}^\top \boldsymbol{\pi}_\sigma$ for $\sigma_i = y_i$ iff $i \in I$ (and $-y_i$ otherwise), with $\boldsymbol{\pi}_\sigma \doteq (1/2) \sum_i (\sigma_i + y_i) \mathbf{x}_i$. Let $\mathcal{S}^\star_r \doteq \{\boldsymbol{\pi}_\sigma, \sigma \in \Sigma_m\}$ denote the set of all Rademacher observations. We rewrite any couple of equivalent example and rado losses as $\ell_e(\mathcal{S}_e, \boldsymbol{\theta})$ and $\ell_r(\mathcal{S}^\star_r, \boldsymbol{\theta})$ respectively², omitting the parameters $\mu_e$ and $\mu_r$, assumed to be fixed beforehand for the equivalence to hold (see Table 1). Let us regularize the example loss, so that the learner's goal becomes to minimize

$$\ell_e(\mathcal{S}_e, \boldsymbol{\theta}, \Omega) \doteq \ell_e(\mathcal{S}_e, \boldsymbol{\theta}) + \Omega(\boldsymbol{\theta}), \qquad (13)$$

with $\Omega$ a regularizer [Bach et al., 2011]. The following shows that when $f_e$ in eq. (11) is linear, there is a rado loss equivalent to this regularized loss, regardless of $\Omega$.

¹ Alternatively, $\varphi_e$ is permissible [Kearns and Mansour, 1999].
² To prevent notational overload, we blend the notions of (pointwise) loss and (samplewise) risk, as just "losses".

Theorem 7 Suppose $\mathcal{H}$ contains linear classifiers. Let $(\ell_e(\mathcal{S}_e, \boldsymbol{\theta}), \ell_r(\mathcal{S}^\star_r, \boldsymbol{\theta}))$ be any couple of equivalent example-rado losses such that $f_e$ in eq. (11) is linear:

$$f_e(z) = a_e \cdot z + b_e, \qquad (14)$$

for some $a_e > 0$ and $b_e \in \mathbb{R}$. Then, for any regularizer $\Omega(\cdot)$ (assuming wlog $\Omega(\mathbf{0}) = 0$), the regularized example loss $\ell_e(\mathcal{S}_e, \boldsymbol{\theta}, \Omega)$ is equivalent to the rado loss $\ell_r(\mathcal{S}^{\star,\Omega,\boldsymbol{\theta}}_r, \boldsymbol{\theta})$ computed over regularized rados:

$$\mathcal{S}^{\star,\Omega,\boldsymbol{\theta}}_r \doteq \mathcal{S}^\star_r \oplus \{-\tilde{\Omega}(\boldsymbol{\theta}) \cdot \boldsymbol{\theta}\}. \qquad (15)$$

Here, $\oplus$ is the Minkowski sum and $\tilde{\Omega}(\boldsymbol{\theta}) \doteq a_e \Omega(\boldsymbol{\theta}) / \|\boldsymbol{\theta}\|_2^2$ if $\boldsymbol{\theta} \neq \mathbf{0}$ (and 0 otherwise).

Theorem 7 applies to all rado losses (I-IV) in Table 1. The effect of regularization on rados is intuitive from the margin standpoint: assume that a good classifier $\boldsymbol{\theta}$ is one that ensures lower-bounded inner products $\boldsymbol{\theta}^\top \mathbf{z} \geq \tau$ for some margin threshold $\tau$. Then any good classifier on a regularized rado $\boldsymbol{\pi}_\sigma$ shall actually meet, over examples, $\sum_{i : y_i = \sigma_i} \boldsymbol{\theta}^\top (y_i \cdot \mathbf{x}_i) \geq \tau + a_e \Omega(\boldsymbol{\theta})$. This inequality ties an "accuracy" of $\boldsymbol{\theta}$ (the edges, on the left-hand side) to its sparsity (on the right-hand side). Clearly, Theorem 7 has an unfamiliar shape, since regularisation modifies the data in the rado world: a different $\boldsymbol{\theta}$, or a different $\Omega$, yields a different $\mathcal{S}^{\star,\Omega,\boldsymbol{\theta}}_r$, and therefore it may seem very tricky to minimize such a regularized loss. Even more, iterative algorithms like boosting algorithms look at first glance a poor choice, since any update of $\boldsymbol{\theta}$ implies an update of the rados as well. What we show in the following Section is essentially the opposite for the exponential rado loss: a generalization of the RADOBOOST algorithm of Nock et al. [2015], which does not modify the rados, is a formal boosting algorithm for a broad set of regularizers. Also, remarkably, only the high-level code of the weak learner depends on the regularizer; that of the strong learner is not affected.
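Before presenting the algorithm, here is a small numeric sketch (ours) of Theorem 7 for the exponential rado loss (row I of Table 1, where we assume $a_e = \mu_e = 1$): it builds a few Rademacher observations from a toy sample, shifts them by $-\tilde{\Omega}(\boldsymbol{\theta}) \cdot \boldsymbol{\theta}$ as in eq. (15), and checks that the exponential rado loss over the shifted rados equals $\exp(\Omega(\boldsymbol{\theta}))$ times the loss over the original ones, i.e., that regularizing the data does regularize the loss.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 10, 5, 16
X = rng.normal(size=(m, d))                      # toy examples
y = rng.choice([-1.0, 1.0], size=m)              # toy labels
theta = rng.normal(size=d)

# Rademacher observations: pi_sigma = (1/2) sum_i (sigma_i + y_i) x_i, one per row.
sigma = rng.choice([-1.0, 1.0], size=(n, m))
rados = 0.5 * ((sigma + y) @ X)

omega = lambda t: np.abs(t).sum()                # any regularizer works; lasso here
shift = omega(theta) / (theta @ theta) * theta   # tilde_Omega(theta) * theta
reg_rados = rados - shift                        # eq. (15): Minkowski sum

exp_loss = lambda R: np.exp(-(R @ theta)).sum()  # exponential rado loss (row I)
# Shifting every rado multiplies the loss by exp(Omega(theta)):
print(np.isclose(exp_loss(reg_rados),
                 np.exp(omega(theta)) * exp_loss(rados)))   # True
```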
4 Boosting with (rado) regularized losses

Ω-R.ADABOOST (Algorithm 1) presents our approach to learning with rados regularized with a regularizer $\Omega$, so as to minimise the loss $\ell^{\exp}_r(\mathcal{S}_r, \boldsymbol{\theta}, \Omega)$ in eq. (45). Classifier $\boldsymbol{\theta}_t$ is defined as $\boldsymbol{\theta}_t \doteq \sum_{t'=1}^{t} \alpha_{\iota(t')} \cdot \mathbf{1}_{\iota(t')}$, where $\mathbf{1}_k$ is the $k$-th canonical basis vector.

Algorithm 1 Ω-R.ADABOOST
Input: set of rados $\mathcal{S}_r \doteq \{\boldsymbol{\pi}_1, \boldsymbol{\pi}_2, \ldots, \boldsymbol{\pi}_n\}$; $T \in \mathbb{N}_+$; parameters $\gamma \in (0, 1)$, $\omega \in \mathbb{R}_+$;
Step 1: let $\boldsymbol{\theta}_0 \leftarrow \mathbf{0}$, $\mathbf{w}_0 \leftarrow (1/n) \mathbf{1}$;
Step 2: for $t = 1, 2, \ldots, T$:
Step 2.1: call the weak learner: $(\iota(t), r_t) \leftarrow$ Ω-WL$(\mathcal{S}_r, \mathbf{w}_t, \gamma, \omega, \boldsymbol{\theta}_{t-1})$;
Step 2.2: compute the update parameters $\alpha_{\iota(t)}$ and $\delta_t$ (here, $\pi^\star_k \doteq \max_j |\pi_{jk}|$):
$$\alpha_{\iota(t)} \leftarrow \frac{1}{2 \pi^\star_{\iota(t)}} \log \frac{1 + r_t}{1 - r_t} \quad \text{and} \quad \delta_t \leftarrow \omega \cdot (\Omega(\boldsymbol{\theta}_t) - \Omega(\boldsymbol{\theta}_{t-1})); \qquad (16)$$
Step 2.3: update and normalize the weights: for $j = 1, 2, \ldots, n$,
$$w_{tj} \leftarrow w_{(t-1)j} \cdot \exp\left(-\alpha_{\iota(t)} \pi_{j\iota(t)} + \delta_t\right) / Z_t; \qquad (17)$$
Return $\boldsymbol{\theta}_T$;

The expected edge $r_t$ used to compute $\alpha_{\iota(t)}$ in eq. (16) is based on the following basis assignation:

$$r_{\iota(t)} \leftarrow \frac{1}{\pi^\star_{\iota(t)}} \sum_{j=1}^{n} w_{tj} \pi_{j\iota(t)} \quad (\in [-1, 1]). \qquad (19)$$

The computation of $r_t$ is possibly tweaked by the weak learner, as displayed in Algorithm 2 (Ω-WL).

Algorithm 2 Ω-WL, for $\Omega \in \{\|\cdot\|_1, \|\cdot\|^2_\Gamma, \|\cdot\|_\infty, \|\cdot\|_\Phi\}$
Input: set of rados $\mathcal{S}_r \doteq \{\boldsymbol{\pi}_1, \boldsymbol{\pi}_2, \ldots, \boldsymbol{\pi}_n\}$; weights $\mathbf{w} \in \triangle_n$; parameters $\gamma \in (0, 1)$, $\omega \in \mathbb{R}_+$; classifier $\boldsymbol{\theta} \in \mathbb{R}^d$;
Step 1: pick a weak feature $\iota^\star \in [d]$;
Optional: use the preference order $\iota \succeq \iota'$ iff $|r_\iota| - \delta_\iota \geq |r_{\iota'}| - \delta_{\iota'}$, where $\delta_\iota \doteq \omega \cdot (\Omega(\boldsymbol{\theta} + \alpha_\iota \mathbf{1}_\iota) - \Omega(\boldsymbol{\theta}))$, $r_\iota$ is given by eq. (19) and $\alpha_\iota$ by eq. (16);
Step 2: if $\Omega = \|\cdot\|^2_\Gamma$ then
$$r^\star \leftarrow \begin{cases} r_{\iota^\star} & \text{if } r_{\iota^\star} \in [-\gamma, \gamma] \\ \mathrm{sign}(r_{\iota^\star}) \cdot \gamma & \text{otherwise} \end{cases}; \qquad (18)$$
else $r^\star \leftarrow r_{\iota^\star}$;
Return $(\iota^\star, r^\star)$;

We investigate four choices for $\Omega$. For each of them, we prove the boosting ability of Ω-R.ADABOOST ($\Gamma$ is symmetric positive definite, $S_d$ is the symmetric group of order $d$, and $|\boldsymbol{\theta}|$ is the vector whose coordinates are the absolute values of the coordinates of $\boldsymbol{\theta}$):

$$\|\boldsymbol{\theta}\|_1 \doteq \mathbf{1}^\top |\boldsymbol{\theta}| \ (\text{lasso}), \quad \|\boldsymbol{\theta}\|^2_\Gamma \doteq \boldsymbol{\theta}^\top \Gamma \boldsymbol{\theta} \ (\text{ridge}), \quad \|\boldsymbol{\theta}\|_\infty \doteq \max_k |\theta_k| \ (\ell_\infty), \quad \|\boldsymbol{\theta}\|_\Phi \doteq \max_{M \in S_d} (M |\boldsymbol{\theta}|)^\top \boldsymbol{\xi} \ (\text{SLOPE}) \qquad (20)$$

[Bach et al., 2011, Bogdan et al., 2015, Duchi and Singer, 2009, Su and Candès, 2015]. The coordinates of $\boldsymbol{\xi}$ in SLOPE are $\xi_k \doteq \Phi^{-1}(1 - kq/(2d))$, where $\Phi^{-1}(\cdot)$ is the quantile function of the standard normal distribution and $q \in (0, 1)$; thus, the largest coordinates (in absolute value) of $\boldsymbol{\theta}$ are penalized the most. We now establish the boosting ability of Ω-R.ADABOOST. We give no direction for Step 1 in Ω-WL, which is consistent with the definition of a weak learner in boosting theory: all we require from the weak learner is an $|r_\cdot|$ no smaller than some weak learning threshold $\gamma_{WL} > 0$.

Definition 8 Fix any constant $\gamma_{WL} \in (0, 1)$. Ω-WL is said to be a $\gamma_{WL}$-Weak Learner iff the feature $\iota(t)$ it picks at iteration $t$ satisfies $|r_{\iota(t)}| \geq \gamma_{WL}$, for any $t = 1, 2, \ldots, T$.

We also provide an optional step for the weak learner in Ω-WL, which we exploit in the experiments, and which gives a total preference order on the features to optimise Ω-R.ADABOOST further.

Theorem 9 (boosting with ridge). Take $\Omega(\cdot) = \|\cdot\|^2_\Gamma$. Fix any $0 < a < 1/5$, and suppose that $\omega$ and the number of iterations $T$ of Ω-R.ADABOOST are chosen so that

$$\omega < \frac{2a \min_k \max_j \pi^2_{jk}}{T \lambda_\Gamma}, \qquad (21)$$

where $\lambda_\Gamma > 0$ is the largest eigenvalue of $\Gamma$. Then there exists some $\gamma > 0$ (depending on $a$, and given to Ω-WL) such that, for any fixed $0 < \gamma_{WL} < \gamma$, if Ω-WL is a $\gamma_{WL}$-Weak Learner, then Ω-R.ADABOOST returns at the end of the $T$ boosting iterations a classifier $\boldsymbol{\theta}_T$ which meets:

$$\ell^{\exp}_r(\mathcal{S}_r, \boldsymbol{\theta}_T, \|\cdot\|^2_\Gamma) \leq \exp(-a \gamma^2_{WL} T / 2). \qquad (22)$$

Furthermore, if we fix $a = 1/7$, then we can fix $\gamma = 0.98$, and if $a = 1/10$, then we can fix $\gamma = 0.999$.

Two remarks are in order. First, the cases $a = 1/7, 1/10$ show that Ω-WL can still obtain large edges in eq. (19), so even a strong weak learner might fit in for Ω-WL, without clamping edges. Second, the right-hand side of ineq. (21) may be very large if we consider that $\min_k \max_j \pi^2_{jk}$ may be proportional to $m^2$, so the constraint on $\omega$ is in fact loose.
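The following compact sketch (our code, not the paper's released implementation) instantiates Algorithms 1-2 with the optional preference order, no edge clamping, and pluggable regularizers; `slope_norm` computes $\|\cdot\|_\Phi$ via the closed form of the maximum over permutations (sort $|\boldsymbol{\theta}|$ decreasingly against the decreasing $\xi_k$). It assumes no rado feature column is identically zero.

```python
import numpy as np
from scipy.stats import norm

def lasso_norm(t):
    return np.abs(t).sum()

def slope_norm(t, q=0.05):
    # ||t||_Phi with xi_k = Phi^{-1}(1 - kq/(2d)); sorting |t| decreasingly
    # attains the max over the symmetric group S_d in eq. (20).
    d = len(t)
    xi = norm.ppf(1 - np.arange(1, d + 1) * q / (2 * d))
    return np.sort(np.abs(t))[::-1] @ xi

def omega_r_adaboost(rados, T=100, omega=1e-3, reg=lasso_norm):
    # rados: (n, d) array, one Rademacher observation per row.
    n, d = rados.shape
    pi_star = np.abs(rados).max(axis=0)           # pi*_k = max_j |pi_{jk}|
    theta, w = np.zeros(d), np.full(n, 1.0 / n)
    for _ in range(T):
        # Omega-WL with the preference order: score each feature by |r_k| - delta_k.
        r = np.clip(w @ rados / pi_star, -1 + 1e-12, 1 - 1e-12)   # eq. (19)
        alpha = np.log((1 + r) / (1 - r)) / (2 * pi_star)         # eq. (16)
        delta = np.array([omega * (reg(theta + alpha[k] * np.eye(d)[k]) - reg(theta))
                          for k in range(d)])
        k = int(np.argmax(np.abs(r) - delta))
        theta[k] += alpha[k]
        # eq. (17); the additive delta_t in the exponent is constant across j
        # and cancels in the normalization by Z_t.
        w = w * np.exp(-alpha[k] * rados[:, k])
        w /= w.sum()
    return theta
```

For instance, `omega_r_adaboost(rados, reg=slope_norm)` would run the SLOPE-regularized variant; swapping `reg` is the only change needed across regularizers, mirroring the fact that the strong learner of Algorithm 1 is regularizer-independent.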
Theorem 10 (boosting with lasso or $\ell_\infty$). Take $\Omega(\cdot) \in \{\|\cdot\|_1, \|\cdot\|_\infty\}$. Suppose Ω-WL is a $\gamma_{WL}$-Weak Learner for some $\gamma_{WL} > 0$, and suppose $0 < a < 3/11$ is such that $\omega$ satisfies:

$$\omega = a \gamma_{WL} \min_k \max_j |\pi_{jk}|. \qquad (23)$$

Then Ω-R.ADABOOST returns at the end of the $T$ boosting iterations a classifier $\boldsymbol{\theta}_T$ which meets:

$$\ell^{\exp}_r(\mathcal{S}_r, \boldsymbol{\theta}_T, \Omega) \leq \exp(-\tilde{T} \gamma^2_{WL} / 2), \qquad (24)$$

where $\tilde{T} = a \gamma_{WL} T$ if $\Omega = \|\cdot\|_1$, and $\tilde{T} = (T - T_\infty) + a \gamma_{WL} T_\infty$ if $\Omega = \|\cdot\|_\infty$; $T_\infty$ is the number of iterations at which the feature computing the $\ell_\infty$ norm was updated³.

We finally investigate the SLOPE choice. The Theorem is proven for $\omega = 1$ in Ω-R.ADABOOST, for two reasons: it matches the original definition [Bogdan et al., 2015], and it unveils an interesting connection between boosting and SLOPE properties.

Theorem 11 (boosting with SLOPE). Take $\Omega(\cdot) = \|\cdot\|_\Phi$. Let $a \doteq \min\{3\gamma_{WL}/11,\ \Phi^{-1}(1 - q/(2d)) / \min_k \max_j |\pi_{jk}|\}$. Suppose wlog $|\theta_{Tk}| \geq |\theta_{T(k+1)}|$, $\forall k$, and fix $\omega = 1$. Suppose (i) Ω-WL is a $\gamma_{WL}$-Weak Learner for some $\gamma_{WL} > 0$, and (ii) the $q$-value is chosen to meet:

$$\Phi^{-1}(1 - q/(2d)) \leq \frac{3 \gamma_{WL}}{11} \min_k \max_j |\pi_{jk}|.$$

Then the classifier $\boldsymbol{\theta}_T$ returned by Ω-R.ADABOOST at the end of the $T$ boosting iterations satisfies:

$$\ell^{\exp}_r(\mathcal{S}_r, \boldsymbol{\theta}_T, \|\cdot\|_\Phi) \leq \exp(-a \gamma^2_{WL} T / 2). \qquad (25)$$

Constraint (ii) on $q$ is interesting in the light of the properties of SLOPE [Su and Candès, 2015]. Modulo some assumptions, SLOPE yields a control of the false discovery rate (FDR), i.e., of the negligible coefficients in the "true" linear model $\boldsymbol{\theta}^\star$ that are found significant in the learned $\boldsymbol{\theta}$. Constraint (ii) links the small achievable FDR (upper-bounded by $q$) to the "boostability" of the data: the fact that each feature $k$ can be chosen by the weak learner for a large $\gamma_{WL}$, or has $\max_j |\pi_{jk}|$ large, precisely flags potentially significant features, thus reducing the risk of sparsity errors and allowing a small $q$, which is constraint (ii). Using the second-order approximation of normal quantiles [Su and Candès, 2015], a sufficient condition for (ii) is that, for some constant $K > 0$,

$$\gamma_{WL} \min_k \max_j |\pi_{jk}| \geq K \sqrt{\log d + \log q^{-1}}; \qquad (26)$$

but $\min_k \max_j |\pi_{jk}|$ is proportional to $m$, so ineq. (26), and thus (ii), may hold even for small samples and $q$-values. An additional Theorem, deferred to the SM for space considerations, shows that for any applicable choice of regularization (eq. (20)), the regularized log-loss of $\boldsymbol{\theta}_T$ over examples enjoys, with high probability, an upper bound monotonically decreasing in $T$, of the form $\ell^{\log}_e(\mathcal{S}_e, \boldsymbol{\theta}_T, \Omega) \leq \log 2 - \kappa T + \tau(m)$, where $\tau(m) \to 0$ as $m \to \infty$ ($\tau$ does not depend on $T$) and $\kappa > 0$ does not depend on $T$. Hence, Ω-R.ADABOOST is an efficient proxy to boost the regularized log-loss over examples using whichever of the ridge, lasso, $\ell_\infty$ or SLOPE regularization (establishing the first boosting algorithm for this choice), or linear combinations of the choices, e.g. for elastic nets. If we were to compare Theorems 9-11 (eqs. (22), (24), (25)), then the convergence looks best for ridge (the unsigned exponent is $O(\gamma^2_{WL})$), while it looks slightly worse for $\ell_\infty$ and SLOPE (the unsigned exponent is now $O(\gamma^3_{WL})$), the lasso being in between.

5 Experiments

We have implemented Ω-WL⁴ using the suggested preference order to retrieve the topmost feature in the order. Hence, the weak learner returns the feature maximising $|r_\iota| - \delta_\iota$. The rationale for this comes from the proofs of Theorems 9-11, which show that $\prod_t \exp(-(r^2_{\iota(t)}/2 - \delta_{\iota(t)}))$ is an upper bound on the exponential regularized rado-loss. We do not clamp the weak learner for $\Omega(\cdot) = \|\cdot\|^2_\Gamma$, so the weak learner is restricted to Step 1 in Ω-WL⁵. The objective of these experiments is to evaluate Ω-R.ADABOOST as a contender for supervised learning per se. We compared Ω-R.ADABOOST to ADABOOST and to $\ell_1$-regularized ADABOOST [Schapire and Singer, 1999, Xi et al., 2009].
All algorithms are run for a total of $T = 1000$ iterations, and at the end of the iterations, the classifier in the sequence that minimizes the empirical loss is kept. Notice therefore that rado-based classifiers are evaluated on the training set that computes the rados. To obtain very sparse solutions for regularized-ADABOOST, we pick its $\omega$ ($\beta$ in [Xi et al., 2009]) in $\{10^{-4}, 1, 10^4\}$. The complete results aggregate experiments on twenty (20) domains, all but one coming from the UCI repository [Bache and Lichman, 2013] (plus the Kaggle competition domain "Give me some credit"), with up to $d = 500+$ features and $m = 100\,000+$ examples. Two tables in the SM (Tables 1 and 2 in Section 3) report respectively the test errors and the sparsity of the classifiers; their summary is given here in Table 2. The experimental setup is a ten-fold stratified cross-validation, for all algorithms and each domain. ADABOOST/regularized-ADABOOST is trained using the complete training fold. When the domain size $m \leq 40\,000$, the set of rados used for Ω-R.ADABOOST is a random subset of rados of size equal to that of the training fold. When the domain size exceeds $40\,000$, a random set of $n = 10\,000$ rados is computed from the training fold. Thus, (i) there is no optimisation of the examples chosen to compute the rados, (ii) we always keep a very small number of rados compared to the maximum available, and (iii) when the domain size gets large, we keep a comparatively tiny number of rados. Hence, the performances of Ω-R.ADABOOST do not stem from any optimization in the choice or size of the rado sample.

³ If several features match this criterion, $T_\infty$ is the total number of iterations for all these features.
⁴ Code available at http://users.cecs.anu.edu.au/~rnock/.
⁵ The values of $\omega$ that we test, in $\{10^{-u}, u \in \{0, 1, 2, 3, 4, 5\}\}$, are small with respect to the upper bound in ineq. (21) given the number of boosting steps ($T = 1000$), and would yield on most domains a maximal $\gamma \approx 1$.

|  | Ada | ∅ | $\|\cdot\|^2_{\mathrm{Id}}$ | $\|\cdot\|_1$ | $\|\cdot\|_\infty$ | $\|\cdot\|_\Phi$ |
|---|---|---|---|---|---|---|
| Ada | - | 11 | 10 | 10 | 8 | 9 |
| ∅ | 9 | - | 3 | 3 | 2 | 1 |
| $\|\cdot\|^2_{\mathrm{Id}}$ | 10 | 17 | - | 11 | 9 | 7 |
| $\|\cdot\|_1$ | 10 | 17 | 7 | - | 7 | 4 |
| $\|\cdot\|_\infty$ | 11 | 18 | 9 | 9 | - | 8 |
| $\|\cdot\|_\Phi$ | 10 | 19 | 10 | 10 | 11 | - |

Table 2: Number of domains (out of 20) for which the algorithm in row beats the algorithm in column (Ada = best result of ADABOOST; ∅ = Ω-R.ADABOOST without regularization; see text).

The experiments support several key observations. First, regularization consistently reduces the test error of Ω-R.ADABOOST, by more than 15% on Magic and 20% on Kaggle. In Table 2, the unregularized Ω-R.ADABOOST (∅) is virtually always beaten by its SLOPE-regularized version. Second, Ω-R.ADABOOST is able to obtain both very sparse and accurate classifiers (Magic, Hardware, Marketing, Kaggle). Third, Ω-R.ADABOOST competes with or beats ADABOOST on all domains, and is all the better as the domain gets bigger. Even qualitatively, as seen in Table 2, the best result obtained by ADABOOST (regularized or not) does not manage to beat any of the regularized versions of Ω-R.ADABOOST on the majority of the domains. Fourth, it is important to have several choices of regularizers at hand: on domain Statlog, the difference in test error between the worst and the best regularization of Ω-R.ADABOOST exceeds 15%. Fifth, as already remarked [Nock et al., 2015], significantly subsampling the rados (e.g. Marketing, Kaggle) still yields very accurate classifiers. Sixth, regularization in Ω-R.ADABOOST successfully reduces sparsity to learn more accurate classifiers on several domains (Spectf, Transfusion, Hill-noise, Winered, Magic, Marketing), achieving efficient adaptive sparsity control.
Last, the comparatively extremely poor results of ADABOOST on the biggest domains seem to come from another advantage of rados that the theory developed so far does not take into account: on domains for which some features are significantly correlated with the class and for which we have a large number of examples, the concentration of the expected feature value in rados seems to provide leveraging coefficients that tend to have much larger (absolute) value than in ADABOOST, making the convergence of Ω-R.ADABOOST significantly faster than that of ADABOOST. For example, we have checked that it takes much more than the $T = 1000$ iterations for ADABOOST to start converging to the results of regularized Ω-R.ADABOOST on Hardware or Kaggle.

6 Conclusion

We have shown that the recent equivalences between two example and rado losses can be unified and generalized via a principled representation of a loss function in a two-player zero-sum game. Furthermore, we have shown that this equivalence extends to regularized losses, where the regularization in the rado loss is performed over the rados themselves, with Minkowski sums. Our theory and experiments on Ω-R.ADABOOST with prominent regularizers (including ridge, lasso, $\ell_\infty$ and SLOPE) indicate that when such a simple regularized form of the rado loss is available, it may help to devise accurate and efficient workarounds to boost a regularized loss over examples via the rado loss, even when the regularizer is significantly more involved, as e.g. for group norms [Bach et al., 2011].

Acknowledgments

Thanks are due to Stephen Hardy and Giorgio Patrini for stimulating discussions around this material.

References

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4:1-106, 2011.

K. Bache and M. Lichman. UCI machine learning repository, 2013.

M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E.-J. Candès. SLOPE: adaptive variable selection via convex optimization. Annals of Applied Statistics, 2015. Also arXiv:1310.1969v2.

J.-C. Duchi and Y. Singer. Efficient learning using forward-backward splitting. In NIPS*22, pages 495-503, 2009.

C. Gentile and M. Warmuth. Linear hinge loss and average margin. In NIPS*11, pages 225-231, 1998.

M.J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. J. Comp. Syst. Sc., 58:109-128, 1999.

V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. In 27th ICML, pages 807-814, 2010.

R. Nock and F. Nielsen. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pages 1201-1208, 2008.

R. Nock, G. Patrini, and A. Friedman. Rademacher observations, private data, and boosting. In 32nd ICML, pages 948-956, 2015.

G. Patrini, R. Nock, S. Hardy, and T. Caetano. Fast learning from distributed datasets without entity matching. In 26th IJCAI, 2016.

M.-D. Reid, R.-M. Frongillo, R.-C. Williamson, and N.-A. Mehta. Generalized mixability via entropic duality. In 28th COLT, pages 1501-1522, 2015.

R.-E. Schapire. The boosting approach to machine learning: An overview. In D.-D. Denison, M.-H. Hansen, C.-C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification, volume 171 of Lecture Notes in Statistics, pages 149-171. Springer Verlag, 2003.

R.-E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. MLJ, 37:297-336, 1999.

W. Su and E.-J. Candès. SLOPE is adaptive to unknown sparsity and asymptotically minimax. CoRR, abs/1503.08393, 2015.
M. Telgarsky. A primal-dual convergence analysis of boosting. JMLR, 13:561-606, 2012.

B. van Rooyen, A. Menon, and R.-C. Williamson. Learning with symmetric label noise: The importance of being unhinged. In NIPS*28, 2015.

V. Vapnik. Statistical Learning Theory. John Wiley, 1998.

Y.-T. Xi, Z.-J. Xiang, P.-J. Ramadge, and R.-E. Schapire. Speed and sparsity of regularized boosting. In 12th AISTATS, pages 615-622, 2009.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67:301-321, 2005.