# boosting_with_tempered_exponential_measures__a2b912f3.pdf

Boosting with Tempered Exponential Measures

Richard Nock Google Research richardnock@google.com

Ehsan Amid Google Deep Mind eamid@google.com

Manfred K. Warmuth Google Research manfred@google.com

One of the most popular ML algorithms, ADABOOST, can be derived from the dual of a relative entropy minimization problem subject to the fact that the positive weights on the examples sum to one. Essentially, harder examples receive higher probabilities. We generalize this setup to the recently introduced tempered exponential measures (TEMs) where normalization is enforced on a speciﬁc power of the measure and not the measure itself. TEMs are indexed by a parameter t and generalize exponential families (t 1). Our algorithm, t-ADABOOST, recovers ADABOOST as a special case (t 1). We show that t-ADABOOST retains ADABOOST s celebrated exponential convergence rate on margins when t P r0, 1q while allowing a slight improvement of the rate s hidden constant compared to t 1. t-ADABOOST partially computes on a generalization of classical arithmetic over the reals and brings notable properties like guaranteed bounded leveraging coefﬁcients for t P r0, 1q. From the loss that t-ADABOOST minimizes (a generalization of the exponential loss), we show how to derive a new family of tempered losses for the induction of domain-partitioning classiﬁers like decision trees. Crucially, strict properness is ensured for all while their boosting rates span the full known spectrum of boosting rates. Experiments using t-ADABOOST+trees display that signiﬁcant leverage can be achieved by tuning t.

1 Introduction

ADABOOST is one of the most popular ML algorithms [8, 30]. It efﬁciently aggregates weak hypotheses into a highly accurate linear combination [10]. The common motivations of boosting algorithms focus on choosing good linear weights (the leveraging coefﬁcients) for combining the weak hypotheses. A dual view of boosting highlights the dual parameters, which are the weights on the examples. These weights deﬁne a distribution, and ADABOOST can be viewed as minimizing a relative entropy to the last distribution subject to a linear constraint introduced by the current hypothesis [12]. For this reason (more in 2), ADABOOST s weights deﬁne an exponential family.

In this paper, we go beyond weighing the examples with a discrete exponential family distribution, relaxing the constraint that the total mass be unit but instead requiring it for the measure s 1{p2 tq th power, where t is a temperature parameter. Such measures, called tempered exponential measures (TEMs), have been recently introduced [4]. Here we apply the discrete version of these TEMs for deriving a novel boosting algorithm called t-ADABOOST. Again the measures are solutions to a relative entropy minimization problem, but the relative entropy is built from Tsallis entropy and tempered by a parameter t. As t Ñ 1 TEMs become standard exponential family distributions and our new algorithm merges into ADABOOST. As much as ADABOOST minimizes the exponential loss, t-ADABOOST minimizes a generalization of this loss we denote as the tempered exponential loss.

TEMs were introduced in the context of clustering, where they were shown to improve the robustness to outliers of clustering s population minimizers [4]. They have also been shown to bring low-level sparsity features to optimal transport [3]. Boosting is a high-precision machinery: ADABOOST is known to achieve near-optimal boosting rates under the weak learning assumption [1], but it has

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

long been known that numerical issues can derail it, in particular, because of the unbounded weight update rule [14]. So the question of what the TEM setting can bring for boosting is of primordial importance. As we show, t-ADABOOST can suffer no rate setback as boosting s exponential rate of convergence on margins can be preserved for all t P r0, 1q. Several interesting features emerge: the weight update becomes bounded, margin optimization can be tuned with t to focus on examples with very low margin and besides linear separators, the algorithm can also learn progressively clipped models1. Finally, the weight update makes appear a new regime whereby weights can "switch off and on": an example s weight can become zero if too well classiﬁed by the current linear separator, and later on revert to non-zero if badly classiﬁed by a next iterate. t-ADABOOST makes use of a generalization of classical arithmetic over the reals introduced decades ago [18].

Boosting algorithms for linear models like ADABOOST bring more than just learning good linear separators: it is known that (ada)boosting linear models can be used to emulate the training of decision trees (DT) [16], which are models known to lead to some of the best of-the-shelf classiﬁers when linearly combined [9]. Unsurprisingly, the algorithm obtained emulates the classical top-down induction of a tree found in major packages like CART [6] and C4.5 [23]. The loss equivalently minimized, which is, e.g., Matusita s loss for ADABOOST [30, Section 4.1], is a lot more consequential. Contrary to losses for real-valued classiﬁcation, losses to train DTs rely on the estimates of the posterior learned by the model; they are usually called losses for Class Probability Estimation (CPE [25]). The CPE loss is crucial to elicit because (i) it is possible to check whether it is "good" from the standpoint of properness (Bayes rule is optimal for the loss [28]), and (ii) it conditions boosting rates, only a handful of them being known, for the most popular CPE losses [11, 22, 31].

In this paper, we show that this emulation scheme on t-ADABOOST provides a new family of CPE losses with remarkable constancy with respect to properness: losses are strictly proper (Bayes rule is the sole optimum) for any t P p 8, 2q and proper for t 8. Furthermore, over the range t P r 8, 1s, the range of boosting rates spans the full spectrum of known boosting rates [11].

We provide experiments displaying the boosting ability of t-ADABOOST over a range of t encompassing potentially more than the set of values covered by our theory, and highlight the potential of using t as a parameter for efﬁcient tuning the loss [25, Section 8]. Due to a lack of space, proofs are relegated to the appendix (APP.). A primer on TEMs is also given in APP., Section I.

2 Related work

Boosting refers to the ability of an algorithm to combine the outputs of moderately accurate, "weak" hypotheses into a highly accurate, "strong" ensemble. Originally, boosting was introduced in the context of Valiant s PAC learning model as a way to circumvent the then-existing amount of related negative results [10, 34]. After the ﬁrst formal proof that boosting is indeed achievable [29], ADABOOST became the ﬁrst practical and proof-checked boosting algorithm [8, 30]. Boosting was thus born in a machine learning context, but later on, it also emerged in statistics as a way to learn from class residuals computed using the gradient of the loss [9, 21], resulting this time in a ﬂurry of computationally efﬁcient algorithms, still called boosting algorithms, but for which the connection with the original weak/strong learning framework is in general not known.

Our paper draws its boosting connections with ADABOOST s formal lineage. ADABOOST has spurred a long line of work alongside different directions, including statistical consistency [5], noise handling [15, 16], low-resource optimization [22], etc. The starting point of our work is a fascinating result in convex optimization establishing a duality between the algorithm and its memory of past iteration s performances, a probability distribution of so-called weights over examples [12]. From this standpoint, ADABOOST solves the dual of the optimization of a Bregman divergence (constructed from the negative Shannon entropy as the generator) between weights subject to zero correlation with the last weak classiﬁer s performance. As a consequence, weights deﬁne an exponential family. Indeed, whenever a relative entropy is minimized subject to linear constraints, then the solution is a member of an exponential family of distributions (see e.g. [2, Section 2.8.1] for an axiomatization of exponential families). ADABOOST s distribution on the examples is a member of a discrete exponential family where the training examples are the ﬁnite support of the distribution, sufﬁcient statistics are deﬁned from the weak learners, and the leveraging coefﬁcients are the natural parameters. In summary, there

1Traditionally, clipping a sum is done after it has been fully computed. In our case, it is clipped after each new summand is added.

is an intimate relationship between boosting à-la-ADABOOST, exponential families, and Bregman divergences [7, 12, 20] and our work "elevates" these methods above exponential families.

3 Deﬁnitions

We deﬁne the t-logarithm and t-exponential [17, Chapter 7],

logtpzq . 1 1 t z1 t 1 , exptpzq . r1 p1 tqzs1{p1 tq przs . maxt0, zuq, (1)

where the case t 1 is supposed to be the extension by continuity to the log and exp functions, respectively. To preserve the concavity of logt and the convexity of expt, we need t ě 0. In the general case, we also note the asymmetry of the composition: while expt logtpzq z, @t P R, we have logt exptpzq z for t 1 (@z P R), but

logt exptpzq max " 1 1 t, z * pt ă 1q and logt exptpzq min " 1 t 1, z * pt ą 1q.

Comparisons between vectors and real-valued functions written on vectors are assumed componentwise. We assume t 2 and deﬁne notation t . 1{p2 tq. We now deﬁne the key set in which we model our weights (boldfaces denote vector notation).

Deﬁnition 3.1. The co-simplex of Rm, m is deﬁned as m . tq P Rm : q ě 0 1Jq1{t 1u.

The letters q will be used to denote TEMs in m while p denote the co-density q 1 t or any element of the probability simplex. We deﬁne the general tempered relative entropy as

Dtpq1}qq . ÿ

i Prms q1 i logt q1 i logt qi logt 1 q1 i logt 1 qi, (2)

where rms . t1, ..., mu. The tempered relative entropy is a Bregman divergence with convex generator ϕtpzq . z logt z logt 1pzq (for t P R) and ϕtpzq1 logtpxq. As t Ñ 1, Dtpq, q1q becomes the relative entropy with generator ϕ1pxq x logpxq x.

4 Tempered boosting as tempered entropy projection

We start with a ﬁxed sample S tpxi, yiq : i P rmsu where observations xi lie in some domain X and labels yi are 1. ADABOOST maintains a distribution p over the sample. At the current iteration, this distribution is updated based on a current weak hypothesis h P RX using an exponential update:

p1 i pi expp µuiq ř

k pk expp µukq, where ui . yihpxiq, µ P R.

In [12] this update is motivated as minimizing a relative entropy subject to the constraint that p1 is a distribution summing to 1 and p1Ju 0. Following this blueprint, we create a boosting algorithm maintaining a discrete TEM over the sample which is motivated as a constrained minimization of the tempered relative entropy, with a normalization constraint on the co-simplex of Rm:

q1 . arg min rq P m rq Ju 0

Dtprq}qq, with u P Rm. (3)

We now show that the solution q1 is a tempered generalization of ADABOOST s exponential update.

Theorem 1. For all t P Rzt2u, all solutions to (3) have the form

q1 i exptplogt qi µuiq

ˆ qi bt exptp µuiq

Zt , with a bt b . ra1 t b1 t 1s

where Zt ensures co-simplex normalization of the co-density. Furthermore, the unknown µ satisﬁes

µ P arg max logtp Ztpµqq p arg min Ztpµqq, (5)

Algorithm 1 t-ADABOOSTpt, S, Jq

Input: t P r0, 1s, training sample S, #iterations J; Output: classiﬁers HJ, Hp1{1 tq J (see (9)); Step 1 : initialize tempered weights: q1 p1{mt q 1 p P mq; Step 2 : for j 1, 2, ..., J

Step 2.1 : get weak classiﬁer hj Ð weak_learnerpqj, Sq; Step 2.2 : choose weight update coefﬁcient µj P R; Step 2.3 : @i P rms, for uji . yihjpxiq, update the tempered weights as

qpj 1qi qji bt exptp µjujiq

Ztj , where Ztj qj bt exptp µjujq 1{t . (8)

Step 2.4 : choose leveraging coefﬁcient αj P R;

or equivalently is a solution to the nonlinear equation

q1pµq Ju 0. (6)

Finally, if either (i) t P Rą0zt2u or (ii) t 0 and q is not collinear to u, then Ztpµq is strictly convex: the solution to (3) is thus unique, and can be found from expression (4) by ﬁnding the unique minimizer of (5) or (equivalently) the unique solution to (6).

(Proof in APP., Section II.1) The t-product bt, which satisﬁes exptpa bq exptpaq bt exptpbq, was introduced in [18]. Collinearity never happens in our ML setting because u contains the edges of a weak classiﬁer: q ą 0 and collinearity would imply that the weak classiﬁer performs perfect classiﬁcation, and thus defeats the purpose of training an ensemble. @t P Rzt2u, we have the simpliﬁed expression for the normalization coefﬁcient of the TEM and the co-density p1 of q1:

Zt }expt plogt q µ uq}1{t ; p1 i pi bt expt µui

with Z1 t . Z1{t

5 Tempered boosting for linear classiﬁers and clipped linear classiﬁers

Models A model (or classiﬁer) is an element of RX. For any model H, its empirical risk over S is F0{1p H, Sq . p1{mq ř

i Jyi signp Hpxiqq K where J.K, Iverson s bracket [13], is the Boolean value of the inner predicate. We learn linear separators and clipped linear separators. Let pvjqjě1 be the terms of a series and δ ě 0. The clipped sum of the series is:

j Pr Js vj . min

% δ, v J pδq

j Pr J 1s vj

- p P r δ, δsq, for J ą 1,

and we deﬁne the base case J 1 by replacing the inner clipped sum with 0. Note that clipped summation is non-commutative, and so is different from clipping in r δ, δs the whole sum itself2. Given a set of so-called weak hypotheses hj P RX and leveraging coefﬁcients αj P R (for j P r Js), the corresponding linear separators and clipped linear separators are

j Pr Js αjhjpxq ; Hpδq J pxq . pδq

j Pr Js αjhjpxq. (9)

Tempered boosting and its general convergence Our algorithm, t-ADABOOST, is presented in Algorithm 1, using presentation conventions from [30]. Before analyzing its convergence, several properties are to be noted for t-ADABOOST: ﬁrst, it keeps the appealing property, introduced by ADABOOST, that examples receiving the wrong class by the current weak classiﬁer are reweighted

2Fix for example a 1, b 3, δ 2. For v1 a, v2 b, the clipped sum is 2 1 3, but for v1 b, v2 a, the clipped sum becomes 1 2 1.

higher (if µj ą 0). Second, the leveraging coefﬁcients for weak classiﬁers in the ﬁnal classiﬁer (αjs) are not the same as the ones used to update the weights (µjs), unless t 1. Third and last, because of the deﬁnition of expt (1), if t ă 1, tempered weights can switch off and on, i.e., become 0 if an example is "too well classiﬁed" and then revert back to being ą 0 if the example becomes wrongly classiﬁed by the current weak classiﬁer (if µj ą 0). To take into account those zeroing weights, we denote rms: j . ti : qji 0u and m: j . Cardprms: jq (@j P r Js and Card denotes the cardinal).

Let Rj . maxi Rrms: j |yihjpxiq|{q1 t ji and q: j . maxi Prms: j |yihjpxiq|1{p1 tq{R1{p1 tq j . It is worth

noting that q: j is homogeneous to a tempered weight.

Theorem 2. At iteration j, deﬁne the weight function q1 ji . qji if i R rms: j and q: j otherwise; set

p1 m: jq: j 2 tq Rj ÿ

i Prms q1 jiyihjpxiq p P r 1, 1sq. (10)

In algorithm t-ADABOOST, consider the choices (with the convention ś0 k 1 vk . 1)

ˆ 1 ρj M1 tp1 ρj, 1 ρjq

, αj . m1 t

where Mqpa, bq . ppaq bqq{2q1{q is the q-power mean. Then for any H P t HJ, Hp1{1 tq J u, its empirical risk is upperbounded as:

F0{1p H, Sq ď

j 1 Z2 t tj ď

1 m: jq: j 2 t Ktpρjq ˆ Ktpzq . 1 z2

M1 tp1 z, 1 zq

(Proof in APP., Section II.2) We jointly comment t-ADABOOST and Theorem 2 in two parts.

Case t Ñ 1 : t-ADABOOST converges to ADABOOST and Theorem 2 to its convergence analysis: t-ADABOOST converges to ADABOOST as presented in [30, Figure 1]: the tempered simplex becomes the probability simplex, bt converges to regular multiplication, weight update (8) becomes ADABOOST s, αj Ñ µj in (11) and ﬁnally the expression of µj converges to ADABOOST s leveraging coefﬁcient in [30] (limtÑ1 M1 tpa, bq ?

ab). Even guarantee (12) converges to ADABOOST s popular guarantee of [30, Corollary 1] (limtÑ1 Ktpzq ?

1 z2, m: j 0). Also, in this case, we

learn only the unclipped classiﬁer since limtÑ1 Hp1{1 tq J HJ.

Case t ă 1: Let us ﬁrst comment on the convergence rate. The proof of Theorem 2 shows that Ktpzq ď expp z2{p2t qq. Suppose there is no weight switching, so m: j 0, @j (see Section 7) and, as in the boosting model, suppose there exists γ ą 0 such that |ρj| ě γ, @j. Then t ADABOOST is guaranteed to attain empirical risk below some ε ą 0 after a number of iterations equal to J p2t {γ2q logp1{εq. t being an increasing function of t P r0, 1s, we see that t ADABOOST is able to slightly improve upon ADABOOST s celebrated rate [32]. However, t 1{2 for t 0 so the improvement is just on the hidden constant. This analysis is suited for small values of |ρj| and does not reveal an interesting phenomenon for better weak hypotheses. Figure 1 compares Ktpzq curves (K1pzq . limtÑ1 Ktpzq ?

1 z2 for ADABOOST, see [30, Corollary 1]), showing the case t ă 1 can be substantially better, especially when weak hypotheses are not "too weak". If m: j ą 0, switching weights can impede our convergence analysis, though exponential convergence is

always possible if m: jq: j 2 t is small enough; also, when it is not, we may in fact have converged to a good model (see APP., Remark 1). A good criterion to train weak hypotheses is then the optimization of the edge ρj, thus using q1 j normalized in the simplex. Other key features of t-ADABOOST are as follows. First, the weight update and leveraging coefﬁcients of weak classiﬁers are bounded because |µj| ă 1{p Rjp1 tqq (APP., Lemma H) (this is not the case for t Ñ 1 ). This guarantees that new weights are bounded before normalization (unlike for t Ñ 1 ). Second, we remark that µj αj if t 1. Factor m1 t is added for convergence analysis purposes; we can discard it to train the unclipped classiﬁer: it does not change its empirical risk. This is, however, different for factor śj 1 k 1 Zk: from (12), we conclude that this is an indication of how well the past ensemble performs.

-1 -0.75 -0.5 -0.25 0 0.25 0.5 0.75 1

t=1 (Ada Boost)

Figure 1: Plot of Ktpzq in (12), t P r0, 1s (the smaller, the better for convergence).

As it gets better and better, it progressively dampens the leverage of the next weak classiﬁers, a phenomenon that does not occur in boosting, where an excellent weak hypothesis on the current weights can have a leveraging coefﬁcient so large that it wipes out the classiﬁcation of the past ones. This can be useful to control numerical instabilities.

Extension to margins A key property of boosting algorithms like ADABOOST is to be able to boost not just the empirical risk but more generally margins [19, 30], where a margin integrates both the accuracy of label prediction but also the conﬁdence in prediction (say |H|). We generalize the margin notion of [19] to the tempered arithmetic and let νtppx, yq, Hq . tanhtpy Hpxq{2q denote the margin of H on example px, yq, where tanhtpzq . p1 exptp 2zqq{p1 exptp 2zqqp P r 1, 1sq is the tempered hyperbolic tangent. The objective of minimizing the empirical risk is generalized to minimizing the margin risk, Ft,θp H, Sq . p1{mq ř

i Jνtppxi, yiq, Hq ď θK, where θ P p 1, 1q. Guarantees on the empirical risk are guarantees on the margin risk for θ 0 only. In just a few steps, we can generalize Theorem 2 to all θ P p 1, 1q. For space reason, we state the core part of the generalization, from which extending it to a generalization of Theorem 2 is simple. Theorem 3. For any θ P p 1, 1q and t P r0, 1s, the guarantee of algorithm t-ADABOOST in Theorem 2 extends to the margin risk, with notations from Theorem 2, via:

Ft,θp H, Sq ď ˆ1 θ

j 1 Z2 t tj . (13)

(Proof in APP., Section II.3) At a high level, t-ADABOOST brings similar margin maximization properties as ADABOOST. Digging a bit in (13) reveals an interesting phenomenon for t 1 on how margins are optimized compared to t 1. Pick θ ă 0, so we focus on those examples for which the classiﬁer H has high conﬁdence in its wrong classiﬁcation. In this case, factor pp1 θq{p1 θqq2 t is increasing as a function of t P r0, 1s (and this pattern is reversed for θ ą 0). In words, the smaller we pick t P r0, 1s and the better is the bound in (13), suggesting increased "focus" of t-ADABOOST on increasing the margins of examples with low negative margin (e.g. the most difﬁcult ones) compared to the case t 1.

The tempered exponential loss In the same way as ADABOOST introduced the now famous exponential loss, (12) recommends to minimize the normalization coefﬁcient, following (7),

Z2 t tj pµq expt logt qj µ uj 1{t

1{t pwith uji . yihjpxiqq . (14)

We cannot easily unravel the normalization coefﬁcient to make appear an equivalent generalization of the exponential loss, unless we make several assumptions, one being maxi |hjpxiq| is small enough for any j P r Js. In this case, we end up with an equivalent criterion to minimize which looks like

Ftp H, Sq 1 m ÿ

i exp2 t t p yi Hpxiqq , (15)

where we have absorbed in H the factor m1 t appearing in the expt (scaling H by a positive value does not change its empirical risk). This deﬁnes a generalization of the exponential loss which we call the tempered exponential loss. Notice that one can choose to minimize Ftp H, Sq disregarding any constraint on |H|.

6 A broad family of boosting-compliant proper losses for decision trees

Losses for class probability estimation When it comes to tabular data, it has long been known that some of the best models to linearly combine with boosting are decision trees (DT, [9]). Decision trees, like other domain-partitioning classiﬁers, are not trained by minimizing a surrogate loss deﬁned over real-valued predictions, but deﬁned over class probability estimation (CPE, [26]), those estimators being posterior estimation computed at the leaves. Let us introduce a few deﬁnitions for those. A CPE loss ℓ: t 1, 1u ˆ r0, 1s Ñ R is ℓpy, uq . Jy 1K ℓ1puq Jy 1K ℓ 1puq. (16) Functions ℓ1, ℓ 1 are called partial losses. The pointwise conditional risk of local guess u P r0, 1s with respect to a ground truth v P r0, 1s is: Lpu, vq . v ℓ1puq p1 vq ℓ 1puq. (17) A loss is proper iff for any ground truth v P r0, 1s, Lpv, vq infu Lpu, vq, and strictly proper iff u v is the sole minimizer [26]. The (pointwise) Bayes risk is Lpvq . infu Lpu, vq. The log/cross-entropy-loss, square-loss, Matusita loss are examples of CPE losses. One then trains a DT minimizing the expectation of this loss over leaves posteriors, Eλr Lppλqs, pλ being the local proportion of positive examples at leaf λ or equivalently, the local posterior.

Deriving CPE losses from (ada)boosting Recently, it was shown how to derive in a general way a CPE loss to train a DT from the minimization of a surrogate loss with a boosting algorithm [16]. In our case, the surrogate would be Ztj (14) and the boosting algorithm, t-ADABOOST. The principle is simple and ﬁts in four steps: (i) show that a DT can equivalently perform simple linear classiﬁcations, (ii) use a weak learner that designs splits and the boosting algorithm to ﬁt the leveraging coefﬁcient and compute those in closed form, (iii) simplify the expression of the loss using those, (iv) show that the expression simpliﬁed is, in fact, a CPE loss. To get (i), we remark that a DT contains a tree (graph). One can associate to each node a real value. To classify an observation, we sum all reals from the root to a leaf and decide on the class based on the sign of the prediction, just like for any real-valued predictor. Suppose we are at a leaf. What kind of weak hypotheses can create splits "in disguise"? Those can be of the form hjpxq . Jxk ě aj K bj, aj, bj P R, where the observation variable xk is assumed real valued for simplicity and the test Jxk ě aj K splits the leaf s domain in two non-empty subsets. This creates half of the split. hjpxq . Jxk ă aj K bj creates the other half of the split. Remarkably hj satisﬁes the weak learning assumption iff hj does [16]. So we get the split design part of (ii). We compute the leveraging coefﬁcients at the new leaves from the surrogate s minimization / boosting algorithm, end up with new real predictions at the new leaves (instead of the original bj, bj), push those predictions in the surrogate loss for (iii), simplify it and, quite remarkably end up with a loss of the form Eλr Lppλqs, where L turns out to be the pointwise Bayes risk L of a proper loss [16] (notation from [26]).

In the case of [16], it is, in fact, granted that we end up with such a "nice" CPE loss because of the choice of the surrogates at the start. In our case, however, nothing grants this a priori if we start from the tempered exponential loss Ztj (14) so it is legitimate to wonder whether such a chain of derivations (summarized) can happen to reverse engineer an interesting CPE loss: Ztj ?ÞÑ L ?ÞÑ Lptq ?ÞÑ ℓptq 1 ; ℓptq 1 pproper ? strictly proper ? for which ts ?, ...q (18) When such a complete derivation happens until the partial losses ℓ1; ℓ 1 and their properties, we shall write that minimizing Ztj elicits the corresponding loss and partial losses.

Theorem 4. Minimizing Ztj elicits the CPE loss we deﬁne as the tempered loss, with partial losses:

ℓptq 1 puq . ˆ 1 u M1 tpu, 1 uq

2 t , ℓptq 1puq . ℓptq 1 p1 uq, pt P r 8, 2sq. (19)

The tempered loss is symmetric, differentiable, strictly proper for t P p 8, 2q and proper for t 8.

perr (not clipped) perr (clipped) min weight max weight

abalone, t ě 0.0, η 0.0

qsar, t ě 0.6, η 0.1

adult, t ě 0.6, η 0.1

Table 1: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on three domains (rows), displaying from left to right the estimated true error of non-clipped and clipped models, and the min and max codensity weights. These domains were chosen to give an example of three different situations: small values of t perform well (abalone), the best performance is achieved by the largest t (e.g. ADABOOST, qsar), and the worst performance is achieved by the largest t (adult). Topmost row is without noise (η 0) while the others are with 10% training noise; t scale displayed with varying color and width (colormap indicated on each plot). Averages shown for readability: see Table 2 for exhaustive statistical tests.

Differentiability means the partial losses are differentiable, and symmetry follows from the relationship between partial losses [20] (the proof, in APP., Section II.4, derives the inﬁnite case, ℓp 8q 1 puq 2 Ju ď 1{2K) Let us explicit the Bayes risk of the tempered loss and a key property. Lemma 1. The Bayes risk of the tempered loss is (Mq deﬁned in Theorem 2):

Lptqpuq 2up1 uq M1 tpu, 1 uq, (20)

and it satisﬁes @u P r0, 1s, @z P r2 mintu, 1 uu, 1s, Dt P r 8, 2s such that Lptqpuq z.

Lemma 1, whose proof is trivial, allows to show a key boosting result: t 1 retrieves Matusita s loss, for which a near-optimal boosting rate is known [11] while t 8 retrieves the empirical risk, which yields the worst possible guarantee [11]. In between, we have, for example, CART s Gini criterion for t 0, which yields an intermediate boosting guarantee. Continuity with respect to t of the Bayes risks in between the empirical risk and Matusita s loss means the boosting ranges of the tempered loss cover the full known spectrum of boosting rates for t P r 8, 1s. We know of no (differentiable and) proper CPE loss with such coverage. Note that (i) this is a non-constructive result as we do not associate a speciﬁc t for a speciﬁc rate, and (ii) the state-of-the-art boosting rates for DT induction do not seem to cover the case t P p1, 2q, thus left as an open question.

7 Experiments

We have performed experiments on a testbed of 10 UCI domains, whose details are given in APP. (Section A3). Experiments were carried out using a 10-fold stratiﬁed cross-validation procedure.

To compare t-ADABOOST with ADABOOST, we ran t-ADABOOST with a ﬁrst range of values of t P t0.0, 0.2, 0.4, 0.6, 0.8, 0.9u. This is in the range of values covered by our convergence result for linear separators in Theorem 2. Our results on decision tree induction cover a much wider range, in particular for t P p1, 2q. To assess whether this can be an interesting range to study, we added t 1.1 to the set of tested t values. When t ą 1, some extra care is to be put into computations because the weight update becomes unbounded, in a way that is worse than ADABOOST. Indeed, as can be seen from (8), if µjyihjpxiq ď 1{pt 1q (the example is badly classiﬁed by the current weak hypothesis, assuming wlog µj ą 0), the weight becomes inﬁnity before renormalization. In our experiments, picking a value of t close to 2 clearly shows this problem, so to be able to still explore whether t ą 1 can be useful, we picked a value close to 1, namely t 1.1, and checked that in our experiments this produced no such numerical issue. We also considered training clipped and not clipped models.

All boosting models were trained for a number of J 20 decision trees (The appendix provides experiments on training bigger sets). Each decision tree is induced using the tempered loss with the corresponding value of t (see Theorem 4) following the classical top-down template, which consists in growing the current heaviest leaf in the tree and picking the best split for the leaf chosen. We implemented t-ADABOOST exactly as in Section 5, including computing leveraging coefﬁcients as suggested. Thus, we do not scale models. More details are provided in the appendix. In our experiments, we also included experiments on a phenomenon highlighted more than a decade ago [15] and ﬁne-tuned more recently [16], the fact that a convex booster s model is the weakest link when it has to deal with noise in training data. This is an important task because while the tempered exponential loss is convex, it does not ﬁt into the blueprint loss of [15, Deﬁnition 1] because it is not C1 if t 1. One might thus wonder how t-ADABOOST behaves when training data is affected by noise. Letting η denote the proportion of noisy data in the training sample, we tried η P t0.0, 0.1u (The appendix provides experiments on more noise levels). We follow the noise model of [15] and thus independently ﬂip the true label of each example with probability η.

For each run, we recorded the average test error and the average maximum and minimum co-density weight. Table 1 presents a subset of the results obtained on three domains. Table 2 presents a more synthetic view in terms of statistical signiﬁcance of the results for t 1 vs. t 1 (ADABOOST). The table reports only results for t ě 0.6 for synthesis purposes. Values t ă 0.6 performed on average slightly worse than the others but on some domains, as the example of abalone suggests in Table 2 (the plots include all values of t tested in r0, 1.1s), we clearly got above-par results for such small values of t, both in terms of ﬁnal test error but also fast early convergence to low test error. This comment can be generalized to all values of t.

The weights reveal interesting patterns as well. First, perhaps surprisingly, we never encountered the case where weights switch off, regardless of the value of t. The average minimum weight curves of Table 1 generalize to all our tests (see the appendix). This does not rule out the fact that boosting for a much longer number of iterations might lead to weights switching off/on, but the fact that this does not happen at least early during boosting probably comes from the fact that the leveraging coefﬁcients for weights (µ.) are bounded. Furthermore, their maximal absolute value is all the smaller as t decreases to 0. Second, there is a pattern that also repeats on the maximum weights, not on all domains but on a large majority of them and can be seen in abalone and adult in Table 1: the maximum weight of ADABOOST tends to increase much more rapidly compared to t-ADABOOST with t ă 1. In the latter case, we almost systematically observe that the maximum weight tends to be upperbounded, which is not the case for ADABOOST (the growth of the maximal weight looks almost linear). Having bounded weights could be of help to handle numerical issues of (ada)boosting [14].

Our experiments certainly conﬁrm the boosting nature of t-ADABOOST if we compare its convergence to that of ADABOOST: more often than not, it is in fact comparable to that of ADABOOST. While this applies broadly for t ě 0.6, we observed examples where much smaller values (even t 0.0) could yield such fast convergence. Importantly, this applies to clipped models as well and it is important to notice because it means attaining a low "boosted" error does not come at the price of learning models with large range. This is an interesting property: for t 0.0, we would be guaranteed that the computation of the clipped prediction is always in r 1, 1s. Generalizing our comment on small values of t above, we observed that an efﬁcient tuning algorithm for t could be able to get very substantial leverage over ADABOOST. Table 2 was crafted for a standard limit p-val of 0.1 and "blurs" the best results that can be obtained. On several domains (winered, abalone, eeg, creditcard, adult), applicable p-values for which we would conclude that some t 1 performs better than t 1 drop in

η 0.0 0.1 t 0.6 0.8 0.9 1.1 0.6 0.8 0.9 1.1 Jclipped K 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 #better 2 3 1 2 1 3 1 1 1 2 2 1 #equivalent 5 5 6 6 7 7 6 7 4 8 8 7 8 9 8 10 #worse 3 2 3 2 2 4 3 5 1 1 1 2

Table 2: Outcomes of student paired t-tests over 10 UCI domains, with training noise η P t0.0, 0.1u, for t P t0.6, 0.8, 0.9, 1.0, 1.1u and with / without clipped models. For each triple (η, t, Jclipped K), we give the number of domains for which the corresponding setting of t-ADABOOST is statistically better than ADABOOST(#better), the number for which it is statistically worse (#worse) and the number for which we cannot reject the assumption of identical performances. Threshold p val = 0.1.

between 7E 4 and 0.05. Unsurprisingly, ADABOOST also manages to beat signiﬁcantly alternative values of t in several cases. Our experiments with training noise (η 0.1) go in the same direction. Looking at Table 1, one could eventually be tempted to conclude that t slightly smaller than 1.0 may be a better choice than adaboosting (t 1), as suggested by our results for t 0.9, but we do not think this produces a general "rule-of-thumb". There is also no apparent "noise-dependent" pattern that would obviously separate the cases t ă 1 from t 1 even when the tempered exponential loss does not ﬁt to [15] s theory. Finally, looking at the results for t ą 1 also yields the same basic conclusions, which suggests that boosting can be attainable outside the range covered by our theory (in particular Theorem 2).

All this brings us to the experimental conclusion that the question does not reside on opposing the case t 1 to the case t 1. Rather, our experiments suggest pretty much like our theory does that the actual question resides in how to efﬁciently learn t on a domain-dependent basis. Our experiments indeed demonstrate that substantial gains could be obtained, to handle overﬁtting or noise.

8 Discussion, limitations and conclusion

ADABOOST is one of the original and simplest Boosting algorithms. In this paper we generalized ADABOOST to maintaining a tempered measure over the examples by minimizing a tempered relative entropy. We kept the setup as simple as possible and therefore focused on generalizing ADABOOST. However more advanced boosting algorithms have been designed based on relative entropy minimization subject to linear constraints. There are versions that constrain the edges of all past hypotheses to be zero [36]. Also, when the maximum margin of the game is larger than zero, then ADABOOST cycles over non-optimal solutions [27]. Later Boosting algorithms provably optimize the margin of the solution by adjusting the constraint value on the dual edge away from zero (see e.g. [24]). Finally, the ELRP-Boost algorithm optimizes a trade off between relative entropy and the edge [35]. We conjecture that all of these orthogonal direction have generalizations to the tempered case as well and are worth exploring.

These are theoretical directions that, if successful, would contribute to bring more tools to the design of rigorous boosting algorithms. This is important because boosting suffers several impediments, not all of which we have mentioned: for example, to get statistical consistency for ADABOOST, it is known that early stopping is mandatory [5]. More generally, non-Lipschitz losses like the exponential loss seem to be harder to handle compared to Lipschitz losses [33] (but they yield in general better convergence rates). The validity of the weak learning assumption of boosting can also be discussed, in particular regarding the negative result of [15] which advocates, beyond just better (ada)boosting, for boosting for more classes of models / architectures [16]. Alongside this direction, we feel that our experiments on noise handling give a preliminary account of the fact that there is no "one t ﬁts all" case, but a much more in depth analysis is required to elicit / tune a "good" t. This is a crucial issue for noise handling [16], but as we explain in Section 7, this could bring beneﬁts in much wider contexts as well.

Acknowledgments

The authors thank the reviewers for numerous comments that helped improving the paper s content.

[1] N. Alon, A. Gonen, E. Hazan, and S. Moran. Boosting simple learners. In STOC 21, 2021.

[2] S.-I. Amari. Information Geometry and Its Applications. Springer-Verlag, Berlin, 2016.

[3] E. Amid, F. Nielsen, R. Nock, and M.-K. Warmuth. Optimal transport with tempered exponential measures. Co RR, abs/2309.04015, 2023.

[4] E. Amid, R. Nock, and M.-K. Warmuth. Clustering above exponential families with tempered exponential measures. In 26th AISTATS, 2023.

[5] P. Bartlett and M. Traskin. Adaboost is consistent. In NIPS*19, 2006.

[6] L. Breiman, J. H. Freidman, R. A. Olshen, and C. J. Stone. Classiﬁcation and regression trees. Wadsworth, 1984.

[7] M. Collins, R. Schapire, and Y. Singer. Logistic regression, adaboost and Bregman distances. In Proc. of the 13 th International Conference on Computational Learning Theory, pages 158 169, 2000.

[8] Y. Freund and R. E. Schapire. A Decision-Theoretic generalization of on-line learning and an application to Boosting. J. Comp. Syst. Sc., 55:119 139, 1997.

[9] J. Friedman, T. Hastie, and R. Tibshirani. Additive Logistic Regression : a Statistical View of Boosting. Ann. of Stat., 28:337 374, 2000.

[10] M.J. Kearns. Thoughts on hypothesis boosting, 1988. ML class project.

[11] M.J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proc. of the 28 th ACM STOC, pages 459 468, 1996.

[12] J. Kivinen and M.-K. Warmuth. Boosting as entropy projection. In COLT 99, pages 134 144, 1999.

[13] D.-E. Knuth. Two notes on notation. The American Mathematical Monthly, 99(5):403 422, 1992.

[14] R. Kohavi. Improving accuracy by voting classiﬁcation algorithms: Boosting, bagging, and variants. In Workshop on Computation-Intensive Machine Learning Techniques, 1998.

[15] P.-M. Long and R.-A. Servedio. Random classiﬁcation noise defeats all convex potential boosters. MLJ, 78(3):287 304, 2010.

[16] Y. Mansour, R. Nock, and R.-C. Williamson. Random classiﬁcation noise does not defeat all convex potential boosters irrespective of model choice. In 40th ICML, 2023.

[17] J. Naudts. Generalized thermostatistics. Springer, 2011.

[18] L. Nivanen, A. Le Méhauté, and Q.-A. Wang. Generalized algebra within a nonextensive statistics. Reports on Mathematical Physics, 52:437 444, 2003.

[19] R. Nock and F. Nielsen. A Real Generalization of discrete Ada Boost. Artiﬁcial Intelligence, 171:25 41, 2007.

[20] R. Nock and F. Nielsen. On the efﬁcient minimization of classiﬁcation-calibrated surrogates. In NIPS*21, pages 1201 1208, 2008.

[21] R. Nock and F. Nielsen. The phylogenetic tree of Boosting has a bushy carriage but a single trunk. PNAS, 117:8692 8693, 2020.

[22] R. Nock and R.-C. Williamson. Lossless or quantized boosting with integer arithmetic. In 36th ICML, pages 4829 4838, 2019.

[23] J. R. Quinlan. C4.5 : programs for machine learning. Morgan Kaufmann, 1993.

[24] G. Rätsch and M.-K. Warmuth. Efﬁcient margin maximizing with boosting. JMLR, pages 2131 2152, december 2005.

[25] M.-D. Reid and R.-C. Williamson. Composite binary losses. JMLR, 11:2387 2422, 2010.

[26] M.-D. Reid and R.-C. Williamson. Information, divergence and risk for binary experiments. JMLR, 12:731 817, 2011. [27] C. Rudin, I. Daubechies, and R.-E. Schapire. Dynamics of adaboost: cyclic behavior and convergence of margins. JMLR, pages 1557 1595, December 2004. [28] L.-J. Savage. Elicitation of personal probabilities and expectations. J. of the Am. Stat. Assoc., pages 783 801, 1971. [29] R. E. Schapire. The strength of weak learnability. MLJ, pages 197 227, 1990. [30] R. E. Schapire and Y. Singer. Improved boosting algorithms using conﬁdence-rated predictions. MLJ, 37:297 336, 1999. [31] T. Sypherd, R. Nock, and L. Sankar. Being properly improper. In 39th ICML, 2022. [32] M. Telgarsky. A primal-dual convergence analysis of boosting. JMLR, 13:561 606, 2012. [33] M. Telgarsky. Boosting with the logistic loss is consistent. In 26 th COLT, pages 911 965, 2013. [34] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27:1134 1142, 1984. [35] M.-K. Warmuth, K.-A. Glocer, and S.-V.-N. Vishwanathan. Entropy regularized LPBoost. In Algorithmic Learning Theory, pages 256 271. Springer Berlin Heidelberg, 2008. [36] M.-K. Warmuth, J. Liao, and G. Rätsch. Totally corrective boosting algorithms that maximize the margin. In icml 06: proceedings of the 23rd international conference on machine learning, pages 1001 1008, 2006.

This is the Appendix to Paper "Boosting with Tempered Exponential Measures". To differentiate with the numberings in the main ﬁle, the numbering of Theorems, Lemmata, Deﬁnitions is letter-based (A, B, ...).

Table of contents

A short primer on Tempered Exponential Measures Pg 14

Supplementary material on proofs Pg 15 ãÑ Proof of Theorem 1 Pg 15 ãÑ Proof of Theorem 2 Pg 18 ãÑ Proof of Theorem 3 Pg 27 ãÑ Proof of Theorem 4 Pg 27

Supplementary material on experiments Pg 32

I A short primer on Tempered Exponential Measures

We describe here the minimal amount of material necessary to understand how our approach to boosting connects to these measures. We refer to [4] for more details. With a slight abuse of notation, we deﬁne the perspective transforms plogtq pzq . t logt pz{t q and pexptq pzq . t expt pz{t q. Recall that t . 1{p2 tq. Deﬁnition A. [4] A tempered exponential measure (TEM) family is a set of unnormalized densities in which each element admits the following canonical expression:

qt|θpxq . exptpθJϕpxqq

exptp Gtpθqq exptpθJϕpxq at Gtpθqq ˆ a at b . a b 1 p1 tqb

where θ is the element s natural parameter, ϕpxq is the sufﬁcient statistics and

Gtpθq plogtq ż pexptq pθJϕpxqqdξ

is the (convex) cumulant, ξ being a base measure (implicit).

Except for t 1 (which reduces a TEM family to a classical exponential family), the total mass of a

TEM is not 1 (but it has an elegant closed form expression [4]). However, the exponentiated q1{t

t|θ does sum to 1. In the discrete case, this justiﬁes extending the classical simplex to what we denote as the co-simplex.

Deﬁnition B. The co-simplex of Rm, m is deﬁned as m . tq P Rm : q ě 0 1Jq1{t 1u.

The connection between t-ADABOOST s update and TEM s is immediate from the equation s update ((4) in MF). We can show that m can also be represented as TEMs.

Lemma A. m is a (discrete) family of tempered exponential measures.

Proof. We proceed as in [2, Section 2.2.2] for exponential families: let q P m, which we write

i Prms qi Ji n K, n P rms. (22)

JπK, the Iverson bracket [13], takes value 1 if Boolean predicate π is true (and 0 otherwise). We create m 1 natural parameters and the cumulant,

θi . logt qi qm , i P rm 1s ; Gtpθq . logt 1 qm ,

and end up with (22) also matching the atom mass function

qpnq expt ř

i Prm 1s θi Ji n K

expt Gtpθq ,

which clearly deﬁnes a tempered exponential measure over rms. This ends the proof of Lemma A.

II Supplementary material on proofs

II.1 Proof of Theorem 1

To improve readability, we remove dependency in t in normalization coefﬁcient Z. We use notations from [4, proof of Theorem 3.2] and denote the Lagrangian

i νi qi µ ÿ

i qiui, (23)

which yields BL{B qi logt qi logt qi λ q1 t i νi µui (λ absorbs factor 2 t), and, rearranging (absorbing factor 1 t in νi),

p1 p1 tqλq q1 t i νi 1 p1 tqplogt qi µuiq, @i P rms. (24)

We see that λ 1{p1 tq otherwise the Lagrangian drops its dependence in the unknown. In fact, the solution necessarily has 1 p1 tqλ ą 0. To see this, we distinguish two cases: (i) if some uk 0, then since logt qk ě 1{p1 tq there would be no solution to (24) if 1 p1 tqλ ă 0 because of the KKT conditions νi ě 0, @i P rms; (ii) otherwise, if all uk 0, @k P rms, then there must be two coordinates of different signs otherwise there is no solution to our problem (3) (main ﬁle, we must have indeed q ě 0 because of the co-simplex constraint). Thus, there exists at least one coordinate k P rms for which p1 tqµuk ą 0 and since logt qk ě 1{p1 tq (deﬁnition of logt) and νk ě 0 (KKT conditions), the RHS of (24) for i k is ą 0, preventing 1 p1 tqλ ă 0 in the LHS.

We thus have 1 p1 tqλ ą 0. The KKT conditions pνi ě 0, νi qi 0, @i P rmsq yield the following: 1 p1 tqplogt qi µuiq ą 0 imply νi 0 and 1 p1 tqplogt qi µuiq ď 0 imply q1 t i 0 so we get the necessary form for the optimum:

qi expt plogt qi µuiq

qi bt exptp µuiq

where λ or Zt . expt λ ensures normalisation for the co-density. Note that we have a simpliﬁed expression for the co-density:

pi pji bt expt p µui{t q

Zco t , (26)

with Zco t . Z1{t

i pji bt expt p µui{t q. For the analytic form in (25), we can simplify the Lagrangian to a dual form that depends on µ solely:

Dpµq p qpµq}qq µ ÿ

i qipµqui. (27)

The proof of (5) (main ﬁle) is based on a key Lemma.

Lemma B. For any q having form (25) such that q Ju 0, Dpµq logt Ztpµq.

Proof. For any q having form (25), denote

rms . ti : qi 0u. (28)

We ﬁrst compute (still using λ . logt Ztpµq for short):

i qi logt qi

i Prms qi logt

ˆexpt plogt qi µuiq

i Prms qi ˆ 1 1 t 1 p1 tqplogt qi µuiq

i Prms qi ˆq1 t i p1 tqµui 1 p1 tqλ

µ 1 p1 tqλ ÿ

i Prms qiui 1 p1 tqp1 p1 tqλq ÿ

i Prms qiq1 t i 1 1 t ÿ

µ 1 p1 tqλ q Ju loooooooooooomoooooooooooon

1 p1 tqp1 p1 tqλq ÿ

i Prms qiq1 t i looooooooooooooooooooomooooooooooooooooooooon

i Prms qi looooooomooooooon

Remark that in the last identity, we have put back summations over the complete set rms of indices. We note that B 0 because q Ju 0. We then remark that without replacing the expression of q, we have in general for any q P m:

i Prms qi plogt qi logt qiq

i Prms qi ˆ 1 1 t q1 t i 1 1 1 t q1 t i 1

i Prms q2 t i 1 1 t ÿ

i Prms qiq1 t i

i Prms qiq1 t i

and we can check that for any q, q P m, E p q}qq. We then develop p q}qq with a partial replacement of q by its expression:

i qi logt qi

i qiq1 t i 1 1 t ÿ

1 1 t ˆ 1 1 p1 tqλ 1 ÿ

λ 1 p1 tqλ ÿ

λ 1 p1 tqλ p1 p1 tq p q}qqq .

Rearranging gives that for any q, q P m such that (i) q has the form (25) for some µ P R and (ii) q Ju 0, p q}qq λ logtp Ztq,

as claimed. This ends the proof of Lemma B.

We thus get from the deﬁnition of the dual that µ arg max logt Ztpµq arg min Ztpµq. We have the explicit form for Zt:

i exp2 t t plogt qi µuiq

i Prms exp2 t t plogt qi µuiq

where rms is deﬁned in (28). We remark that the last expression is differentiable in µ, and get

Z1 tpµq 1 2 t

i Prms exp2 t t plogt qi µuiq

i Prms exp1 t t plogt qi µuiq expt t plogt qi µuiq ui

i Prms expt plogt qi µuiq ui

i Prms qiui

i Prms qiui

Zt t q Ju, (30)

B logtp Ztq

Bµ Z t t Z1 t

and we get that any critical point of Ztpµq satisﬁes qpµq Ju 0. A sufﬁcient condition to have just one critical point, being the minimum sought is the strict convexity of Ztpµq. The next Lemma provides the proof that it is for all t ą 0.

Lemma C. Z2 t pµq ě t Ztpµq2t 1p qpµq Juq2.

Proof. After simpliﬁcations, we have

Z3 2t t Z2 t pt 1q

i Prms expt plogt qi µuiq ui

i Prms exp2 t t plogt qi µuiq

i Prms expt t plogt qi µuiq u2 i

i,k Prms Qi Qkuiuk ÿ

i,k Prms Q2 t i Qt ku2 k, (33)

where we have let Qi . expt plogt qi µuiq ě 0. Since a2 b2 ě 2ab, we note that for any i k,

Q2 t i Qt ku2 k Q2 t k Qt iu2 i ě 2 b

Q2 t i Qt k Q2 t k Qt iuiuk 2Qi Qkuiuk, (34)

so we split (33) in two terms and get

Z3 2t t Z2 t pt 1q ÿ

i Prms Q2 i u2 i ÿ

i Prms Q2 t i Qt iu2 i

i,k Prms,iăk 2pt 1q Qi Qkuiuk ÿ

i,k Prms,iăk Q2 t i Qt ku2 k Q2 t k Qt iu2 i

i Prms Q2 i u2 i

i,k Prms,iăk 2pt 1q Qi Qkuiuk ÿ

i,k Prms,iăk Q2 t i Qt ku2 k Q2 t k Qt iu2 i

i Prms Q2 i u2 i 2t ÿ

i,k Prms,iăk Qi Qkuiuk (35)

i Prms expt plogt qi µuiq ui

t Z2 t p q Juq2, (36)

where we have used (34) in (35). Since Ztpµq ą 0, we get the statement of Lemma C after reorganising (36).

Lemma C shows the strict convexity of Ztpµq for any t ą 0. The case t 0 follows by direct differentiation: we get after simpliﬁcation

i Prms u2 i ř

i Prmspqi µuiq2 ř

i Prmspqi µuiqui 2

i Prmspqi µuiq2 3

Cauchy-Schwartz inequality allows to conclude that Z2 t pµq ě 0 and is in fact ą 0 unless q is collinear to u. This completes the proof of Theorem 1.

II.2 Proof of Theorem 2

The proof involves several arguments, organized into several subsections. Some are more general than what is strictly needed for the proof of the Theorem, on purpose.

II.II.2.1 Clipped summations

For any δ ě 0, we deﬁne clipped summations of the sequence of ordered elements v1, v2, ..., v J: if J ą 1,

j 1 vj . min

j 1 vj . max

and the base case (J 1) is obtained by replacing the inner sum by 0. We also deﬁne the doubly clipped summation:

j 1 vj . max

with the same convention for the base case. We prove a series of simple but useful properties of the clipped summation.

Lemma D. The following properties hold true for clipped summation:

1. (doubly) clipped summations are noncommutative;

2. (doubly) clipped summations are ordinary summation in the limit: for any J ě 1 and any sequence v1, v2, ..., v J,

j 1 vj lim δÑ 8

j 1 vj lim δÑ 8

3. clipped summations sandwich ordinary summation and the doubly clipped summation: for any δ ě 0, any J ě 1 and any sequence v1, v2, ..., v J,

Proof. Noncommutativity follows from simple counterexamples: for example, for v . 1 and

w . 2, if we ﬁx v1 . v, v2 . w, then p0qÿ2

j 1 vj 1 while p0qÿ2

j 1 v3 j 1. Property [2.] is trivial. The set of leftmost inequalities of property [3.] can be shown by induction, noting the base case is trivial and otherwise, using the induction hypothesis in the leftmost inequality,

j 1 vj . min

and similarly

j 1 vj . max

A similar argument holds for the set of rightmost inequalities: for example, the induction s general case holds

j 1 vj . min

for the leftmost inequality. This ends the proof of Lemma D.

II.II.2.2 Unravelling weights

Lemma E. Deﬁne

convention :

Then @J ě 1, weights unravel as:

1 mt śJ j 1 Ztj expt

ˆ p1{1 tqÿJ

1 mt śJ j 1 Ztj expt

Proof. We start for the case t ă 1. We proceed by induction, noting ﬁrst that the normalization constraint for the initial weights imposes q1i 1{m1{p2 tq 1{mt and so (using p1 tqt 1 t )

q2i exptplogt q1i µ1u1iq

1 Z1 1 p1 tq ˆ 1 1 t ˆ 1

m 1 t 2 t 1 µ1u1i

1 Z1 1 m1 t p1 tqµ1u1i

1 mt Z1 1 p1 tqm1 t µ1u1i ı 1 1 t

1 mt Z1 expt

completing the base case. Using the induction hypothesis, we unravel at iteration J 1:

exptplogt q Ji µJu Jiq

ˆ 1 mt śJ 1 j 1 Ztj expt

ˆ p1{1 tqÿJ 1

ˆ logt expt

ˆ p1{1 tqÿJ 1

at logt mt śJ 1 j 1 Ztj µJu Ji

max " 1 1 t, p1{1 tqÿJ 1

* logt mt śJ 1 j 1 Ztj

1 p1 tq logt mt śJ 1 j 1 Ztj µJu Ji

% 1 1 t , p1{1 tqÿJ 1

- p1 tq logt mt śJ 1 j 1 Ztj

1 p1 tq logtpmt śJ 1 j 1 Ztjq p1 tqµJu Ji

% 1 1 t , p1{1 tqÿJ 1

- ˆ mt śJ 1 j 1 Ztj 1 t 1

pmt śJ 1 j 1 Ztjq 1 t

p1 tqµJu Ji

which simpliﬁes into (using p1 tqt 1 t )

mt śJ j 1 Ztj

J 1 p1{1 tqÿ

mt śJ j 1 Ztj expt p SJq ,

J 1 p1{1 tqÿ

- v Ju Ji, 1 1 t

%v Ju Ji min

J 1 p1{1 tqÿ

J 1 p1{1 tqÿ

j 1 vjuji, 1 1 t

(we used twice the deﬁnition of clipped summation), which completes the proof of Lemma E for t ă 1.

We now treat the case t ą 1. The base induction is equivalent, while unraveling gives, instead of (39):

mt śJ j 1 Ztj

mt śJ j 1 Ztj expt p SJq ,

and, this time,

- v Ju Ji, 1 t 1

%v Ju Ji max

j 1 vjuji, 1 t 1

j 1 vjuji, (43)

which completes the proof of Lemma E.

II.II.2.3 Introducing classiﬁers

Ordinary linear separators Suppose we have a classiﬁer

j 1 β1 t j µj hjpxq, βj . mt j 1 ź

where µj P R, @j P r Js. We remark that Jz r K ď exp2 t t p zrq for any t ď 2 and z, r P R, and z ÞÑ exp2 t t p zq is decreasing for any t ď 2, so using [3.] in Lemma D, we get for our training

sample S . tpxi, yiq, i P rmsu and any t ă 1 (from Lemma E),

i Prms Jsignp HJpxiqq yi K

ˆ řJ j 1 m1 t śj 1 k 1 Ztk 1 t µj yihjpxiq

ˆ p1{1 tqÿJ

j 1 m1 t śj 1 k 1 Ztk 1 t µj yihjpxiq

ˆ p1{1 tqÿJ

vj . m1 t j 1 ź

µj ; uji . yihjpxiq. (45)

Using Lemma E with those deﬁnitions, we get

i Prms Jsignp HJpxiqq yi K ď ÿ

qp J 1qimt śJ j 1 Ztj 2 t

j 1 Z2 t tj ÿ

i Prms q2 t p J 1qi

j 1 Z2 t tj , (46)

because q J P m. We thus have proven the following Lemma.

Lemma F. For any t ă 1 and any linear separator

j 1 β1 t j µj hjpxq,

βj . mt j 1 ź

k 1 Ztk, µj P R, hj P RX, @j P r Js

where Ztk is the normalization coefﬁcient of q in (25) with uji . yihjpxiq,

i Prms Jsignp HJpxiqq yi K ď

j 1 Z2 t tj . (47)

Clipped linear separators Suppose we have a classiﬁer (t ă 1)

Hp1{1 tq J pxq .

j 1 β1 t j µj hjpxq, βj . mt j 1 ź

We can now replace (44) by

i Prms Jsignp Hp1{1 tq J pxiqq yi K

ˆ yi p1{1 tq

j 1 m1 t śj 1 k 1 Ztk 1 t µj hjpxiq

j 1 m1 t śj 1 k 1 Ztk 1 t µj yihjpxiq

ˆ p1{1 tqÿJ

j 1 m1 t śj 1 k 1 Ztk 1 t µj yihjpxiq

ˆ p1{1 tqÿJ

The ﬁrst identity has used the fact that yi P t 1, 1u, so it can be folded in the doubly clipped summation without changing its value, and the second inequality used [3.] in Lemma D. This directly leads us to the following Lemma.

Lemma G. For any t ă 1 and any clipped linear separator

Hp1{1 tq J pxq .

j 1 β1 t j µj hjpxq,

βj . mt j 1 ź

k 1 Ztk, µj P R, hj P RX, @j P r Js

where Ztk is the normalization coefﬁcient of q in (25) with uji . yihjpxiq,

i Prms Jsignp Hp1{1 tq J pxiqq yi K ď

j 1 Z2 t tj . (49)

II.II.2.4 Geometric convergence of the empirical risk

To get the right-hand side of (47) and (49) as small as possible, we can independently compute each µj so as to minimize

Z2 t tj pµq . ÿ

i Prms exp2 t t plogt qji µujiq . (50)

We proceed in two steps, ﬁrst computing a convenient upperbound for (50), and then ﬁnding the µ that minimizes this upperbound.

Step 1: We distinguish two cases depending on weight qji. Let rms j . ti : qji ą 0u and rms: j . ti : qji 0u:

Case 1 i P rms j . Let rji . uji{q1 t ji and suppose Rj ą 0 is a real that satisﬁes

|rji| ď Rj, @i P rms j . (51)

For any convex function f deﬁned on r 1, 1s, we have fpzq ď pp1 zq{2q fp1q pp1 zq{2q fp 1q, @z P r 1, 1s (the straight line is the chord crossing f at z 1, 1). Because

2 t 1 t is convex for t ď 2, for any i P rms j

exp2 t t plogt qji µujiq

q1 t ji p1 tqµuji 2 t

q2 t ji 1 p1 tqµRj rji

ď q2 t ji Rj rji

2Rj r1 p1 tqµRjs

2 t 1 t q2 t ji Rj rji

2Rj r1 p1 tqµRjs

q2 t ji Rj qjiuji

2Rj r1 p1 tqµRjs

2 t 1 t q2 t ji Rj qjiuji

2Rj r1 p1 tqµRjs

Case 2 i P rms: j. Let q: j ą 0 be a real that satisﬁes

q: j 1 t ă Rj, @i P rms: j. (52)

Using the same technique as in case 1, we ﬁnd for any i P rms: j

exp2 t t plogt qji µujiq

ˆ 1 1 t µuji

r p1 tqµujis

ď q: j 1 t p1 tqµuji ı 2 t

ď q: j 2 t Rj q: juji 2Rj r1 p1 tqµRjs

2 t 1 t q: j 2 t Rj q: juji 2Rj r1 p1 tqµRjs

Folding both cases into one and letting

q1 ji . " qji if i P rms j q: j if i P rms: j , (53)

we get after summation, using m: j . Cardprms: jq and

p1 m: jq: j 2 tq Rj ÿ

i Prms q1 jiuji p P r 1, 1sq, (54)

Z2 t tj pµq

ď p1 m: jq: j 2 tq Rj 2Rj ˆ p1 ρjq r1 p1 tqµRjs

2 t 1 t p1 ρjq r1 p1 tqµRjs

1 m: jq: j 2 t

2 p1 ρjq exp2 t t p µRjq p1 ρjq exp2 t t pµRjq . (55)

Step 2: we have our upperbound for (50). We now compute the minimizer µ of (55). If this minimizer satisﬁes

|µ | ă 1 Rj|1 t|, (56)

then it can be found by ordinary differentiation, as the solution to

p1 ρjq expt pµ Rjq p1 ρjq expt p µ Rjq 0,

which is equivalently

expt p µ Rjq

expt pµ Rjq expt p µ Rj at µ Rjq

1 ρj 1 ρj ,

where we recall a at b . pa bq{p1 p1 tqbq. Solving it yields

µ 1 Rj 1 1 t ˆp1 ρjq1 t p1 ρjq1 t

p1 ρjq1 t p1 ρjq1 t

1 Rj 1 1 t ˆ 2p1 ρjq1 t

p1 ρjq1 t p1 ρjq1 t 1

ˆ 1 ρj M1 tp1 ρj, 1 ρjq

where Mqpa, bq . ppaq bqq{2q1{q is the power mean with exponent q. We now check (56).

Lemma H. For any t P R, let

ˆ 1 ρj M1 tp1 ρj, 1 ρjq

Then |µj| ď 1{p Rj|1 t|q.

Proof. Equivalently, we must show ˇˇˇˇlogt

ˆ 1 z M1 tp1 z, 1 zq

ˇˇˇˇ ď 1 |1 t|, @z P r 1, 1s,

which is equivalent to showing

ˇˇˇˇ 2p1 zq1 t

p1 zq1 t p1 zq1 t 1 ˇˇˇˇ

1 1 z 1 z 1 t

1 1 z 1 z 1 t

ď 1, @z P r 1, 1s.

Deﬁne function fpz, tq . p1 z1 tq{p1 z1 tq over Rě0 ˆ R: it is easy to check that for t ď 1, fpz, tq P r 1, 1s, and the symmetry fpz, tq fpz, 2 tq also allows to conclude that for t ě 1, fpz, tq P r 1, 1s. This ends the proof of Lemma H.

For the expression of µj in (57), we get from (55) the upperbound on Z2 t tj pµjq:

Z2 t tj pµjq ď 1 m: jq: j 2 t

2 p1 ρjq exp2 t t p µj Rjq p1 ρjq exp2 t t pµj Rjq

1 m: jq: j 2 t

p1 ρjqp1 ρjq2 t

M 2 t 1 t p1 ρj, 1 ρjq p1 ρjqp1 ρjq2 t

M 2 t 1 t p1 ρj, 1 ρjq

1 m: jq: j 2 t p1 ρ2 jq M 1 t 1 t p1 ρj, 1 ρjq

M 2 t 1 t p1 ρj, 1 ρjq

1 m: jq: j 2 t 1 ρ2 j M1 tp1 ρj, 1 ρjq.

We conclude that for both sets of classiﬁers deﬁned in Lemmata F and G, with the choice of µj in (57), we get

i Prms Jsignp Hpxiqq yi K ď

1 m: jq: j 2 t 1 ρ2 j M1 tp1 ρj, 1 ρjq, @H P t HJ, Hp1{1 tq J u.

To complete the proof of Theorem 2, we just need to elicit the best Rj (51) and q: j (52); looking at their constraints suggests

Rj . max i Rrms: j

q: j . maxi Prms: j |yihjpxiq|1{p1 tq

R1{p1 tq j .

This completes the proof of Theorem 2. We complete the proof by two Lemmata of additional useful results in the context of Algorithm t-ADABOOST, and ﬁnally an important remark on the interpretation of Theorem 2.

Lemma I. The following holds true: (i) ρj P r 1, 1s; (ii) if, among indexes not in rms: j, there exists at least one index with uji ą 0 and one index with uji ă 0, then for any µ 0, Z2 t tj pµq ą 0 in (50) (in words, the new weigh vector qj 1 cannot be the null vector before normalization).

Proof. To show (i) for ρj ď 1, we write (using uji . yihjpxiq, @i P rms for short),

p1 m: jq: j 2 tq Rj ρj ÿ

i Prms q1 jiuji

i Prms q1 ji|uji|

q2 t ji |uji|

q1 t ji q: j 2 t

i Prms: j |uji|

looooomooooon

q: j 2 t Rj ř

i Prms: j |uji|

maxi Prms: j |uji|

ď Rj q: j 2 tm: j Rj p1 m: jq: j 2 tq Rj, showing ρj ď 1. Showing ρj ě 1 proceeds in the same way. Property (ii) is trivial.

Ktpzq ď exp ˆ ˆ 1 t

Proof. We remark that for t P r0, 1q, z ě 0, K1 tpzq is concave and K2 t p0q p2 tq, so K1 tpzq ď p2 tqz, @z ě 0, from which it follows by integration

Ktpzq ď 1 ˆ 1 t

and since 1 z ď expp zq, we get the statement of the Lemma.

Remark 1. The interpretation of Theorem 2 for t ă 1 are simpliﬁed to the case where there is no weight switching, i.e. m: j 0, @j. While we have never observed weight switching in our experiments perhaps owing to the fact that we did never boost for a very long number of iterations or just because our weak classiﬁers, decision trees, were in fact not so weak , it is interesting, from a theoretical standpoint, to comment on convergence when this happens. Let Qj . 1 m: jpq: jq2 t and ρj . Qjρj (Notations from Theorem 2). We note that ρj β Epjryihjpxiqs, where pj lives on the simplex and |yh| ď 1, β ď 1. Using Lemma J and (12) (main ﬁle), to keep geometric convergence, it is roughly sufﬁcient that Qj log Qj ď p ρjq2{p2t q. Since q: j is homogeneous to a tempered weight, one would expect in general m: jpq: jq2 t ! 1, so using the Taylor approximation Qj log Qj 1 Qj, one gets the reﬁned sufﬁcient condition for geometric convergence

m: jpq: jq2 t ď p ρjq2{p2t q Opp ρjq2q. What does that imply? We have two cases:

If this holds, then we have geometric convergence;

if it does not hold, then for a "large" number of training examples, we must have qji 0 which, because of the formula for q (8) implies that all these examples receive the right class with a sufﬁciently large margin. Breaking geometric convergence in this case is not an issue: we already have a good ensemble.

II.3 Proof of Theorem 3

Starting from the proof of Theorem 2, we indicate the additional steps to get to the proof of Theorem 3. The key is to remark that our margin formulation has the following logical convenience:

Jνtppxi, yiq, Hq ď θK J y Hpxq logt

p1 tqy Hpxq logt

Jp y Hpxqq t logt

We then remark that since Jz ě 0K ď exp2 t t pzq, we get

Jνtppxi, yiq, Hq ď θK ď exp2 t t

ˆ p y Hpxqq t logt

exp2 t t p y Hpxqq

2 t exp2 t t p y Hpxqq.

We then just have to branch to (44), replacing the Jsignp HJpxiqq yi Ks by Jνtppxi, yiq, Hq ď θK, which yields in lieu of (46) the sought inequality:

Ft,θp H, Sq ď ˆ1 θ

j 1 Z2 t tj . (58)

II.4 Proof of Theorem 4

The proof proceeds in three parts. Part (A) makes a brief recall on encoding linear classiﬁers with decision trees. Part (B) solves (6) in MF, i.e. ﬁnds boosting s leveraging coefﬁcients as solution of:

qpµq Ju 0. (59)

we then simplify the loss obtained and elicit the conditional Bayes risk of the tempered loss, i.e. (20) in MF. Part (C) elicits the partial losses and shows properness and related properties.

Part (A): encoding linear models with a tree architecture We use the reduction trick of (author?) [16] to design a decision tree (DT) boosting procedure, ﬁnd out the (concave) loss equivalently minimized, just like in classical top-down DT induction algorithms [6]. The trick is simple: a DT can be thought of as a set of constant linear classiﬁers. The prediction is the sum of predictions put at all nodes. Boosting ﬁts those predictions at the nodes and percolating those to leaves gets a standard DT with real predictions at the leaves. Figure 2 provides a detailed description of the procedure. Let λ denote a leaf node of the current tree H, with Hλ P R the function it implements for leaf λ. If parentpλq denotes its parent node (assuming wlog it is not the root node), we have

Hλ . Hparentpλq µλhλ, (60)

Part (B): eliciting the Bayes risk of the tempered loss With our simple classiﬁers at hand, the tempered exponential loss Z2 t tj in (14) (MF) can be simpliﬁed to loss

i exp2 t t logt q1i yi Hλpxiq

λPΛp Hq m λ exp2 t t plogt q1i Hλq m λ exp2 t t plogt q1i Hλq , (61)

yuvxa7Mfp89MXozmg8+nb0c PTDa H90MKj ZPTb6Pf RH6uv V/9c/Wv174p6/Vptszrqf Fb/+Rfp/s VE</latexit>µ5b5

7X4vd GH0+m J0Zz Qef Tt6OPpht D86GOFRMvpt9Pvoj9Xq3+u/r X6d0W9fq2WR1Pqv/Av C4s VC</latexit>µ4b4

7hdfy120/v E+9S76428b7y H3vfevnfoh R72fv N+9/5YFat/rv61+nd Ffe NGLb Pqd T6r/w Ln Ab CNw=</latexit>+ +

x2 a2 x2 < a2

x3 a3 x3 < a3

x2 a2 x2 < a2

x3 a3 x3 < a3

Figure 2: The weak learner provides weak hypotheses of the form Jxk ě aj K bj. From the boosting standpoint, this weak hypothesis is "as good as" the weak hypothesis hjpxq . Jxk ă aj K bj. The predicates of both are used to craft a split, e.g. for the root (in our depiction, b3 b2) and then solving (59) provides the leveraging coefﬁcients µ.. We then repeat this for as many splits as necessary. At the end, we can "percolate" nodes reals towards the leaves below and get an equivalent classiﬁer that resembles a decision tree (right). See [16] for further details.

where λpxq is the leaf reached by observation x and λp Hq its set of leaf nodes of H, and Hλ sums all relevant values in (60). Also, m λ , m λ denote the cardinal of positive and negative examples at λ and pλ . m λ {pm λ m λ q the local proportion of positive examples at λ, and ﬁnally rλ . pm λ m λ q{m the total proportion of examples reaching λ. Theorem A. If we compute µλ the solution of (59), we end up with the prediction Hλ:

Hλ q1 t 1i 1 t

1 t 1 m λ m λ

q1 t 1i 1 t p1 t λ p1 pλq1 t

p1 t λ p1 pλq1 t , (63)

and the loss of the decision tree equals:

λPΛp Hq rλ 2pλp1 pλq M1 tppλ, 1 pλq, (64)

Eλr Lptqppλqs. (65)

Proof. To compute µλ, (6) is reduced to the examples reaching λ, that is, it simpliﬁes to

m λ expt logt q1i Hparentpλq Rλµλhλ m λ expt logt q1i Hparentpλq Rλµλhλ (66)

that we solve for µλ. Equivalently,

expt logt q1i Hparentpλq Rλµλhλ

expt logt q1i Hparentpλq Rλµλhλ m λ m λ ,

or, using exptpuq{ exptpvq exptpu at vq,

2Hparentpλq 2Rλµλhλ 1 p1 tqplogt q1i Hparentpλq Rλµλhλq logt

after reorganizing:

Rλµλhλ p1 p1 tqplogt q1i Hparentpλqqq logt m λ m λ

2Hparentpλq

2 p1 tq logt m λ m λ

which yields the prediction at λ:

Hλ Hparentpλq p1 p1 tqplogt q1i Hparentpλqqq logt m λ m λ

2Hparentpλq

2 p1 tq logt m λ m λ

p1 p1 tq logt q1iq logt m λ m λ

2 p1 tq logt m λ m λ

q1 t 1i logt m λ m λ

2 p1 tq logt m λ m λ

q1 t 1i 1 t

1 t 1 m λ m λ

q1 t 1i 1 t p1 t λ p1 pλq1 t

p1 t λ p1 pλq1 t . (71)

We plug Hλ back in the loss for all leaves and get, using q1i 1{m1{p2 tq:

m λ exp2 t t logt q1i q1 t 1i 1 t p1 t λ p1 pλq1 t

p1 t λ p1 pλq1 t

m λ exp2 t t logt q1i q1 t 1i 1 t p1 t λ p1 pλq1 t

p1 t λ p1 pλq1 t . (72)

We simplify. First,

m λ exp2 t t

ˆ logt q1i q1 t 1i 1 t p1 t λ p1 pλq1 t

p1 t λ p1 pλq1 t

q1 t 1i ˆ 1 p1 t λ p1 pλq1 t

p1 t λ p1 pλq1 t

m λ m 2p1 pλq1 t

p1 t λ p1 pλq1 t

m λ m ˆ 1 pλ M1 tppλ, 1 pλq

m λ exp2 t t

ˆ logt q1i q1 t 1i 1 t p1 t λ p1 pλq1 t

p1 t λ p1 pλq1 t

q1 t 1i ˆ 1 p1 t λ p1 pλq1 t

p1 t λ p1 pλq1 t

m λ m 2p1 t λ p1 t λ p1 pλq1 t

m λ m ˆ pλ M1 tppλ, 1 pλq

and we can simplify the loss,

λPΛp Hq rλpλ

ˆ 1 pλ M1 tppλ, 1 pλq

2 t rλp1 pλq ˆ pλ M1 tppλ, 1 pλq

λPΛp Hq rλ pλp1 pλq2 t p1 pλqp2 t λ M 2 t 1 t ppλ, 1 pλq (78)

λPΛp Hq rλ pλp1 pλq pp1 t λ p1 pλq1 tq M 2 t 1 t ppλ, 1 pλq (79)

λPΛp Hq rλ 2pλp1 pλq M 1 t 1 t ppλ, 1 pλq M 2 t 1 t ppλ, 1 pλq (80)

λPΛp Hq rλ 2pλp1 pλq M1 tppλ, 1 pλq, (81)

as claimed. This ends the proof of Theorem A.

Part (C): partial losses and their properties The proof relies on the following Theorem. We recall that a loss is symmetric iff its partial losses satisfy ℓ1puq ℓ 1p1 uq, @u P r0, 1s [20] and differentiable iff its partial losses are differentiable.

Theorem B. Suppose t ă 2. A set of partial losses having the conditional Bayes risk Lptq in (65) are

ℓptq 1 puq . ˆ 1 u M1 tpu, 1 uq

2 t , ℓptq 1puq . ℓptq 1 p1 uq. (82)

The tempered loss is then symmetric and differentiable. It is strictly proper for any t P p 8, 2q and proper for t 8.

Proof. Symmetry and differentiability are straightforward. To check strict properness, we analyze the cases for t 1 (otherwise, it is Matusita s loss, and thus strictly proper), we compute the solution u to B Bu Lpu, vq 0. (83)

To this end, let Npuq . vp1 uq2 t p1 vqu2 t and the q-sum

Sqpa, bq . paq bqq1{q 21{q Mqpa, bq. (84)

We also let Dpuq . S2 t 1 tpu, 1 uq. Noting L ptqpu, vq 2 2 t 1 t Npuq{Dpuq and Dpuq 0, @u P r0, 1s, the set of solutions of (83) are the set of solutions to N 1puq Dpuq Npuq D1puq, which boils down, after simpliﬁcation, to

pp1 vqu1 t vp1 uq1 tq S2 t 1 tpu, 1 uq

pvp1 uq2 t p1 vqu2 tqpu t p1 uq tq S1 tpu, 1 uq,

developing and simplifying yields a ﬁrst simpliﬁed expression p1 2vqpup1 uqq1 t vu 1p1 uq2 t p1 vqp1 uq tu2 t, which, after reorganising to isolate expressions depending on v, yields

pup1 uqq1 t p1 uq tu2 t v u tp1 uq2 t p1 uq tu2 t 2pup1 uqq1 t . (85)

Assuming v P p0, 1q, we multiply by pup1 uqq1 t (we shall check u P p0, 1q) and simplify, which yields up1 uq u2 vpp1 uq2 u2 2up1 uqq, and indeed yields

and we check from (85) that if v 0 (resp. v 1), then necessarily u 0 (resp. u 1). To complete the proof, using the previous derivations, we can then simplify

B Bu Lpu, vq p2 tq 2 2 t 1 t u v pup1 uqqt S3 2t 1 t pu, 1 uq, (87)

which shows that if 2 t ą 0 but t 8, u v is a strict minimum of the pointwise conditional risk, completing the proof for strict properness. Strict properness is sufﬁcient to show by a simple computation that Lptq is (65). For t 8, we pass to the limit and use the fact that we can also write

ℓptq 1 puq 1

M1 t ˆ 1, u 1 u 1

t pwe recall t . 1{p2 tqq (88)

t Ñ 8 is equivalent to t Ñ 0 . If u ă 1{2, u{p1 uq ă 1 and so we see that

lim t Ñ0 M1 t

because M1 is the arithmetic mean. When u ą 1{2, u{p1 uq ą 1 and so this time

lim t Ñ0 M1 t

ℓp 8q 1 puq 2 Ju ď 1{2K, (89)

which is (twice) the partial loss of the 0/1 loss [26].

This ends the proof of Theorem 4.

Domain Source m d sonar UCI 208 60 https://archive.ics.uci.edu/ml/datasets/wine+quality winered UCI 1 599 12 https://archive.ics.uci.edu/ml/datasets/wine+quality abalone UCI 4 177 9 https://archive.ics.uci.edu/ml/datasets/abalone qsar UCI 1 055 41 https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation winewhite UCI 4 898 12 https://archive.ics.uci.edu/ml/datasets/wine+quality hillnonoise UCI 1 212 101 http://archive.ics.uci.edu/ml/datasets/hill-valley hillnoise UCI 1 212 101 http://archive.ics.uci.edu/ml/datasets/hill-valley eeg UCI 14 980 15 https://archive.ics.uci.edu/ml/datasets/EEG+Eye+State creditcard UCI 14 599 24 https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients adult UCI 32 561 15 https://archive.ics.uci.edu/ml/datasets/adult

Table A3: Public domains considered in our experiments (m total number of examples, d total number of example s features, including class), ordered in increasing m ˆ d (see text). (*) ﬁrst m rows in the domain.

III Supplementary material on experiments

III.1 Domains

Table A3 presents the 10 domains we used for our experiments.

III.2 Implementation details and full set of experiments on linear combinations of decision trees

Summary This Section depicts the full set of experiments summarized in Table 2 (MF), from Table A4 to Table A15. Tables are ordered in increasing size of the domain (Table A3). In all cases, up to J 20 trees have been trained, of size 15 (total number of nodes, except the two biggest domains, for which the size is 5). For all datasets, except creditcard and adult, we have tested t in the complete range, t P t0.0, 0.2, 0.4, 0.6, 0.8, 0.9, 1.0, 1.1u (the MF only reports results for t ě 0.6), and in all cases, models both clipped and not clipped. For each dataset, we have set a 10-fold stratiﬁed cross-validation experiment, and report the averages for readability (Table 2 in MF gives the results of a Student paired t-test on error averages for comparison, limit p-val = 0.1). We also provide two examples of training error averages for domains hillnoise and hillnonoise (Tables A10 and A12).

Implementation details of t-ADABOOST First, regarding ﬁle format, we only input a .csv ﬁle to t-ADABOOST. We do not specify a ﬁle with feature types as in ARFF ﬁles. t-ADABOOST recognizes the type of each feature from its column content and distinguishes two main types of features: numerical and categorical. The distinction is important to design the splits during decision tree induction: for numerical values, splits are midpoints between two successive observed values. For categorical, splits are partitions of the feature values in two non-empty subsets. Our implementation of t-ADABOOST (programmed in Java) makes it possible to choose t not just in the range of values for which we have shown that boosting-compliant convergence is possible (t P r0, 1s), but also t ą 1. Because we thus implement ADABOOST (t 1) but also for t ą 1, weights can fairly easily become inﬁnite, we have implemented a safe-check during training, counting the number of times the weights become inﬁnite or zero (note that in this latter case, this really is a problem just for ADABOOST because in theory this should never happen unless the weak classiﬁers achieve perfect (or perfectly wrong) classiﬁcation), but also making sure leveraging coefﬁcients for classiﬁers do not become inﬁnite for ADABOOST, a situation that can happen because of numerical approximations in encoding. In our experiments, we have observed that none of these problematic cases did occur (notice that this could not be the case if we were to boost for a large number of iterations). We have implemented algorithm t-ADABOOST exactly as speciﬁed in MF. The weak learner is implemented to train a decision tree in which the stopping criterion is the size of the tree reaching a user-ﬁxed number of nodes. There is thus no pruning. Also, the top-down induction algorithm proceeds by iteratively picking the heaviest leaf in the tree and then choosing the split that minimizes the expected Bayes risk of the tempered loss, computing using the same t values as for t-ADABOOST, and with the constraint to not get pure leaves (otherwise, the real prediction at the leaves, which relies on the

perr (not clipped) perr (clipped) min weight max weight

Table A4: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain sonar, when trained without noise (η 0.0, top row) and with noise (η 0.1, bottom row). Columns are, from left to right, the estimated true error of non-clipped and clipped models, and the min and max codensity weights. The set of t values used is displayed in each plot with a colormap (right), and varying thickness of curves for an additional ease of reading (the thicker the curve, the larger t). ADABOOST s reference results are displayed with bullets. Averages shown for readability.

link of the loss, would be inﬁnite for ADABOOST). In our implementation of decision-tree induction, when the number of possible splits exceeds a ﬁxed number S (currently, 2 000), we pick the best split in a subset of S splits picked at random.

Results First, one may notice in several plots that the average test error increases with the number of trees. This turns out to be a sign of overﬁtting, as exempliﬁed for domains hillnonoise and hillnoise, for which we provide the training curves. If we align the training curves at T 1 (the value is different because the splitting criterion for training the tree is different), we notice that the experimental convergence on training is similar for all values of t (Tables A10 and A12). The other key experimental result, already visible from Table 2 (MF), is that pretty much all tested values of t are necessary to get the best results. One could be tempted to conclude that t slightly smaller than 1.0 seems to be a good ﬁt from Table 2 (MF), but the curves show that this is more a consequence of the Table being computed for J 20 trees. The case of eeg illustrates this phenomenon best: while small t-values are clearly the best when there is no noise, the picture is completely reversed when there is training noise. Notice that this ordering is almost reversed on creditcard and adult: when there is noise, small values of t tend to give better results. Hence, in addition to getting (i) a pruning mechanism that works for all instances of the tempered loss and (ii) a way to guess the right number of models in the ensemble, a good problem to investigate is in fact appropriately tuning t in a domain-dependent way. Looking at all plots reveals that substantial gains could be obtained with an accurate procedure (over the strategy that would be to always pick a ﬁxed t, e.g. t 1).

III.3 Supplementary experiments: learning with more trees / against more noise

We have performed some additional experiments on several domains, on which we have trained bigger ensembles and / or considered more noise levels than in the previous experiments, with the objective to see if / when overﬁtting happens and how performances degrade with noise as a function of t. Table A16 summarizes the results obtained. A few comments we can make based on those experiments are:

overﬁtting can indeed happen (sonar for η 0.4) but affects differently the algorithm depending on t and whether clipped models are learned instead of regular linear models; results also display that tuning t can also have the purpose of handling overﬁtting;

perr (not clipped) perr (clipped) min weight max weight

Table A5: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain winered. Conventions follow Table A4.

perr (not clipped) perr (clipped) min weight max weight

Table A6: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain abalone. Conventions follow Table A4.

clamped models can be very useful to handle overﬁtting (sonar for η 0.4, qsar for η ě 0.2); this provides another justiﬁcation to learn clamped models; the overall diversity of curves as a function of t supports the idea that good strategies could in fact tune t at training time and change its value with iterations.

perr (not clipped) perr (clipped) min weight max weight

Table A7: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain qsar. Conventions follow Table A4.

perr (not clipped) perr (clipped) min weight max weight

Table A8: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain winewhite. Conventions follow Table A4.

perr (not clipped) perr (clipped) min weight max weight

Table A9: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain hillnonoise. Conventions follow Table A4.

training err (not clipped) training err (clipped)

Table A10: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain hillnonoise: training errors displayed for all algorithms using conventions from Table A4. See text for details.

perr (not clipped) perr (clipped) min weight max weight

Table A11: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain hillnoise. Conventions follow Table A4.

training err (not clipped) training err (clipped)

Table A12: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain hillnoise: training errors displayed for all algorithms using conventions from Table A4. See text for details.

perr (not clipped) perr (clipped) min weight max weight

Table A13: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain eeg. Conventions follow Table A4.

perr (not clipped) perr (clipped) min weight max weight

Table A14: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain creditcard. Conventions follow Table A4.

perr (not clipped) perr (clipped) min weight max weight

Table A15: Experiments on t-ADABOOST comparing with ADABOOST (t 1, bullets) on domain adult. Conventions follow Table A4.

η 0 0.1 0.2 0.3 0.4

winered, clamped

sonar, unclamped

sonar, clamped

qsar, unclamped

qsar, clamped

Table A16: Additional experiments with a larger number of trees and various levels of noise, on winered (J 150 trees, T 2 splits per tree) and sonar and qsar (J 100, T 2). Note that the range of noise levels tested is broader than for the other experiments.