# How to Boost Any Loss Function

Richard Nock, Google Research (richardnock@google.com) · Yishay Mansour, Tel Aviv University and Google Research (mansour@google.com)

**Abstract.** Boosting is a highly successful, ML-born optimization setting in which one is required to learn arbitrarily good models in a computationally efficient way, based on access to a weak learner oracle providing classifiers that perform at least slightly differently from random guessing. A key difference with gradient-based optimization is that boosting's original model does not require access to first-order information about a loss; yet the decades-long history of boosting quickly evolved it into a first-order optimization setting, sometimes even wrongfully defining it as such. Owing to recent progress extending gradient-based optimization to use only a loss's zeroth (0th) order information to learn, this begs the question: what loss functions can be efficiently optimized with boosting, and what information is really needed for boosting to meet the original boosting blueprint's requirements? We provide a constructive formal answer, essentially showing that any loss function can be optimized with boosting; boosting can thus achieve a feat not yet known to be possible in the classical 0th-order setting, since the loss functions are not required to be convex, differentiable or Lipschitz, and in fact not required to be continuous either. Some of the tools we use are rooted in quantum calculus, the mathematical field (not to be confused with quantum computation) that studies calculus without passing to the limit, and thus without using first-order information.

*38th Conference on Neural Information Processing Systems (NeurIPS 2024).*

## 1 Introduction

In ML, zeroth-order optimization has been devised as an alternative to techniques that would otherwise require access to first-order (or higher) information about the loss to minimize, such as gradient descent (stochastic or not, constrained or not, etc.; see Section 2). Such approaches replace access to a so-called oracle providing derivatives of the loss at hand, operations that can be costly or unavailable in exact form in the ML world, with access to a cheaper function value oracle, providing loss values at queried points. Zeroth-order optimization has seen a considerable boost in ML over the past years, over many settings and algorithms; yet there is one foundational ML setting with its related algorithms that, to our knowledge, has not yet been the subject of such investigations: boosting [32, 31]. The question is very relevant: boosting has quickly evolved into a technique requiring first-order information about the loss optimized [6, Section 10.3], [41, Section 7.2.2], [53]. It is also not uncommon to find boosting reduced to this first-order setting [9]. Originally, however, the boosting model did not mandate access to any first-order information about the loss; it rather required access to a weak learner providing classifiers at least slightly different from random guessing [31]. With zeroth-order optimization gaining traction in ML, it becomes crucial to understand not just whether differentiability is necessary for boosting, but more generally which loss functions can be boosted with a weak learner and, in fine, where boosting stands with respect to recent formal progress on lifting gradient descent to zeroth-order optimization.
In this paper, we settle the question: we design a formal boosting algorithm for any loss function whose set of discontinuities has zero Lebesgue measure. With traditional floating-point encoding (e.g. float64), any stored loss function would de facto meet this condition; mathematically speaking, we encompass losses that are not necessarily convex, differentiable or Lipschitz. This is a key difference with classical zeroth-order optimization results, where the algorithms are zeroth-order but their proofs of convergence make various assumptions about the loss at hand, such as convexity, differentiability (once or twice), Lipschitzness, etc. Our proof technique builds on a simple boosting technique for convex functions that relies on an order-one Taylor expansion to bound the progress between iterations [45]. Using tools from quantum calculus*, we replace this progress using v-derivatives and a quantity related to a generalisation of the Bregman information [7]. The boosting rate involves the classical weak learning assumption's advantage over random guessing and a new parameter bounding the ratio of the expected weights (squared) over a generalized notion of curvature involving v-derivatives. Our algorithm, which learns a linear model, introduces notable generalisations compared to the AdaBoost / gradient boosting lineages, chief among which is the computation of acceptable offsets for the v-derivatives used to compute the boosting weights, those offsets being zero for classical gradient boosting. To preserve readability and save space, all proofs and additional information are postponed to an Appendix.

## 2 Related work

Over the past years, ML has seen a substantial push to get the cheapest optimisation routines, in general batch [14], online [27], distributed [3], adversarial [20, 18] or bandit settings [2], or in more specific settings like projection-free [26, 28, 51] or saddle-point optimisation [25, 38]. We summarize several dozen recent references in Table 1 (Appendix, Section A) in terms of the assumptions their analyses place on the loss optimized. Zeroth-order optimization reduces the information available to the learner to the "cheapest" kind, namely (loss) function values, usually via a so-called function value oracle. However, as Table 1 shows, the loss itself is always assumed to have some form of "niceness" to study the algorithms' convergence, such as differentiability, Lipschitzness, convexity, etc. Another quite remarkable phenomenon is that, throughout all their diverse settings and frameworks, not a single one of them addresses boosting. Boosting is however a natural candidate for such investigations, for two reasons. First, the most widely used boosting algorithms are hungry for first-order information [6, 41, 53]: they require access to derivatives to compute the examples' weights and the classifiers' leveraging coefficients. Second, and perhaps most importantly, unlike other optimization techniques like gradient descent, the original boosting model does not mandate access to a first-order information oracle to learn, but rather to a weak learning oracle which supplies classifiers performing slightly differently from random guessing [32, 31]. Only a few approaches exist that aim for "cheaper" algorithms relying on fewer assumptions about the loss at hand, and to our knowledge they do not have boosting-compliant convergence proofs, for example when alleviating convexity [16, 46] or access to gradients of the loss [54].
Such questions are however important given the early negative results on boosting convex potentials with first-order information [37] and the role of the classifiers in those negative results [39]. Finally, we note that a rich literature has also developed in mathematics for derivative-free optimisation [34], yet its methods often rely on assumptions among the three above as well (e.g. [42]). It must be noted, however, that derivative-free optimisation has been implemented on computers for more than seven decades [24].

## 3 Definitions and notations

The following shorthands are used: $[n] \doteq \{1, 2, \ldots, n\}$ for $n \in \mathbb{N}_*$; $z \cdot [a, b] \doteq [\min\{za, zb\}, \max\{za, zb\}]$ for $z \in \mathbb{R}$, $a \le b \in \mathbb{R}$. In the batch supervised learning setting, one is given a training set of $m$ examples $\mathcal{S} \doteq \{(\boldsymbol{x}_i, y_i), i \in [m]\}$, where $\boldsymbol{x}_i \in \mathcal{X}$ is an observation ($\mathcal{X}$ is called the domain: often, $\mathcal{X} \subseteq \mathbb{R}^d$) and $y_i \in \mathcal{Y} \doteq \{-1, 1\}$ is a label, or class. We study the empirical convergence of boosting, which requires fast convergence on training. Such a setting is standard in zeroth-order optimization [42]. Also, investigating generalization would entail specific design choices about the loss at hand and would thus restrict the scope of our result (see e.g. [8]).

*Calculus "without limits" [30] (thus without using derivatives), not to be confused with calculus on quantum devices.

Figure 1: Left: value of $S_{F|v}(z'\|z)$ for convex $F$, $v \doteq z_4 - z$ and various $z'$ (colors), for which the Bregman Secant distortion is positive ($z' = z_1$, green), negative ($z' = z_2$, red), minimal ($z' = z_3$) or null ($z' \in \{z_4, z\}$). Right: depiction of $Q_F(z, z+v, z')$ for non-convex $F$ (Definition 4.6).

The objective is to learn a classifier, i.e. a function $h : \mathcal{X} \to \mathbb{R}$ belonging to a given set $\mathcal{H}$. The goodness of fit of some $h$ on $\mathcal{S}$ is evaluated by a given function $F : \mathbb{R} \to \mathbb{R}$ called a loss function, whose expectation on training is sought to be minimized: $F(\mathcal{S}, h) \doteq \mathbb{E}_{i\sim[m]}[F(y_i h(\boldsymbol{x}_i))]$. The set of most popular losses comprises convex functions: the exponential loss ($F_{\mathrm{EXP}}(z) \doteq \exp(-z)$), the logistic loss ($F_{\mathrm{LOG}}(z) \doteq \log(1 + \exp(-z))$), the square loss ($F_{\mathrm{SQ}}(z) \doteq (1 - z)^2$), the Hinge loss ($F_{\mathrm{H}}(z) \doteq \max\{0, 1 - z\}$). These are surrogate losses because they all define upper bounds of the 0/1-loss ($F_{0/1}(z) \doteq 1_{z \le 0}$, $1_\cdot$ being the indicator variable). Our ML setting is that of boosting [31]. It consists in having primary access to a weak learner WL that, when called, provides so-called weak hypotheses, weak because barely anything is assumed about their classification performance relative to the sample over which they were trained. Our goal is to devise a so-called "boosting" algorithm that can take any loss $F$, a training sample $\mathcal{S}$ and a target loss value $F_*$ as input and, after some $T$ calls to the weak learner, crafts a classifier $H_T$ satisfying $F(\mathcal{S}, H_T) \le F_*$, where $T$ depends on various parameters of the ML problem. Our boosting architecture is a linear model: $H_T \doteq \sum_t \alpha_t h_t$, where each $h_t$ is an output of the weak learner and the leveraging coefficients $\alpha_t$ have to be computed during boosting. Notice that this is substantially more general than the classical boosting formulation, where the loss would be fixed or belong to a restricted subset of functions.

## 4 v-derivatives and Bregman Secant distortions

Unless otherwise stated, in this Section, $F$ is a function defined over $\mathbb{R}$.

**Definition 4.1.** [30] For any $z, v \in \mathbb{R}$, we let $\delta_v F(z) \doteq (F(z + v) - F(z))/v$ denote the v-derivative of $F$ at $z$. This expression, which gives the classical derivative as the offset $v \to 0$, is called the h-derivative in quantum calculus [30, Chapter 1].
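To make Definition 4.1 concrete, here is a minimal numerical sketch (in Python; no code of the sort appears in the paper): it evaluates the v-derivative of the logistic loss from two loss values only, and shows it approaching the classical derivative as the offset shrinks.

```python
# A minimal sketch of the v-derivative (Definition 4.1): the slope of the
# secant of F through abscissae z and z+v; no derivative of F is ever taken.
import math

def v_derivative(F, z, v):
    """v-derivative of F at z with offset v != 0."""
    assert v != 0.0, "the offset must be non-zero: no passage to the limit"
    return (F(z + v) - F(z)) / v

if __name__ == "__main__":
    F_log = lambda z: math.log(1.0 + math.exp(-z))  # logistic loss
    # as v -> 0 the v-derivative approaches F'(0) = -1/2 for the logistic loss:
    for v in (1.0, 0.1, 0.001):
        print(v, v_derivative(F_log, 0.0, v))
```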
We renamed the offset (h in quantum calculus) to avoid any risk of confusion with classifiers. Notice that the v-derivative is just the slope of the secant passing through the points $(z, F(z))$ and $(z+v, F(z+v))$ (Figure 1). Higher-order v-derivatives can be defined with the same offset used several times [30]. Here, we shall need a more general definition that accommodates variable offsets.

**Definition 4.2.** Let $v_1, v_2, \ldots, v_n \in \mathbb{R}_*$, $V \doteq \{v_1, v_2, \ldots, v_n\}$ and $z \in \mathbb{R}$. The V-derivative $\delta_V F$ is:
$$\delta_V F(z) \doteq \begin{cases} F(z) & \text{if } V = \emptyset \\ \delta_{v_1} F(z) & \text{if } V = \{v_1\} \\ \delta_{\{v_n\}}\big(\delta_{V \setminus \{v_n\}} F\big)(z) & \text{otherwise} \end{cases}.$$
If $v_i = v, \forall i \in [n]$, then we write $\delta_v^{(n)} F(z) \doteq \delta_V F(z)$.

In the Appendix, Lemma B.1 computes the unravelled expression of $\delta_V F(z)$, showing that the order of the elements in $V$ does not matter; $n$ is called the order of the V-derivative. We can now define a generalization of Bregman divergences called Bregman Secant distortions.

**Definition 4.3.** For any $z, z', v \in \mathbb{R}$, the Bregman Secant distortion $S_{F|v}(z'\|z)$ with generator $F$ and offset $v$ is: $S_{F|v}(z'\|z) \doteq F(z') - F(z) - (z' - z)\,\delta_v F(z)$.

Even if $F$ is convex, the distortion is not necessarily positive, though it is lower-bounded (Figure 1). There is an intimate relationship between Bregman Secant distortions and Bregman divergences. We shall use a definition slightly more general than the original one for differentiable $F$ [11, eq. (1.4)], introduced in information geometry [5, Section 3.4] and recently reintroduced in ML [10].

**Definition 4.4.** The Bregman divergence with generator $F$ (scalar, convex) between $z'$ and $z$ is $D_F(z'\|z) \doteq F(z') + F^\star(z) - z'z$, where $F^\star(z) \doteq \sup_t \{tz - F(t)\}$ is the convex conjugate of $F$.

We state the link between $S_{F|v}$ and $D_F$ (proof omitted).

**Lemma 4.5.** Suppose $F$ strictly convex and differentiable. Then $\lim_{v \to 0} S_{F|v}(z'\|z) = D_F(z'\|F'(z))$.

Relaxed forms of Bregman divergences have been introduced in information geometry [43].

**Definition 4.6.** For any $a, b, \alpha \in \mathbb{R}$, denote for short $I_{a,b} \doteq [\min\{a,b\}, \max\{a,b\}]$ and $(uv)_\alpha \doteq \alpha u + (1-\alpha)v$. The Optimal Bregman Information (OBI) of $F$ defined by the triple $(a, b, c) \in \mathbb{R}^3$ is:
$$Q_F(a, b, c) \doteq \max_{\alpha : (ab)_\alpha \in I_{a,c}} \left\{ (F(a)F(b))_\alpha - F((ab)_\alpha) \right\}. \tag{1}$$

As represented in Figure 1 (right), the OBI is obtained by drawing the line passing through $(a, F(a))$ and $(b, F(b))$ and then, over the interval $I_{a,c}$, looking for the maximal difference between the line and $F$. We note that $Q_F$ is non-negative because $a \in I_{a,c}$ and, for the choice $\alpha = 1$, the argument of the max in (1) is 0. We also note that when $F$ is convex, the RHS is indeed the maximal Bregman information of two points in [7, Definition 2], where maximality is obtained over the probability measure. The following Lemma follows from the definition of the Bregman Secant distortion and the OBI; an inspection of the functions in Figure 1 provides a graphical proof.

**Lemma 4.7.** For any $F$, $\forall z, v, z' \in \mathbb{R}$,
$$S_{F|v}(z'\|z) \ge -Q_F(z, z+v, z'), \tag{2}$$
and if $F$ is convex,
$$\forall z, v \in \mathbb{R}, \forall z' \notin I_{z,z+v}: \; S_{F|v}(z'\|z) \ge 0; \qquad \forall z, v, z' \in \mathbb{R}: \; S_{F|v}(z'\|z) \ge -Q_F(z, z+v, z+v). \tag{3}$$

We shall abbreviate the two possible forms of OBI in the RHS of (2), (3) as:
$$\tilde{Q}_F(z, z', v) \doteq \begin{cases} Q_F(z, z+v, z+v) & \text{if } F \text{ convex} \\ Q_F(z, z+v, z') & \text{otherwise} \end{cases}. \tag{4}$$
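The following sketch illustrates Definitions 4.3 and 4.6 and Lemma 4.7 numerically; the grid discretization of the max in (1) is our simplification, not part of the paper.

```python
# Grid-based sketch of the OBI (Definition 4.6) and of the Bregman Secant
# distortion (Definition 4.3). The grid resolution is an assumption of this
# illustration only.
import numpy as np

def obi(F, a, b, c, grid=10_001):
    """Q_F(a, b, c): maximal gap between the chord through (a, F(a)), (b, F(b))
    and F itself, over abscissae in I_{a,c} = [min(a,c), max(a,c)]."""
    assert a != b
    xs = np.linspace(min(a, c), max(a, c), grid)
    slope = (F(a) - F(b)) / (a - b)
    chord = F(b) + slope * (xs - b)
    return float(np.max(chord - F(xs)))  # >= 0 since the gap is 0 at x = a

def bregman_secant(F, zp, z, v):
    """S_{F|v}(z' || z) = F(z') - F(z) - (z' - z) * delta_v F(z)."""
    return F(zp) - F(z) - (zp - z) * (F(z + v) - F(z)) / v

if __name__ == "__main__":
    F = lambda x: np.log1p(np.exp(-x))     # convex example (logistic loss)
    z, v, zp = 0.0, 1.5, -2.0
    # Lemma 4.7: the secant distortion is lower-bounded by -Q_F(z, z+v, z'):
    print(bregman_secant(F, zp, z, v), -obi(F, z, z + v, zp))
```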
## 5 Boosting using only queries on the loss

We make the assumption that the predictions of the "weak classifiers" are finite and non-zero on training, without loss of generality (otherwise a simple tweak ensures it without breaking the weak learning framework; see Appendix, Section B.2). Excluding 0 ensures our algorithm does not make use of derivatives.

**Assumption 5.1.** $\forall t > 0, \forall i \in [m], |h_t(\boldsymbol{x}_i)| \in (0, +\infty)$ (we thus let $M_t \doteq \max_i |h_t(\boldsymbol{x}_i)|$).

For short, we define two edge quantities for $i \in [m]$ and $t = 1, 2, \ldots$:
$$\epsilon_{ti} \doteq \alpha_t \cdot y_i h_t(\boldsymbol{x}_i), \qquad e_{ti} \doteq y_i H_t(\boldsymbol{x}_i), \tag{5}$$
where $\alpha_t$ is the leveraging coefficient of weak classifier $h_t$ in the ensemble $H_T(\cdot) \doteq \sum_{t \in [T]} \alpha_t h_t(\cdot)$. We observe $e_{ti} = e_{(t-1)i} + \epsilon_{ti}$.

**Algorithm 1** SECBOOST($\mathcal{S}$, $T$) // red boxes in the original pinpoint substantial differences with "classical" boosting

Input: sample $\mathcal{S} = \{(\boldsymbol{x}_i, y_i), i = 1, 2, \ldots, m\}$, number of iterations $T$, initial $(h_0, v_0)$ (constant classification and offset).
Step 1: let $H_0 \leftarrow 1 \cdot h_0$ and $w_1 \leftarrow -\delta_{v_0} F(h_0) \cdot \boldsymbol{1}$; // $h_0, v_0 \ne 0$ chosen s.t. $\delta_{v_0} F(h_0) \ne 0$
Step 2: for $t = 1, 2, \ldots, T$:
  Step 2.1: let $h_t \leftarrow \mathrm{WL}(\mathcal{S}_t, |w_t|)$; // weak learner call, $\mathcal{S}_t \doteq \{(\boldsymbol{x}_i, y_i \cdot \mathrm{sign}(w_{ti}))\}$
  Step 2.2: let $\eta_t \leftarrow (1/m)\sum_i w_{ti} y_i h_t(\boldsymbol{x}_i)$; // unnormalized edge
  Step 2.3: if a bound on $W_{2,t}$ is available (Section 5.3), pick $\epsilon_t > 0$, $\pi_t \in (0,1)$ and
$$\alpha_t \in \frac{\eta_t}{2(1+\epsilon_t) M_t^2 W_{2,t}} \cdot [1 - \pi_t, 1 + \pi_t]; \tag{6}$$
  otherwise (general procedure) let $\alpha_t \leftarrow \mathrm{SOLVE}\alpha(\mathcal{S}, w_t, h_t)$, which also yields $W_{2,t} > 0$, $\epsilon_t > 0$, $\pi_t \in (0,1)$; // Theorem 5.8
  Step 2.4: let $H_t \leftarrow H_{t-1} + \alpha_t \cdot h_t$; // classifier update
  Step 2.5: if $\mathcal{I}_{ti}(\epsilon_t \alpha_t^2 M_t^2 W_{2,t}) \ne \emptyset, \forall i \in [m]$, then // new offsets
    for $i = 1, 2, \ldots, m$, let $v_{ti} \leftarrow \mathrm{OO}(t, i, \epsilon_t \alpha_t^2 M_t^2 W_{2,t})$; else return $H_t$;
  Step 2.6: for $i = 1, 2, \ldots, m$, let // weight update
$$w_{(t+1)i} \leftarrow -\delta_{v_{ti}} F(y_i H_t(\boldsymbol{x}_i)); \tag{7}$$
  Step 2.7: if $w_{t+1} = \boldsymbol{0}$ then break;
Return $H_T$.

### 5.1 Algorithm: SECBOOST

#### 5.1.1 General steps

Without further ado, Algorithm SECBOOST presents our approach to boosting without using derivative information. The key differences with traditional boosting algorithms are framed in red. We summarize its key steps.

Step 1. This is the initialization step. Traditionally in boosting, one would pick $h_0 = 0$. Note that $w_1$ is not necessarily positive. $v_0$ is the initial offset (Section 4).

Step 2.1. This step calls the weak learner, as in traditional boosting, using variable "weights" on examples (the coordinate-wise absolute value of $w_t$, denoted $|w_t|$). The key difference with traditional boosting is that the examples' labels can switch between iterations as well, which explains why the training sample $\mathcal{S}_t$ is indexed by the iteration number.

Step 2.3. This step computes the leveraging coefficient $\alpha_t$ of the weak classifier $h_t$. It involves a quantity, $W_{2,t}$, which we define as any strictly positive real satisfying
$$\left| \mathbb{E}_{i\sim[m]}\left[ \delta_{\{\epsilon_{ti},\, v_{(t-1)i}\}} F(e_{(t-1)i}) \cdot \left(\frac{h_t(\boldsymbol{x}_i)}{M_t}\right)^2 \right] \right| \le W_{2,t}. \tag{8}$$
For the boosting rate's sake, we should find $W_{2,t}$ as small as possible. We refer to (5) for the $\epsilon_\cdot, e_\cdot$ notations; $v_\cdot$ is the current (set of) offset(s) (see Section 4 for their definition). The second-order V-derivative in the LHS plays the same role as the second-order derivative in classical boosting rates; see for example [45, Appendix, eq. 29]. As the offsets $\to 0$, it converges to a second-order derivative; otherwise, the two still share some properties, such as the sign for convex functions.

**Lemma 5.2.** Suppose $F$ convex. For any $a \in \mathbb{R}$, $b, c \in \mathbb{R}_*$, $\delta_{\{b,c\}} F(a) \ge 0$. (Proof in Appendix, Section B.3)

We can also see a link with weight variations since, modulo a slight abuse of notation, we have $\delta_{\{\epsilon_{ti},\, v_{(t-1)i}\}} F(e_{(t-1)i}) = -\delta_{\epsilon_{ti}} w_{ti}$. A substantial difference with traditional boosting algorithms is that we have two ways to pick the leveraging coefficient $\alpha_t$; the first can be used when a convenient $W_{2,t}$ is directly accessible from the loss. Otherwise, a simple algorithm provides parameters (including $W_{2,t}$) such that (8) is satisfied. Section 5.3 details those two possibilities and their implementation. In the more favorable case (the former one), $\alpha_t$ can be chosen in an interval, furthermore defined by flexible parameters $\epsilon_t > 0$, $\pi_t \in (0,1)$.
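Before the remaining step-by-step commentary, the following compact sketch transcribes the loop of Algorithm 1 under simplifying assumptions: a loss-wide bound `W2` on (8) is available and $\pi_t \to 0$; `weak_learner` and `offset_oracle` are hypothetical stubs of ours, not the paper's code, and `F` is assumed vectorized.

```python
# A sketch of SecBoost's main loop. We track the margins H_t(x_i) through the
# edge vector e = y * H(x); only loss *values* of F are ever queried.
import numpy as np

def secboost(F, X, y, weak_learner, offset_oracle, T, h0=0.1, v0=0.5,
             W2=1.0, eps=0.1):
    m = len(y)
    e = y * h0                                    # edges e_{0i}, with H_0 = h0
    w = np.full(m, -(F(h0 + v0) - F(h0)) / v0)    # Step 1: w_1 = -delta_{v0}F(h0).1
    ensemble = []
    for t in range(1, T + 1):
        h = weak_learner(X, np.abs(w), y * np.sign(w))     # Step 2.1
        hx = h(X)
        eta = np.mean(w * y * hx)                          # Step 2.2
        Mt = np.max(np.abs(hx))
        alpha = eta / (2.0 * (1.0 + eps) * Mt**2 * W2)     # Step 2.3 (pi_t -> 0)
        e_prev, e = e, e + alpha * y * hx                  # Step 2.4, as edges
        budget = eps * alpha**2 * Mt**2 * W2               # Step 2.5's OBI budget
        v = offset_oracle(F, e, e_prev, budget)            # offsets v_{ti}
        if v is None:                                      # some I_{ti} empty
            break
        w = -(F(e + v) - F(e)) / v                         # Step 2.6: loss values only
        ensemble.append((alpha, h))
        if np.all(w == 0.0):                               # Step 2.7
            break
    return ensemble
```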
Note that fixing these parameters beforehand is not mandatory: we can also pick any
$$\alpha_t \in \eta_t \cdot \left(0, \frac{1}{M_t^2 W_{2,t}}\right) \tag{9}$$
and then compute choices for the corresponding $\epsilon_t$ and $\pi_t$. $\epsilon_t$ is important for the algorithm, and both parameters are important for the analysis of the boosting rate. From the boosting standpoint, a smaller $\epsilon_t$ yields a larger $\alpha_t$, and a smaller $\pi_t$ reduces the interval of values in which we can pick $\alpha_t$; both cases tend to favor better convergence rates, as seen in Theorem 5.3.

Step 2.4 is just the crafting of the final model.

Step 2.5 is new to boosting: the use of a so-called offset oracle, detailed in Section 5.1.2.

Step 2.6. The weight update does not rely on a first-order oracle as in traditional boosting, but uses only loss values through v-derivatives. The finiteness of $F$ implies the finiteness of the weights.

Step 2.7. Early stopping happens if all weights are null. While this would never happen with traditional (e.g. strictly convex) losses, some losses that are unusual in the context of boosting can lead to early stopping. A discussion of early stopping and how to avoid it is in Section 6.

#### 5.1.2 The offset oracle, OO

Let us introduce the notation
$$\mathcal{I}_{ti}(z) \doteq \left\{ v : \tilde{Q}_F(e_{ti}, e_{(t-1)i}, v) \le z \right\}, \quad \forall i \in [m], \forall z > 0 \tag{10}$$
(see Figure 3 below to visualize $\mathcal{I}_{ti}(z)$ for a non-convex $F$). The offset oracle is used in Step 2.5, which is new to boosting: the offsets needed to carry out the weight update in (7) are requested from an offset oracle, which achieves the following, for iteration #$t$, example #$i$ and limit OBI $z$:
$$\mathrm{OO}(t, i, z) \text{ returns some } v \in \mathcal{I}_{ti}(z). \tag{11}$$
Note that the offset oracle has the freedom to pick the offset in a whole set. Section 5.4 investigates implementations of the offset oracle, so let us only make a few essentially graphical remarks here. OO does not need to build the whole $\mathcal{I}_{ti}(z)$ to return some $v \in \mathcal{I}_{ti}(z)$ for Step 2.5 of SECBOOST: in the construction steps of Figure 3, as soon as $\mathcal{O} \ne \emptyset$, one element of $\mathcal{O}$ can be returned. Figure 4 presents more examples of $\mathcal{I}_{ti}(z)$. One can remark that the sign of the offset $v_{ti}$ in Step 2.5 of SECBOOST is the same as the sign of $e_{(t-1)i} - e_{ti} = -y_i\alpha_t h_t(\boldsymbol{x}_i)$. Hence, unless $F$ is differentiable or all edges $y_i h_t(\boldsymbol{x}_i)$ are of the same sign ($\forall i$), the set of offsets returned in Step 2.5 always contains at least two different offsets, one non-negative and one non-positive (Figure 4, (a-b)).
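A brute-force reading of (10)-(11), reusing the `obi` sketch given after Definition 4.6 and the general (non-convex) branch of (4); the candidate grid and the fallback when no candidate passes are assumptions of this illustration.

```python
# A generic sketch of the offset oracle: per example, scan candidate offsets on
# the side given by the sign remark above and return the first one whose OBI
# fits the budget z. Returns None if some I_{ti}(z) looks empty (SecBoost then
# returns H_t at Step 2.5).
import numpy as np

def offset_oracle(F, e_new, e_old, z, n_cand=400):
    out = np.empty(len(e_new))
    for i, (en, eo) in enumerate(zip(e_new, e_old)):
        d = np.sign(eo - en) if eo != en else 1.0
        span = max(abs(eo - en), 1.0)
        for v in d * span * np.linspace(1.0, 1e-3, n_cand):
            if obi(F, en, en + v, eo) <= z:    # v in I_{ti}(z), cf. (10)
                out[i] = v
                break
        else:
            return None
    return out
```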
### 5.2 Convergence of SECBOOST

The offset oracle has a technical importance for boosting: $\mathcal{I}_{ti}(z)$ is the set of offsets that limit an OBI for a training example (Definition 4.6). The importance for boosting comes from Lemma 4.7: upper-bounding an OBI implies lower-bounding a Bregman Secant distortion, which in turn guarantees a sufficient slack between two successive boosting iterations. This is embedded in a blueprint of a proof technique to show boosting-compliant convergence which is not new; see e.g. [45]. We now detail this convergence. Remark that the expected edge $\eta_t$ in Step 2.2 of SECBOOST is not normalized. We define a normalized version of this edge as:
$$[-1,1] \ni \tilde{\eta}_t \doteq \sum_i \frac{|w_{ti}|}{W_t}\,\tilde{y}_{ti}\,\frac{h_t(\boldsymbol{x}_i)}{M_t}, \quad \text{with } \tilde{y}_{ti} \doteq y_i \cdot \mathrm{sign}(w_{ti}), \; W_t \doteq \sum_i |\delta_{v_{(t-1)i}} F(e_{(t-1)i})|. \tag{12}$$
Remark that the labels are corrected by the weight sign and thus may switch between iterations. In the particular case where the loss is non-increasing (such as for traditional convex surrogates), the labels do not switch. We also need a quantity which is, in absolute value, the expected weight:
$$W_{1,t} \doteq \left| \mathbb{E}_{i\sim[m]}\left[ \delta_{v_{(t-1)i}} F(e_{(t-1)i}) \right] \right| \quad \text{(we indeed observe } W_{1,t} = |\mathbb{E}_{i\sim[m]}[w_{ti}]|\text{)}. \tag{13}$$

In classical boosting for convex decreasing losses, the weights are non-negative and converge to a minimum (typically 0) as examples get the right class with increasing confidence. Thus, $W_{1,t}$ can be an indicator of when classification becomes "good enough" to stop boosting. In our more general setting, it shall be used in a similar indicator. We are now in a position to show a first result about SECBOOST.

**Theorem 5.3.** Suppose Assumption 5.1 holds. Let $F_0 \doteq F(\mathcal{S}, h_0)$ in SECBOOST and $z_*$ any real such that $F(z_*) \le F_0$. Then the classifier $H_T$ output by SECBOOST is guaranteed to satisfy $F(\mathcal{S}, H_T) \le F(z_*)$ when the number of boosting iterations $T$ yields:
$$\sum_{t\in[T]} \frac{W_{1,t}^2\,(1 - \pi_t^2)}{W_{2,t}\,(1+\epsilon_t)} \cdot \tilde{\eta}_t^2 \ge 4(F_0 - F(z_*)), \tag{14}$$
where the parameters $\epsilon_t, \pi_t$ appear in Step 2.3 of SECBOOST. (Proof in Appendix, Section B.4)

We observe the tradeoff between the freedom in picking parameters and the convergence guarantee exposed by (14): to get more freedom in picking the leveraging coefficient $\alpha_t$, we typically need $\pi_t$ large (Step 2.3), and to get more freedom in picking the offsets $v_{t\cdot} \ne 0$, we typically need $\epsilon_t$ large (Step 2.5). However, allowing more freedom in these ways reduces the LHS and thus impairs the guarantee in (14). There is therefore a subtle balance between "freedom" of choice and convergence. This balance becomes clearer as boosting compliance formally enters the convergence requirement.

**Boosting-compliant convergence.** We characterize convergence in the boosting framework, which shall include the traditional weak learning assumption.

**Assumption 5.4.** (γ-Weak Learning Assumption, γ-WLA) We assume the following on the weak learner: $\exists \gamma > 0$ such that $\forall t > 0$, $|\tilde{\eta}_t| \ge \gamma$.

As is usually the case in boosting, the weights are normalized in the weak learning assumption (12), so the minimization "potential" of the loss does not depend on the absolute scale of the weights. This is not surprising because the loss is "nice" in classical boosting: a large $\gamma$ guarantees that most examples' edges move to the right of the x-axis after the classifier update, which, because the loss is strictly decreasing (exponential loss, logistic loss, etc.), is sufficient to yield a smaller expected loss. In our case this is not true anymore: there could for example be a local bump in the loss that would make it increase after the update. This is not even a pathological example: one may imagine that instead of a single bump the loss jiggles a lot locally. How can we keep boosting operating in such cases? A sufficient condition takes the form of a second assumption that also integrates weights, ensuring that the variation of the weights is locally not too large compared to the (unnormalized) weights, which is akin to comparing local first- and second-order variations of the loss in the differentiable case. We encapsulate this notion in what we call a weight regularity assumption.

**Assumption 5.5.** (ρ-Weight Regularity Assumption, ρ-WRA) Let $\rho_t \doteq W_{1,t}^2 / W_{2,t}$. We assume there exists $\rho > 0$ such that $\forall t \ge 1$, $\rho_t > \rho$.

In Figure 2 we present a(n overly) simplified depiction of the cases where $W_{2,t}$ is large for "not nice" losses, and two workarounds to keep it small enough for the WRA to hold. Keep in mind that $W_{1,t}$ is an expected local variation of the loss ((13), (5)), so as it goes to zero, boosting converges to a local minimum and it is reasonable to expect the WRA to break.
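The quantities governing Theorem 5.3 and Assumptions 5.4-5.5 are all directly computable from loss values; the following sketch transcribes (12), (13) and $\rho_t$ (the bound `W2` is assumed given, e.g. from Section 5.3).

```python
# Diagnostics for a SecBoost iteration: normalized edge (12), expected-weight
# magnitude (13) and the WRA ratio rho_t of Assumption 5.5.
import numpy as np

def boosting_diagnostics(w, y, hx, W2):
    Wt = np.sum(np.abs(w))                     # total weight mass
    Mt = np.max(np.abs(hx))
    eta_tilde = np.sum((np.abs(w) / Wt) * (y * np.sign(w)) * (hx / Mt))  # (12)
    W1 = abs(np.mean(w))                       # (13): |E_i[w_ti]|
    rho = W1**2 / W2                           # WRA requires rho > rho_* > 0
    return eta_tilde, W1, rho
```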
Away from such local minima, there are two strategies to keep $W_{2,t}$ small enough for the WRA to hold: either we pick small enough offsets, which essentially works for most losses but makes us converge in general to a local minimum (this is in essence our experimental choice), or we optimize the offset oracle so that it sometimes "passes" local jiggling (Figure 2 (d)). While this eventually requires tuning the weak learner jointly with the offset oracle, and fine-tuning the latter algorithm on a loss-dependent basis, such a strategy can be used to eventually pass local minima of the loss. To do so, "larger" offsets directly translate into corresponding requests for larger-magnitude classification from the next weak classifier, on the related examples. We are now in a position to state a simple corollary to Theorem 5.3; it applies in particular to an important class of losses since it encompasses the convex surrogates of symmetric proper losses [44, 49].

Figure 2: Simplified depiction of the $W_{2,t}$ "regimes" (Assumption 5.5); panels: (a) $W_{2,t}$ small, (b) $W_{2,t}$ can be large, (c) workaround 1, (d) workaround 2. We only plot the components of the v-derivative part in (8): removing index $i$ for readability, we get $\delta_{\{\epsilon_t, v_{t-1}\}} F(e_{t-1}) = (B_t - A_t)/(yH_t(\boldsymbol{x}) - yH_{t-1}(\boldsymbol{x}))$ with $A_t \doteq \delta_{v_{t-1}} F(yH_{t-1}(\boldsymbol{x})) = -w_t$ and $B_t \doteq \delta_{v_{t-1}} F(yH_t(\boldsymbol{x}))$ ($= -w_{t+1}$ iff $v_{t-1} = v_t$). If the loss is "nice" like the exponential or logistic losses, we always have a small $W_{2,t}$ (a). Place a bump in the loss (b-d) and the risk is that $W_{2,t}$ becomes too large for the WRA to hold. The workarounds cover two strategies: picking small enough offsets (c) or fitting offsets large enough to pass the bump (d). The blue arrow in (d) is discussed in Section 6.

**Algorithm 2** SOLVEα($\mathcal{S}$, $w$, $h$)

Input: sample $\mathcal{S} = \{(\boldsymbol{x}_i, y_i), i = 1, 2, \ldots, m\}$, $w \in \mathbb{R}^m$, $h : \mathcal{X} \to \mathbb{R}$.
Step 1: find any $a > 0$ such that
$$\frac{|\eta(w, h) - \eta(\tilde{w}(\mathrm{sign}(\eta(w,h)) \cdot a), h)|}{|\eta(w,h)|} < 1. \tag{16}$$
Return $\mathrm{sign}(\eta(w,h)) \cdot a$.

**Corollary 5.6.** Suppose Assumptions 5.1, 5.5 and 5.4 hold. Let $F_0 \doteq F(\mathcal{S}, h_0)$ in SECBOOST and $z$ any real such that $F(z) \le F_0$. If SECBOOST is run for a number $T$ of iterations satisfying
$$T \ge \frac{4(F_0 - F(z))}{\gamma^2 \rho} \cdot \frac{1 + \max_{t\in[T]} \epsilon_t}{1 - \max_{t\in[T]} \pi_t^2}, \tag{15}$$
then $F(\mathcal{S}, H_T) \le F(z)$. We remark that the dependency in $\gamma$ is optimal [4].

### 5.3 Finding $W_{2,t}$

There is a lot of freedom in the choice of $\alpha_t$ in Step 2.3 of SECBOOST, and even more if we look at (9). This, however, requires access to some bound $W_{2,t}$. In the general case, the quantity it upper-bounds in (8) also depends on $\alpha_t$, because $\epsilon_{ti} \doteq \alpha_t \cdot y_i h_t(\boldsymbol{x}_i)$. So unless we can obtain a "simple" $W_{2,t}$ that does not depend on $\alpha_t$, (6) and (9) provide a system to solve for $\alpha_t$.
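A sketch of SOLVEα using the dichotomic search suggested right after Theorem 5.8: shrink $a > 0$ until (16) holds. The helper `partial_weights` transcribes (17) from loss values only; all names here are ours, not the paper's code.

```python
# Sketch of Algorithm 2 (SOLVE_alpha) via halving. F is assumed vectorized;
# yH_prev holds the current edges y_i H_{t-1}(x_i), v_prev the current offsets.
import numpy as np

def solve_alpha(F, yH_prev, y, hx, v_prev, max_halvings=100):
    def eta(w):                        # edge eta(w, h) = E_i[w_i y_i h(x_i)]
        return np.mean(w * y * hx)
    def partial_weights(alpha):        # (17): w~_ti(alpha), loss values only
        arg = alpha * y * hx + yH_prev
        return -(F(arg + v_prev) - F(arg)) / v_prev
    w = partial_weights(0.0)           # current weights w_t
    e0 = eta(w)                        # = eta_t, non-zero under the WLA
    a = 1.0
    for _ in range(max_halvings):
        if abs(e0 - eta(partial_weights(np.sign(e0) * a))) < abs(e0):  # (16)
            return np.sign(e0) * a
        a /= 2.0
    return np.sign(e0) * a             # continuity of F makes small a work
```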
**$W_{2,t}$ via properties of $F$.** Classical assumptions on loss functions for zeroth-order optimization can provide simple expressions for $W_{2,t}$ (Table 1). Consider smoothness: we say that $F$ is β-smooth if it is differentiable and its derivative satisfies the Lipschitz condition $|F'(z') - F'(z)| \le \beta|z' - z|$, $\forall z, z'$ [12]. Notice that this implies the corresponding condition on the v-derivative of the derivative: $|\delta_v F'(z)| \le \beta$, $\forall z, v$. This also provides a straightforward, useful expression for $W_{2,t}$.

**Lemma 5.7.** Suppose the loss $F$ is β-smooth. Then we can fix $W_{2,t} = 2\beta$. (Proof in Appendix, Section B.5)

What the Lemma shows is that a bound on the v-derivative of the derivative implies a bound on order-2 V-derivatives (in the quantity that $W_{2,t}$ bounds in (8)). Such a condition on v-derivatives is thus weaker than a condition on derivatives, and it is strictly weaker if we impose a strictly positive lower bound on the offsets' absolute values, which would be sufficient to characterize the boosting convergence of SECBOOST.

**A general algorithm for $W_{2,t}$.** If we cannot make any assumption on $F$, there is a simple way to first obtain $\alpha_t$ and then $W_{2,t}$, from which all other parameters of Step 2.3 can be computed. We first need a few definitions. We first generalize the edge notation appearing in Step 2.2:
$$\eta(w, h) \doteq \mathbb{E}_{i\sim[m]}[w_i y_i h(\boldsymbol{x}_i)],$$

Figure 3: A simple way to build $\mathcal{I}_{ti}(z)$ for a discontinuous loss $F$ ($e_{ti} < e_{(t-1)i}$ and $z$ are represented), $\mathcal{O}$ being the set of solutions as it is built. We rotate two half-lines, one passing through $(e_{ti}, F(e_{ti}))$ (thick line, (ℓ)) and a parallel one translated by $z$ (dashed line) (a). As soon as (ℓ) crosses $F$ at any point $(z_1, F(z_1))$ with $z_1 \ne e_{ti}$ while the dashed line stays below $F$, we obtain a candidate offset $v$ for OO, namely $v = z_1 - e_{ti}$. In (b), we obtain an interval of values. We keep on rotating (ℓ), eventually making several intervals appear for the choice of $v$ if $F$ is not convex (c). Finally, when we reach an angle such that the maximal difference between (ℓ) and $F$ over $[e_{ti}, e_{(t-1)i}]$ is $z$ ($z$ can be located at an intersection between $F$ and the dashed line), we stop and obtain the full $\mathcal{I}_{ti}(z)$ (d).

Figure 4: More examples of the sets $\mathcal{I}_{ti}(z)$ (in blue) for the $F$ of Figure 3. (a): $\mathcal{I}_{ti}(z)$ is the union of two intervals, with all candidate offsets non-negative. (b): a single interval with non-positive offsets. (c): at a discontinuity, if $z$ is smaller than the discontinuity, we have no direct solution for $\mathcal{I}_{ti}(z)$ for at least one positioning of the edges, but a simple trick bypasses the difficulty (see text).

so that $\eta_t = \eta(w_t, h_t)$. Remind the weight update, $w_{ti} = -\delta_{v_{(t-1)i}} F(y_i H_{t-1}(\boldsymbol{x}_i))$. We define a "partial" weight update,
$$\tilde{w}_{ti}(\alpha) \doteq -\delta_{v_{(t-1)i}} F(\alpha\, y_i h_t(\boldsymbol{x}_i) + y_i H_{t-1}(\boldsymbol{x}_i)) \tag{17}$$
(if we were to replace $v_{(t-1)i}$ by $v_{ti}$ and let $\alpha \doteq \alpha_t$, then $\tilde{w}_{ti}(\alpha)$ would be $w_{(t+1)i}$; hence the partial weight update). Algorithm 2 presents the simple procedure to find $\alpha_t$. Notice that we write $\tilde{w}$ with sole dependency on the prospective leveraging coefficient; for clarity, we omit the dependences on the current ensemble ($H_\cdot$), weak classifier ($h_\cdot$) and offsets ($v_{\cdot i}$) needed to compute (17).

**Theorem 5.8.** Suppose Assumptions 5.1 and 5.4 hold and $F$ is continuous at all abscissae $\{e_{(t-1)i} \doteq y_i H_{t-1}(\boldsymbol{x}_i), i \in [m]\}$. Then there are always solutions to Step 1 of SOLVEα, and if we let $\alpha_t \leftarrow$ SOLVEα($\mathcal{S}$, $w_t$, $h_t$) and then compute
$$W_{2,t} \doteq \left| \mathbb{E}_{i\sim[m]}\left[ \left(\frac{h_t(\boldsymbol{x}_i)}{M_t}\right)^2 \delta_{\{\alpha_t y_i h_t(\boldsymbol{x}_i),\, v_{(t-1)i}\}} F(e_{(t-1)i}) \right] \right|,$$
then $W_{2,t}$ satisfies (8) and $\alpha_t$ satisfies (6) for some $\epsilon_t > 0$, $\pi_t \in (0,1)$.
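A quick numerical check of the key ingredient of Lemma 5.7: for a β-smooth loss, second-order V-derivatives stay within $2\beta$. The logistic loss is $1/4$-smooth, so the bound here is $1/2$; the random probes are an assumption of this illustration.

```python
# Probe |delta_{b,c} F(a)| <= 2*beta for the (1/4)-smooth logistic loss.
import numpy as np

rng = np.random.default_rng(0)
F = lambda x: np.log1p(np.exp(-x))
beta = 0.25
for _ in range(1000):
    a = 3.0 * rng.normal()
    b, c = rng.choice([-1, 1], size=2) * rng.uniform(0.05, 2.0, size=2)
    d2 = (F(a + b + c) - F(a + b) - F(a + c) + F(a)) / (b * c)  # Lemma B.1, n=2
    assert abs(d2) <= 2.0 * beta + 1e-9
print("Lemma 5.7 bound holds on all probes")
```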
The proof, in Section B.6, proceeds by reducing condition (9) to (16). The Weak Learning Assumption (5.4) is important for the denominator in the LHS of (16) to be non-zero. The continuity assumption at all abscissae is important to have $\lim_{a\to 0} \eta(\tilde{w}_t(a), h_t) = \eta_t$, which ensures the existence of solutions to (16); these are also easy to find, e.g. by a simple dichotomic search starting from an initial guess for $a$. Note that continuity is necessary only at the abscissae defined by the training sample, which is finite in size. Hence, if this condition is not satisfied but the discontinuities of $F$ are of Lebesgue measure 0, it is easy to add an infinitesimal constant to the current weak classifier, ensuring the conditions of Theorem 5.8 and keeping the boosting rates.

### 5.4 Implementation of the offset oracle

Figure 3 explains how to graphically build $\mathcal{I}_{ti}(z)$ for a general $F$. While it is not hard to implement a general procedure following this blueprint (i.e. accepting the loss function as input), it would be far from achieving computational optimality: a much better choice consists in specializing it to the (set of) loss(es) at hand by hardcoding specific optimization features of the desired loss(es). This does not prevent "loss oddities" from having absolutely trivial oracles (see Appendix, Section B.7).

## 6 Discussion

For an efficient implementation, boosting requires specific design choices to make sure the weak learning assumption stands for as long as necessary; experimentally, it is thus a good idea to adapt the weak learner to build more complex models as iterations increase (e.g. learning deeper trees), keeping Assumption 5.4 valid with its advantage over random guessing $\gamma > 0$. In our more general setting, our algorithm SECBOOST pinpoints two more locations that can make use of specific design choices to keep the assumptions standing for a larger number of iterations.

The first is related to handling local minima. When Assumption 5.5 breaks, it means we are close to a local optimum of the loss. One possible way of escaping those local minima is to adapt the offset oracle to output larger offsets (Step 2.5) that get weights computed outside the domain of the local minimum. Such offsets can be used to inform the weak learner of the specific examples that then need to receive larger classification magnitude, something we have already discussed in Section 5. There is also more: the sign of the weight indicates the polarity of the next edge ($\epsilon_{t\cdot}$, (5)) needed to decrease the loss in the interval spanned by the last offset. To simplify, suppose a substantial fraction of examples have an edge $e_{t\cdot}$ in the vicinity of the blue dotted line in Figure 2 (d), so that the loss value is indicated by the big arrow, and suppose their current offset is such that their (positive) weight signals that, to further minimize the loss, the weak learner's next weak classifier has to have a positive edge on these examples. Such is the polarity constraint, which essentially comes down to satisfying the WLA; but there is also a magnitude constraint that comes from the WRA: indeed, if the positive edge is too small, so that the loss ends up in the "bump" region, then there is a risk that the WRA breaks because the loss around the bump is quite flat, so the numerator of $\rho_t$ in Assumption 5.5 can be small. Passing the bump implies escaping the local minimum at which the loss would otherwise be trapped.
Section 5.4 presented a general blueprint for the offset oracle, but more specific implementation designs can be used; some are discussed in the Appendix, Section B.7.

The second location is related to handling losses that take constant values over parts of their domain. To prevent early stopping in Step 2.7 of SECBOOST, one needs $w_{t+1} \ne \boldsymbol{0}$. The update rule of $w_t$ imposes that the loss must then have non-zero variation for some examples between two successive edges (5). If the loss $F$ is constant, then the algorithm obviously stops without learning anything. If $F$ is piecewise constant, this constrains the design of the weak learner, which must make sure that some examples receive a different loss with the new model update $H_\cdot$. As explained in Appendix, Section B.11, this can be efficiently addressed by specific designs of SOLVEα. In the same way as there is no one-size-fits-all weak learner for all domains in traditional boosting, we expect specific design choices to be instrumental in better handling specific losses in our more general setting. Our theory points to two locations on which further work can focus.

## 7 Conclusion

Boosting has rapidly moved to an optimization setting involving first-order information about the loss optimized, rejoining, in terms of the information needed, the hugely popular (stochastic) gradient descent. But this was not a formal requirement of the initial setting, and in this paper we show that essentially any loss function can be boosted without this requirement. From this standpoint, our results put boosting in a slightly more favorable light than recent developments on zeroth-order optimization since, to get boosting-compliant convergence, we do not need the loss to meet any of the assumptions those analyses usually rely on. Of course, recent advances in zeroth-order optimization have also achieved substantial design tricks for the implementation of such algorithms, something that undoubtedly needs to be addressed in our case as well, such as for the efficient optimization of the offset oracle. We leave this as an open problem, but provide in the Appendix toy experiments showing what a straightforward implementation achieves, hinting that SECBOOST can indeed optimize very exotic losses.

## References

[1] A. Akhavan, E. Chzhen, M. Pontil, and A.-B. Tsybakov. A gradient estimator via ℓ1-randomization for online zero-order optimization with two point feedback. In NeurIPS*35, 2022.
[2] A. Akhavan, M. Pontil, and A.-B. Tsybakov. Exploiting higher order smoothness in derivative-free optimization and continuous bandits. In NeurIPS*33, 2020.
[3] A. Akhavan, M. Pontil, and A.-B. Tsybakov. Distributed zero-order optimisation under adversarial noise. In NeurIPS*34, 2021.
[4] N. Alon, A. Gonen, E. Hazan, and S. Moran. Boosting simple learners. In STOC'21, 2021.
[5] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
[6] F. Bach. Learning Theory from First Principles. Course notes, MIT Press (to appear), 2023.
[7] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. In Proc. of the 4th SIAM International Conference on Data Mining, pages 234-245, 2004.
[8] P.-L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.
[9] G. Biau, B. Cadre, and L. Rouvière. Accelerated gradient boosting. Mach. Learn., 108(6):971-992, 2019.
[10] M. Blondel, A.-F. T. Martins, and V. Niculae. Learning with Fenchel-Young losses. J. Mach. Learn. Res., 21:35:1-35:69, 2020.
[11] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comp. Math. and Math. Phys., 7:200-217, 1967.
[12] S. Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 8(3-4):231-357, 2015.
[13] P.-S. Bullen. Handbook of Means and Their Inequalities. Kluwer Academic Publishers, 2003.
[14] H. Cai, Y. Lou, D. McKenzie, and W. Yin. A zeroth-order block coordinate descent algorithm for huge-scale black-box optimization. In 38th ICML, pages 1193-1203, 2021.
[15] C. Cartis and L. Roberts. Scalable subspace methods for derivative-free nonlinear least-squares optimization. Math. Prog., 199:461-524, 2023.
[16] S. Cheamanunkul, E. Ettinger, and Y. Freund. Non-convex boosting overcomes random label noise. CoRR, abs/1409.2905, 2014.
[17] L. Chen, J. Xu, and L. Luo. Faster gradient-free algorithms for nonsmooth nonconvex stochastic optimization. In 40th ICML, volume 202 of Proceedings of Machine Learning Research, pages 5219-5233. PMLR, 2023.
[18] X. Chen, S. Liu, K. Xu, X. Li, X. Lin, M. Hong, and D. Cox. ZO-AdaMM: Zeroth-order adaptive momentum method for black-box optimization. In NeurIPS*32, 2019.
[19] X. Chen, Y. Tang, and N. Li. Improve single-point zeroth-order optimization using high-pass and low-pass filters. In 39th ICML, volume 162 of Proceedings of Machine Learning Research, pages 3603-3620. PMLR, 2022.
[20] S. Cheng, G. Wu, and J. Zhu. On the convergence of prior-guided zeroth-order optimisation algorithms. In NeurIPS*34, 2021.
[21] Z. Cranko and R. Nock. Boosted density estimation remastered. In 36th ICML, pages 1416-1425, 2019.
[22] W. de Vazelhes, H. Zhang, H. Wu, X. Yuan, and B. Gu. Zeroth-order hard-thresholding: Gradient error vs. expansivity. In NeurIPS*35, 2022.
[23] D. Dua and C. Graff. UCI machine learning repository, 2021.
[24] E. Fermi and N. Metropolis. Numerical solutions of a minimum problem. Technical Report TR LA-1492, Los Alamos Scientific Laboratory of the University of California, 1952.
[25] L. Flokas, E.-V. Vlatakis-Gkaragkounis, and G. Piliouras. Efficiently avoiding saddle points with zero order methods: No gradients required. In NeurIPS*32, 2019.
[26] H. Gao and H. Huang. Can stochastic zeroth-order Frank-Wolfe method converge faster for non-convex problems? In 37th ICML, pages 3377-3386, 2020.
[27] A. Héliou, M. Martin, P. Mertikopoulos, and T. Rahier. Zeroth-order non-convex learning via hierarchical dual averaging. In 38th ICML, pages 4192-4202, 2021.
[28] F. Huang, L. Tao, and S. Chen. Accelerated stochastic gradient-free and projection-free methods. In 37th ICML, pages 4519-4530, 2020.
[29] B. Irwin, E. Haber, R. Gal, and A. Ziv. Neural network accelerated implicit filtering: Integrating neural network surrogates with provably convergent derivative-free optimization methods. In 40th ICML, volume 202 of Proceedings of Machine Learning Research, pages 14376-14389. PMLR, 2023.
[30] V. Kac and P. Cheung. Quantum Calculus. Springer, 2002.
[31] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. M.I.T. Press, 1994.
[32] M. J. Kearns. Thoughts on hypothesis boosting, 1988. ML class project.
[33] M. J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. J. Comp. Syst. Sc., 58:109-128, 1999.
[34] J. Larson, M. Menickelly, and S.-M. Wild. Derivative-free optimization methods. Acta Numerica, pages 287-404, 2019.
[35] Z. Li, P.-Y. Chen, S. Liu, S. Lu, and Y. Xu. Zeroth-order optimization for composite problems with functional constraints. In AAAI'22, pages 7453-7461. AAAI Press, 2022.
[36] T. Lin, Z. Zheng, and M.-I. Jordan. Gradient-free methods for deterministic and stochastic nonsmooth nonconvex optimization. In NeurIPS*35, 2022.
[37] P.-M. Long and R.-A. Servedio. Random classification noise defeats all convex potential boosters. MLJ, 78(3):287-304, 2010.
[38] C. Maheshwari, C.-Y. Chiu, E. Mazumdar, S. Shankar Sastry, and L.-J. Ratliff. Zeroth-order methods for convex-concave minmax problems: applications to decision-dependent risk minimization. In 25th AISTATS, 2022.
[39] Y. Mansour, R. Nock, and R.-C. Williamson. Random classification noise does not defeat all convex potential boosters irrespective of model choice. In 40th ICML, 2023.
[40] E. Mhanna and M. Assaad. Single point-based distributed zeroth-order optimization with a non-convex stochastic objective function. In 40th ICML, volume 202 of Proceedings of Machine Learning Research, pages 24701-24719. PMLR, 2023.
[41] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
[42] Y. Nesterov and V. Spokoiny. Random gradient-free optimization of convex functions. Foundations of Computational Mathematics, 17:527-566, 2017.
[43] F. Nielsen and R. Nock. The Bregman chord divergence. In Geometric Science of Information, 4th International Conference, pages 299-308, 2019.
[44] R. Nock and A. K. Menon. Supervised learning: No loss no cry. In 37th ICML, 2020.
[45] R. Nock and R.-C. Williamson. Lossless or quantized boosting with integer arithmetic. In 36th ICML, pages 4829-4838, 2019.
[46] M.-E. Pfetsch and S. Pokutta. IPBoost: non-convex boosting via integer programming. In 37th ICML, volume 119, pages 7663-7672, 2020.
[47] Y. Qiu, U.-V. Shanbhag, and F. Yousefian. Zeroth-order methods for nondifferentiable, nonconvex and hierarchical federated optimization. In NeurIPS*36, 2023.
[48] M. Rando, C. Molinari, L. Rosasco, and S. Villa. Structured zeroth-order for non-smooth optimization. In NeurIPS*36, 2023.
[49] M.-D. Reid and R.-C. Williamson. Information, divergence and risk for binary experiments. JMLR, 12:731-817, 2011.
[50] Z. Ren, Y. Tang, and N. Li. Escaping saddle points in zeroth-order optimization: the power of two-point estimators. In 40th ICML, volume 202 of Proceedings of Machine Learning Research, pages 28914-28975. PMLR, 2023.
[51] A.-K. Sahu, M. Zaheer, and S. Kar. Towards gradient free and projection free stochastic optimization. In 22nd AISTATS, pages 3468-3477, 2019.
[52] W. Shi, H. Gao, and B. Gu. Gradient-free method for heavily constrained nonconvex optimization. In 39th ICML, volume 162 of Proceedings of Machine Learning Research, pages 19935-19955. PMLR, 2022.
[53] M.-K. Warmuth and S. V. N. Vishwanathan. Tutorial: Survey of boosting from an optimization perspective. In 26th ICML, 2009.
[54] T. Werner and P. Ruckdeschel. The column measure and gradient-free gradient boosting, 2019.
[55] H. Zhang and B. Gu. Faster gradient-free methods for escaping saddle points. In ICLR'23. OpenReview.net, 2023.
[56] H. Zhang, H. Xiong, and B. Gu. Zeroth-order negative curvature finding: Escaping saddle points without gradients. In NeurIPS*35, 2022.

## Appendix

To differentiate from the numberings in the main file, the numbering of Theorems, etc. in this Appendix is letter-based (A, B, ...).
Table of contents:
- A quick summary of recent zeroth-order optimization approaches (Section A)
- Supplementary material on proofs (Section B)
  - Helper results (B.1)
  - Removing the 0 part in Assumption 5.1 (B.2)
  - Proof of Lemma 5.2 (B.3)
  - Proof of Theorem 5.3 (B.4)
  - Proof of Lemma 5.7 (B.5)
  - Proof of Theorem 5.8 (B.6)
  - Implementation of the offset oracle (B.7)
  - Proof of Lemma B.5 (B.8)
  - Handling discontinuities in the offset oracle to prevent stopping in Step 2.5 of SECBOOST (B.9)
  - A boosting pattern that can "survive" above differentiability (B.10)
  - The case of piecewise constant losses for SOLVEα (B.11)
- Supplementary material on algorithms, implementation tricks and a toy experiment

Table 1: Summary of the formal assumptions about the loss $F$ used to prove algorithms' convergence in recent papers on zeroth-order optimization, in different ML settings (see text for details). We use "smoothness" as a portmanteau for various conditions on the $\ge 1$st-order differentiability of $F$. "conv." = convex, "diff." = differentiable, "Lip." = Lipschitz, "Lb" = lower-bounded, "alt. GD" = general alternative to gradient descent (stochastic or not), "alt. FW" = idem for Frank-Wolfe. Our paper relies on no such assumptions.

| main reference | ML topic |
|---|---|
| [2] | online ML |
| [3] | distributed ML |
| [1] | online ML |
| [14] | alt. GD |
| [15] | alt. GD |
| [18] | alt. GD |
| [17] | alt. GD |
| [19] | alt. GD |
| [20] | alt. GD |
| [25] | saddle pt opt |
| [26] | alt. FW |
| [28] | alt. FW |
| [22] | alt. GD |
| [27] | online ML |
| [29] | deep ML |
| [35] | alt. GD |
| [36] | saddle pt opt |
| [38] | saddle pt opt |
| [40] | distributed ML |
| [48] | alt. GD |
| [47] | federated ML |
| [50] | saddle pt opt |
| [51] | alt. FW |
| [52] | alt. GD |
| [55] | saddle pt opt |
| [56] | saddle pt opt |

## A A quick summary of recent zeroth-order optimization approaches

Table 1 summarizes a few dozen recent approaches related to zeroth-order optimization in various ML topics. Note that none of these approaches focus on boosting.

## B Supplementary material on proofs

### B.1 Helper results

We now show that the order of the elements of $V$ does not matter when computing the V-derivative of Definition 4.2. For any $\sigma \in \{0,1\}^n$, we let $1_\sigma \doteq \sum_i \sigma_i$.

**Lemma B.1.** For any $z \in \mathbb{R}$, any $n \in \mathbb{N}_*$ and any $V \doteq \{v_1, v_2, \ldots, v_n\} \subset \mathbb{R}_*$,
$$\delta_V F(z) = \frac{\sum_{\sigma\in\{0,1\}^n} (-1)^{n - 1_\sigma}\, F\!\left(z + \sum_{i=1}^n \sigma_i v_i\right)}{\prod_{i=1}^n v_i}. \tag{18}$$
Hence, $\delta_V F$ is invariant to permutations of the elements of $V$.

Proof. We show the result by induction on the size of $V$, first noting that
$$\delta_{\{v_1\}} F(z) = \delta_{v_1} F(z) \doteq \frac{F(z+v_1) - F(z)}{v_1} = \frac{1}{\prod_{i=1}^1 v_i} \sum_{\sigma\in\{0,1\}} (-1)^{1-1_\sigma}\, F(z + \sigma v_1). \tag{19}$$
We then assume that (18) holds for $V_n \doteq \{v_1, v_2, \ldots, v_n\}$ and show the result for $V_{n+1} \doteq V_n \cup \{v_{n+1}\}$, writing (induction hypothesis used in the second identity):
$$\begin{aligned}
\delta_{V_{n+1}} F(z) &\doteq \frac{\delta_{V_n} F(z + v_{n+1}) - \delta_{V_n} F(z)}{v_{n+1}} \\
&= \frac{\sum_{\sigma\in\{0,1\}^n} (-1)^{n-1_\sigma} F\!\left(z + \sum_{i=1}^n \sigma_i v_i + v_{n+1}\right) - \sum_{\sigma\in\{0,1\}^n} (-1)^{n-1_\sigma} F\!\left(z + \sum_{i=1}^n \sigma_i v_i\right)}{\prod_{i=1}^{n+1} v_i} \\
&= \frac{\sum_{\sigma\in\{0,1\}^n} (-1)^{n-1_\sigma} F\!\left(z + \sum_{i=1}^n \sigma_i v_i + v_{n+1}\right) + \sum_{\sigma\in\{0,1\}^n} (-1)^{n+1-1_\sigma} F\!\left(z + \sum_{i=1}^n \sigma_i v_i\right)}{\prod_{i=1}^{n+1} v_i} \\
&= \frac{\sum_{\sigma'\in\{0,1\}^{n+1} : \sigma'_{n+1}=1} (-1)^{n-(1_{\sigma'}-1)} F\!\left(z + \sum_{i=1}^{n+1} \sigma'_i v_i\right) + \sum_{\sigma'\in\{0,1\}^{n+1} : \sigma'_{n+1}=0} (-1)^{n+1-1_{\sigma'}} F\!\left(z + \sum_{i=1}^{n+1} \sigma'_i v_i\right)}{\prod_{i=1}^{n+1} v_i} \\
&= \frac{\sum_{\sigma'\in\{0,1\}^{n+1}} (-1)^{n+1-1_{\sigma'}} F\!\left(z + \sum_{i=1}^{n+1} \sigma'_i v_i\right)}{\prod_{i=1}^{n+1} v_i}, \tag{20}
\end{aligned}$$
as claimed.

We also have the following simple Lemma, a direct consequence of Lemma B.1.

**Lemma B.2.** For all $z \in \mathbb{R}$, $v, z' \in \mathbb{R}_*$, we have
$$\delta_v F(z + z') = \delta_v F(z) + z' \cdot \delta_{\{z',v\}} F(z). \tag{21}$$
Proof. It comes from Lemma B.1 that $\delta_{\{z',v\}} F(z) = \delta_{\{v,z'\}} F(z) = (\delta_v F(z+z') - \delta_v F(z))/z'$ (and we reorder terms).
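Lemma B.1's closed form is easy to verify numerically; the following sketch compares the recursive Definition 4.2 with the inclusion-exclusion sum (18) and illustrates the permutation invariance (the test values are ours).

```python
# Check (18): the V-derivative equals an inclusion-exclusion sum over subsets
# of offsets, hence is invariant to the order of the elements of V.
import itertools, math
import numpy as np

def v_deriv_recursive(F, z, V):
    if not V:
        return F(z)
    *rest, vn = V
    return (v_deriv_recursive(F, z + vn, rest) - v_deriv_recursive(F, z, rest)) / vn

def v_deriv_closed(F, z, V):
    n = len(V)
    tot = 0.0
    for sigma in itertools.product((0, 1), repeat=n):
        tot += (-1) ** (n - sum(sigma)) * F(z + sum(s * v for s, v in zip(sigma, V)))
    return tot / math.prod(V)

F = lambda x: np.log1p(np.exp(-x))
V = [0.7, -0.3, 1.1]
print(v_deriv_recursive(F, 0.2, V), v_deriv_closed(F, 0.2, V),
      v_deriv_closed(F, 0.2, V[::-1]))  # all three agree
```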
### B.2 Removing the 0 part in Assumption 5.1

Because everything needs to be encoded, finiteness is not really an assumption. However, the non-zero assumption may be seen as limiting (unless we are happy to use first-order information about the loss, Section 5). There is a simple trick to remove it. Suppose $h_t$ zeroes on some training examples. The training sample being finite, there exists an open neighborhood $I$ of 0 such that $h'_t \doteq h_t + \delta$ no longer zeroes on training examples, for any $\delta \in I$. This changes the advantage $\gamma$ in the WLA (Assumption 5.4) to some $\gamma'$ satisfying (we assume $\delta > 0$ wlog)
$$\gamma' \ge \frac{\gamma M_t}{M_t + \delta} - \frac{\delta}{M_t + \delta},$$
from which it is enough to pick $\delta \le \epsilon\gamma M_t/(1+\gamma)$ to guarantee advantage $\gamma' \ge (1-\epsilon)\gamma$. If $\epsilon$ is a constant, this translates into a number of boosting iterations in Corollary 5.6 affected by a constant factor that we can choose as close to 1 as desired.

### B.3 Proof of Lemma 5.2

We reformulate
$$\delta_{\{b,c\}} F(a) = \frac{2}{bc}\left(\frac{F(a+b+c) + F(a)}{2} - \frac{F(a+b) + F(a+c)}{2}\right) = \frac{2(\mu_2 - \mu_1)}{bc}. \tag{22}$$

Figure 5: Left: representation of the difference of averages in (22). Each of the secants $(s_1)$ and $(s_2)$ can take either the red or black segment; which one is which depends on the signs of $c$ and $b$, but the general configuration is always the same. Note that if $F$ is convex, one necessarily sits above the other, which is the crux of the proof of Lemma 5.2. For the sake of illustration, suppose we can analytically have $b, c \to 0$. As $c$ converges to 0 but $b$ remains $> 0$, $\delta_{\{b,c\}} F(a)$ becomes proportional to the variation of the average secant midpoint; the subsequent convergence of $b$ to 0 then makes $\delta_{\{b,c\}} F(a)$ converge to the second-order derivative of $F$ at $a$. Right: in the special case where $F$ is convex, one of the secants always sits above the other.

Both $\mu_1$ and $\mu_2$ are averages that can be computed from the midpoints of two secants (respectively):
$$(s_1) \doteq [(a+c, F(a+c)), (a+b, F(a+b))], \qquad (s_2) \doteq [(a, F(a)), (a+b+c, F(a+b+c))].$$
The midpoints of both secants have the same abscissa (and their ordinates are $\mu_1$ and $\mu_2$), so to study the sign of $\delta_{\{b,c\}} F(a)$, we can study the position of the two secants with respect to each other. $F$ being convex, we show that the abscissae of one secant are included in the abscissae of the other, this being sufficient to give the position of both secants with respect to each other. We distinguish four cases.
Case 1: $c > 0, b > 0$. We have $a+b+c > \max\{a+b, a+c\}$ and $a < \min\{a+b, a+c\}$. $F$ being convex, $(s_2)$ sits above $(s_1)$. So $\mu_2 \ge \mu_1$, and finally $\delta_{\{b,c\}} F(a) \ge 0$.
Case 2: $c < 0, b < 0$. We now have $a+b+c < \min\{a+b, a+c\}$ while $a > \max\{a+b, a+c\}$, so $(s_2)$ sits above $(s_1)$. Again, $\mu_2 \ge \mu_1$ and finally $\delta_{\{b,c\}} F(a) \ge 0$.
Case 3: $c > 0, b < 0$. We have $a+b < a$ and $a+b < a+b+c$. Also $a+c > \max\{a+b+c, a\}$, so this time $(s_2)$ sits below $(s_1)$; but $cb < 0$, so $\delta_{\{b,c\}} F(a) \ge 0$ again.
Case 4: $c < 0, b > 0$. So $a+c < a < a+b$ and $a+c < a+b+c$. So $a+c < \min\{a, a+b+c\}$ and $a+b > \max\{a, a+b+c\}$, so $(s_2)$ sits below $(s_1)$. Since $cb < 0$, $\delta_{\{b,c\}} F(a) \ge 0$ again.

### B.4 Proof of Theorem 5.3

Let us remind key simplified notations about edges, $\forall t \ge 0$:
$$e_{ti} \doteq y_i H_t(\boldsymbol{x}_i), \tag{23}$$
$$\epsilon_{ti} \doteq y_i \alpha_t h_t(\boldsymbol{x}_i) = e_{ti} - e_{(t-1)i}. \tag{24}$$
For short, we also let:
$$\tilde{Q}_{ti} \doteq \tilde{Q}_F(e_{ti}, e_{(t-1)i}, v_{(t-1)i}), \tag{25}$$
$$\Delta_{ti} \doteq \delta_{v_{(t-1)i}} F(e_{ti}) - \delta_{v_{(t-1)i}} F(e_{(t-1)i}), \tag{26}$$
where $\tilde{Q}_F(\cdot,\cdot,\cdot)$ is defined in (4). We also split the computation of the leveraging coefficient $\alpha_t$ in SECBOOST into two parts, the first computing a real $a_t$ as
$$a_t \in \frac{1}{2(1+\epsilon_t)\, M_t^2\, W_{2,t}} \cdot [1-\pi_t, 1+\pi_t], \tag{27}$$
and then using $\alpha_t \leftarrow a_t \eta_t$. We now use Lemma 4.7 (main file) and get
$$\mathbb{E}_{i\sim[m]}\left[ S_{F|v_{ti}}(e_{ti} \,\|\, e_{(t+1)i}) \right] \ge -\mathbb{E}_{i\sim[m]}\big[ \tilde{Q}_{(t+1)i} \big], \quad \forall t \ge 0. \tag{28}$$
If we reorganise (28) using the definition of $S_{F|\cdot}(\cdot\|\cdot)$, we get:
$$\begin{aligned}
\mathbb{E}_{i\sim[m]}\big[F(e_{(t+1)i})\big] &\le \mathbb{E}_{i\sim[m]}[F(e_{ti})] - \mathbb{E}_{i\sim[m]}\big[(e_{ti} - e_{(t+1)i})\,\delta_{v_{ti}} F(e_{(t+1)i})\big] + \mathbb{E}_{i\sim[m]}\big[\tilde{Q}_{(t+1)i}\big] \\
&= \mathbb{E}_{i\sim[m]}[F(e_{ti})] + \mathbb{E}_{i\sim[m]}\big[\epsilon_{(t+1)i}\,\delta_{v_{ti}} F(e_{(t+1)i})\big] + \mathbb{E}_{i\sim[m]}\big[\tilde{Q}_{(t+1)i}\big] \tag{29} \\
&= \mathbb{E}_{i\sim[m]}[F(e_{ti})] + \alpha_{t+1}\,\mathbb{E}_{i\sim[m]}\big[y_i h_{t+1}(\boldsymbol{x}_i)\,\delta_{v_{ti}} F(e_{(t+1)i})\big] + \mathbb{E}_{i\sim[m]}\big[\tilde{Q}_{(t+1)i}\big] \tag{30} \\
&= \mathbb{E}_{i\sim[m]}[F(e_{ti})] + a_{t+1}\eta_{t+1}\,\mathbb{E}_{i\sim[m]}\big[y_i h_{t+1}(\boldsymbol{x}_i)\,\delta_{v_{ti}} F(e_{ti})\big] + a_{t+1}\eta_{t+1}\,\mathbb{E}_{i\sim[m]}\big[y_i h_{t+1}(\boldsymbol{x}_i)\,\Delta_{(t+1)i}\big] + \mathbb{E}_{i\sim[m]}\big[\tilde{Q}_{(t+1)i}\big] \tag{31} \\
&= \mathbb{E}_{i\sim[m]}[F(e_{ti})] - a_{t+1}\eta_{t+1}\,\mathbb{E}_{i\sim[m]}\big[w_{(t+1)i}\, y_i h_{t+1}(\boldsymbol{x}_i)\big] + a_{t+1}\eta_{t+1}\,\mathbb{E}_{i\sim[m]}\big[y_i h_{t+1}(\boldsymbol{x}_i)\,\Delta_{(t+1)i}\big] + \mathbb{E}_{i\sim[m]}\big[\tilde{Q}_{(t+1)i}\big] \\
&= \mathbb{E}_{i\sim[m]}[F(e_{ti})] - a_{t+1}\eta_{t+1}^2 + a_{t+1}\eta_{t+1}\,\mathbb{E}_{i\sim[m]}\big[y_i h_{t+1}(\boldsymbol{x}_i)\,\Delta_{(t+1)i}\big] + \mathbb{E}_{i\sim[m]}\big[\tilde{Q}_{(t+1)i}\big]. \tag{32}
\end{aligned}$$
(29)-(31) make use of definitions (24) (twice) and (26), as well as the decomposition of the leveraging coefficient in (27). Looking at (32), we see that we can get a boosting-compliant decrease of the loss if the two quantities depending on $\Delta_{(t+1)\cdot}$ and $\tilde{Q}_{(t+1)\cdot}$ can be made small enough compared to $a_{t+1}\eta_{t+1}^2$. This is what we now investigate.

**Bounding the term depending on $\Delta_{(t+1)\cdot}$.** We use Lemma B.2 with $z \doteq e_{ti}$, $z' \doteq \epsilon_{(t+1)i}$, $v \doteq v_{ti}$, which yields (also using (24) and the assumption that $h_{t+1}(\boldsymbol{x}_i) \ne 0$):
$$\Delta_{(t+1)i} = \delta_{v_{ti}} F(e_{ti} + \epsilon_{(t+1)i}) - \delta_{v_{ti}} F(e_{ti}) = \epsilon_{(t+1)i}\,\delta_{\{\epsilon_{(t+1)i},\, v_{ti}\}} F(e_{ti}) = y_i\alpha_{t+1} h_{t+1}(\boldsymbol{x}_i)\,\delta_{\{\epsilon_{(t+1)i},\, v_{ti}\}} F(e_{ti}), \tag{33}$$
and so we get:
$$a_{t+1}\eta_{t+1}\,\mathbb{E}_{i\sim[m]}\big[y_i h_{t+1}(\boldsymbol{x}_i)\,\Delta_{(t+1)i}\big] = a_{t+1}^2\eta_{t+1}^2\,\mathbb{E}_{i\sim[m]}\big[(h_{t+1}(\boldsymbol{x}_i))^2\,\delta_{\{\epsilon_{(t+1)i},\, v_{ti}\}} F(e_{ti})\big] \le a_{t+1}^2\eta_{t+1}^2\, M_{t+1}^2\, W_{2,t+1}. \tag{34}$$

**Bounding the term depending on $\tilde{Q}_{(t+1)\cdot}$.** We immediately get from the value picked as the argument of $\mathcal{I}_{(t+1)\cdot}$ in Step 2.5 of SECBOOST, the definition of $\mathcal{I}_{ti}(\cdot)$ in (10) and our decomposition $\alpha_t \leftarrow a_t\eta_t$ that $\tilde{Q}_{(t+1)i} \le \epsilon_{t+1}\, a_{t+1}^2\eta_{t+1}^2\, M_{t+1}^2\, W_{2,t+1}$, $\forall i \in [m]$, so that:
$$\mathbb{E}_{i\sim[m]}\big[\tilde{Q}_{(t+1)i}\big] \le \epsilon_{t+1}\, a_{t+1}^2\eta_{t+1}^2\, M_{t+1}^2\, W_{2,t+1}. \tag{35}$$

**Finishing up the proof.** Suppose we choose $\epsilon_{t+1} > 0$, $\pi_{t+1} \in (0,1)$ and $a_{t+1}$ as in (27). We then get from (32), (34), (35) that for any choice of $v_{ti}$ in Step 2.5 of SECBOOST,
$$\begin{aligned}
\mathbb{E}_{i\sim[m]}\big[F(e_{(t+1)i})\big] &\le \mathbb{E}_{i\sim[m]}[F(e_{ti})] - a_{t+1}\eta_{t+1}^2 + a_{t+1}^2\eta_{t+1}^2 M_{t+1}^2 W_{2,t+1} + \epsilon_{t+1}\, a_{t+1}^2\eta_{t+1}^2 M_{t+1}^2 W_{2,t+1} \\
&= \mathbb{E}_{i\sim[m]}[F(e_{ti})] - a_{t+1}\eta_{t+1}^2\big(1 - a_{t+1}(1+\epsilon_{t+1})\, M_{t+1}^2\, W_{2,t+1}\big) \\
&\le \mathbb{E}_{i\sim[m]}[F(e_{ti})] - \frac{\eta_{t+1}^2\,(1 - \pi_{t+1}^2)}{4(1+\epsilon_{t+1})\, M_{t+1}^2\, W_{2,t+1}}, \tag{36}
\end{aligned}$$
where the last inequality is a consequence of (27). Suppose we pick $H_0 \doteq h_0 \in \mathbb{R}$ a constant and $v_0 \ne 0$ such that
$$\delta_{v_0} F(h_0) \ne 0. \tag{37}$$
The final classifier $H_T$ of SECBOOST then satisfies:
$$\mathbb{E}_{i\sim[m]}[F(y_i H_T(\boldsymbol{x}_i))] \le F_0 - \frac{1}{4}\sum_{t\in[T]} \frac{\eta_t^2\,(1-\pi_t^2)}{(1+\epsilon_t)\, M_t^2\, W_{2,t}}, \tag{38}$$
with $F_0 \doteq \mathbb{E}_{i\sim[m]}[F(e_{0i})] = \mathbb{E}_{i\sim[m]}[F(y_i H_0)] = \mathbb{E}_{i\sim[m]}[F(y_i h_0)]$. If we want $\mathbb{E}_{i\sim[m]}[F(y_i H_T(\boldsymbol{x}_i))] \le F(z_*)$, assuming wlog $F(z_*) \le F_0$, then it suffices to iterate until:
$$\sum_{t\in[T]} \frac{1-\pi_t^2}{W_{2,t}\,(1+\epsilon_t)} \cdot \frac{\eta_t^2}{M_t^2} \ge 4(F_0 - F(z_*)). \tag{39}$$
Remind that the edge $\eta_t$ is not normalized. We have defined the normalized edge
$$[-1,1] \ni \tilde{\eta}_t \doteq \sum_i \frac{|w_{ti}|}{W_t}\,\tilde{y}_{ti}\,\frac{h_t(\boldsymbol{x}_i)}{M_t}, \quad \text{with } \tilde{y}_{ti} \doteq y_i\cdot\mathrm{sign}(w_{ti}) \text{ and } W_t \doteq \sum_i |w_{ti}| = \sum_i |\delta_{v_{(t-1)i}} F(e_{(t-1)i})|. \tag{40}$$
We have the simple relationship between $\tilde{\eta}_t$ and $\eta_t$:
$$\tilde{\eta}_t = \sum_i \frac{|w_{ti}|}{W_t}\,(y_i\,\mathrm{sign}(w_{ti}))\,\frac{h_t(\boldsymbol{x}_i)}{M_t} = \frac{\sum_i w_{ti}\, y_i h_t(\boldsymbol{x}_i)}{W_t M_t} = \frac{m}{W_t M_t}\cdot\eta_t, \tag{41}$$
resulting in ($\forall t \ge 1$):
$$\frac{\eta_t^2}{M_t^2} = \tilde{\eta}_t^2\cdot\left(\frac{W_t}{m}\right)^2 = \tilde{\eta}_t^2\cdot\Big(\mathbb{E}_{i\sim[m]}\big[|\delta_{v_{(t-1)i}} F(e_{(t-1)i})|\big]\Big)^2 \ge \tilde{\eta}_t^2\cdot\Big|\mathbb{E}_{i\sim[m]}\big[\delta_{v_{(t-1)i}} F(e_{(t-1)i})\big]\Big|^2 = \tilde{\eta}_t^2\, W_{1,t}^2, \tag{42}$$
recalling $W_{1,t} \doteq |\mathbb{E}_{i\sim[m]}[\delta_{v_{(t-1)i}} F(e_{(t-1)i})]|$. It comes from (42) that a sufficient condition for (39) to hold is:
$$\sum_{t\in[T]} \frac{W_{1,t}^2\,(1-\pi_t^2)}{W_{2,t}\,(1+\epsilon_t)}\cdot\tilde{\eta}_t^2 \ge 4(F_0 - F(z_*)), \tag{43}$$
which is the statement of Theorem 5.3.
### B.5 Proof of Lemma 5.7

We first observe that for any $a \in \mathbb{R}$, $b, c \in \mathbb{R}_*$,
$$\begin{aligned}
|\delta_{\{b,c\}} F(a)| &= \frac{1}{|bc|}\Big| F(a+b+c) - F(a+c) - b F'(a+c) - \big(F(a+b) - F(a) - b F'(a)\big) + b\big(F'(a+c) - F'(a)\big) \Big| \\
&\le \frac{1}{|bc|}\Big( |F(a+b+c) - F(a+c) - b F'(a+c)| + |F(a+b) - F(a) - b F'(a)| + |b(F'(a+c) - F'(a))| \Big) \\
&\le \frac{1}{|bc|}\left( \frac{\beta}{2} b^2 + \frac{\beta}{2} b^2 + \beta|bc| \right) = \beta + \beta\,\frac{|b|}{|c|}, \tag{44}
\end{aligned}$$
where we used the β-smoothness of $F$ and, twice, [12, Lemma 3.4]. We can also apply a permutation in the expression of $\delta_{\{b,c\}} F(a)$ and instead write
$$|\delta_{\{b,c\}} F(a)| = \frac{1}{|bc|}\Big| F(a+b+c) - F(a+b) - c F'(a+b) - \big(F(a+c) - F(a) - c F'(a)\big) + c\big(F'(a+b) - F'(a)\big) \Big| \le \frac{1}{|bc|}\left( \frac{\beta}{2} c^2 + \frac{\beta}{2} c^2 + \beta|bc| \right) = \beta + \beta\,\frac{|c|}{|b|}. \tag{45}$$
We thus have
$$|\delta_{\{b,c\}} F(a)| \le \beta + \beta\cdot\frac{\min\{|b|,|c|\}}{\max\{|b|,|c|\}} \le 2\beta \tag{46}$$
by the power mean inequality [13, Chapter III, Theorem 2]. Since $|h_t(\boldsymbol{x}_i)| \le M_t$ by definition, we thus have
$$\left| \mathbb{E}_{i\sim[m]}\left[ \delta_{\{\epsilon_{ti},\, v_{(t-1)i}\}} F(e_{(t-1)i})\cdot\left(\frac{h_t(\boldsymbol{x}_i)}{M_t}\right)^2 \right] \right| \le 2\beta, \tag{47}$$
which allows us to fix $W_{2,t} = 2\beta$ and completes the proof of Lemma 5.7.

**Remark B.3.** Our result is optimal in the sense that if we make one offset (say $b$) go to zero, then the ratio in (46) goes to zero and we recover the condition on the v-derivative of the derivative, $|\delta_c F'(z)| \le \beta$.

### B.6 Proof of Theorem 5.8

We consider the upper bound:
$$\begin{aligned}
W_{2,t} &\doteq \left| \mathbb{E}_{i\sim[m]}\left[ \left(\frac{h_t(\boldsymbol{x}_i)}{M_t}\right)^2 \delta_{\{\epsilon_{ti},\, v_{(t-1)i}\}} F(e_{(t-1)i}) \right] \right| \\
&= \left| \mathbb{E}_{i\sim[m]}\left[ \left(\frac{h_t(\boldsymbol{x}_i)}{M_t}\right)^2 \cdot \frac{1}{\epsilon_{ti}}\left( \frac{F(e_{ti} + v_{(t-1)i}) - F(e_{ti})}{v_{(t-1)i}} - \frac{F(e_{(t-1)i} + v_{(t-1)i}) - F(e_{(t-1)i})}{v_{(t-1)i}} \right) \right] \right| \\
&= \frac{1}{|\alpha_t|}\left| \mathbb{E}_{i\sim[m]}\left[ \frac{y_i h_t(\boldsymbol{x}_i)}{M_t^2}\Big( \delta_{v_{(t-1)i}} F(e_{ti}) - \delta_{v_{(t-1)i}} F(e_{(t-1)i}) \Big) \right] \right| \tag{48}
\end{aligned}$$
(the last identity uses the fact that $y_i \in \{-1,1\}$). Remark that we have extracted $\alpha_t$ from the denominator, but it is still present in the arguments $e_{ti}$. For any classifier $h$, we introduced the notation $\eta(w, h) \doteq \mathbb{E}_{i\sim[m]}[w_i y_i h(\boldsymbol{x}_i)]$, so that $\eta_t$ (Step 2.2 of SECBOOST) is also $\eta(w_t, h_t)$, which is guaranteed to be non-zero by the Weak Learning Assumption (5.4). We want, for some $\epsilon_t > 0$, $\pi_t \in [0,1)$:
$$\alpha_t \in \frac{\eta_t}{2(1+\epsilon_t)\, M_t^2\, W_{2,t}}\cdot[1-\pi_t, 1+\pi_t]. \tag{49}$$
This says that the sign of $\alpha_t$ is the same as the sign of $\eta(w_t, h_t) = \eta_t$. Since we know its sign, let us look for its absolute value:
$$|\alpha_t| \in \frac{|\eta_t|}{2(1+\epsilon_t)\, M_t^2\, W_{2,t}}\cdot[1-\pi_t, 1+\pi_t]. \tag{50}$$
From (9) (main file), we can in fact search for $\alpha_t$ in the union of all such intervals over $\epsilon_t > 0$, $\pi_t \in [0,1)$, which amounts to first finding
$$|\alpha_t| \in \left(0, \frac{|\eta_t|}{M_t^2\, W_{2,t}}\right) \tag{51}$$
and then finding any $\epsilon_t > 0$, $\pi_t \in [0,1)$ such that (50) holds. Using (48) and simplifying the external dependency on $\alpha_t$, condition (51) becomes
$$\Big| \mathbb{E}_{i\sim[m]}\big[ y_i h_t(\boldsymbol{x}_i)\big( \delta_{v_{(t-1)i}} F(\alpha_t y_i h_t(\boldsymbol{x}_i) + e_{(t-1)i}) - \delta_{v_{(t-1)i}} F(e_{(t-1)i}) \big) \big] \Big| < |\eta_t|,$$
under the constraint that the sign of $\alpha_t$ be the same as that of $\eta_t$. But, using notation (17) (main file), the LHS is $|\eta(w_t, h_t) - \eta(\tilde{w}_t(\alpha_t), h_t)|$, and so to get (51) satisfied it is sufficient that
$$\frac{|\eta_t - \eta(\tilde{w}_t(\alpha_t), h_t)|}{|\eta_t|} < 1, \tag{52}$$
which is Step 1 of SOLVEα. The Weak Learning Assumption (5.4) guarantees that the denominator is non-zero, so this can always be evaluated. The continuity of $F$ at all $e_{(t-1)i}$ guarantees $\lim_{\alpha_t\to 0}\eta(\tilde{w}_t(\alpha_t), h_t) = \eta_t$, and thus the existence of solutions to (52) for some $|\alpha_t| > 0$. To summarize, finding $\alpha_t$ can be done in two steps: (i) solve (52) with $\alpha_t = \mathrm{sign}(\eta_t)\cdot a$ for some $a > 0$, and (ii) let $\alpha_t \doteq \mathrm{sign}(\eta_t)\cdot a$. This is the output of SOLVEα($\mathcal{S}$, $w_t$, $h_t$), which ends the proof of Theorem 5.8.

### B.7 Implementation of the offset oracle: particular cases

Consider the "spring loss" that we define, for $[\cdot]$ denoting the nearest integer, as:
$$F_{\mathrm{SL}}(z) \doteq \log(1 + \exp(-z)) + 1 - \sqrt{1 - 4(z - [z])^2}. \tag{53}$$
Figure 6 plots this loss, which composes the logistic loss with a "U"-shaped term.
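The spring loss (53) in code, to make the shape concrete; reading the bracket as round-half-to-even nearest integer is our assumption.

```python
# The spring loss: logistic loss plus a periodic "U" term. Non-convex,
# non-Lipschitz (infinite slope at half-integers), non-differentiable there,
# with a local minimum near every integer.
import numpy as np

def spring_loss(z):
    z = np.asarray(z, dtype=float)
    u = z - np.rint(z)                     # z - [z], always in [-1/2, 1/2]
    return np.log1p(np.exp(-z)) + 1.0 - np.sqrt(1.0 - 4.0 * u * u)

print(spring_loss([0.0, 0.25, 0.5, 1.0, 2.0]))
```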
This loss would escape all optimization algorithms of Table 1 (Appendix), yet there is a trivial implementation of our offset oracle, as explained in Figure 6:

1. if the interval $I$ defined by $\tilde{e}_{(t-1)i}$ and $\tilde{e}_{ti}$ contains at least one peak, compute the tangency point $z_t$ at the closest local "U" that passes through $(\tilde{e}_{(t-1)i}, F(\tilde{e}_{(t-1)i}))$; then if $z_t \in I$, set $v_{ti} \leftarrow z_t - \tilde{e}_{(t-1)i}$, else $v_{ti} \leftarrow \tilde{e}_{ti} - \tilde{e}_{(t-1)i}$;
2. otherwise, $F$ restricted to $I$ is strictly convex and differentiable: a simple dichotomic search can retrieve a feasible $v_{ti}$ (see the case of convex losses below).

Notice that one can alleviate the repetitive dichotomic search by pre-tabulating a feasible $v$ for a set of differences $|a-b|$ ($a, b$ belonging to the abscissae of the same "U") decreasing by a fixed factor, choosing for $v_{ti}$ the $v$ of the largest tabulated $|a-b|$ no larger than $|\tilde{e}_{ti} - \tilde{e}_{(t-1)i}|$.

Discontinuities. Discontinuities do not represent issues if the argument $z$ of $I_{ti}(z)$ is large enough, as shown by the following simple Lemma.

Lemma B.4. Define the discontinuity of $F$ as:

$$\mathrm{disc}(F) \doteq \max\left\{\sup_z \left|F(z) - \lim_{z'\to z^-}F(z')\right|,\; \sup_z \left|F(z) - \lim_{z'\to z^+}F(z')\right|\right\}.$$

For any $z \ge 0$, if $\mathrm{disc}(F) \le z$ then $I_{ti}(z) \ne \emptyset$, $\forall t \ge 1$, $\forall i \in [m]$.

Figure 4 (c) shows a case where the discontinuity is larger than $z$. In this case, an issue for computing the next weight eventually happens, but only when the current edge is at the discontinuity. We note that as iterations increase and the weak learner eventually finds it more difficult to return weak hypotheses with $\eta_.$ large enough, the discontinuities may become an issue for SECBOOST not to stop at Step 2.5. One can always use a simple trick to avoid stopping, which relies on the leveraging coefficient $\alpha_t$: this is described in Section B.9.

The case of convex losses. If $F$ is convex (not necessarily differentiable nor strictly convex), there is a simple way to find a valid output for the offset oracle, which relies on the following Lemma.

Lemma B.5. Suppose $F$ is convex. Then for any $z \in \mathbb{R}$ and $r > 0$,

$$\left\{v > 0 : Q_F(z, z+v, z+v) \le r\right\} = \left\{v > 0 : D_F\!\left(z+v\,\Big\|\,F'^{-1}\!\left(\frac{F(z+v)-F(z)}{v}\right)\right) \le r\right\} \tag{55}$$

(proof in Section B.8). By definition, $I_{ti}(z') \subseteq I_{ti}(z)$ for any $z' \le z$, so a simple way to implement the offset oracle's output $\mathrm{OO}(t, i, z)$ is, for some $0 < r < z$, to solve the Bregman identity in the RHS of (55) and then return any relevant $v$. If $F$ is strictly convex, there is just one choice. If solving the Bregman identity is tedious but $F$ is strictly convex, there is a simple dichotomic search that is guaranteed to find a feasible $v$. It exploits the fact that the abscissa maximizing the difference between any secant of $F$ and $F$ has a simple closed form (see [21, Supplement, Figure 13]), and so the OBI in (1) (Definition 4.6) has a closed form as well. In this case, it is enough, after taking a first non-zero guess for $v$ (either positive or negative), to divide it by a constant $> 1$ until the corresponding OBI is no larger than the $z$ in the query $\mathrm{OO}(t, i, z)$.

Figure 6: The spring loss in (53) is neither convex, nor Lipschitz, nor differentiable, and has an infinite number of local minima. Yet, an implementation of the offset oracle is trivial, as an output for OO can be obtained from the computation of a single tangency point (the orange $v$, see text; best viewed in color).
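To make the halving recipe for strictly convex losses concrete, here is a minimal sketch with the logistic loss as a stand-in (the helper names, the closed-form inverse of the logistic loss derivative, and the halving constant 2 are ours, for illustration only):

```python
import numpy as np

def logistic(t):
    # F(t) = log(1 + exp(-t)), numerically stable
    return np.maximum(-t, 0.0) + np.log1p(np.exp(-np.abs(t)))

def logistic_prime_inv(s):
    # inverse of F'(t) = -1/(1 + exp(t)); every secant slope s of F is in (-1, 0)
    return np.log((1.0 + s) / (-s))

def obi(F, Fprime_inv, z, v):
    """Closed-form OBI of the secant between (z, F(z)) and (z+v, F(z+v)):
    the largest gap between F(z+v) and the line with the secant's slope s
    touching F at t*, where F'(t*) = s (cf. Figure 7)."""
    s = (F(z + v) - F(z)) / v
    tstar = Fprime_inv(s)
    return F(z + v) - (F(tstar) + s * (z + v - tstar))

def offset_oracle_convex(F, Fprime_inv, z, zquery, v0=1.0, itmax=100):
    # divide a first non-zero guess by a constant > 1 until the OBI <= zquery
    v = v0
    for _ in range(itmax):
        if obi(F, Fprime_inv, z, v) <= zquery:
            return v
        v /= 2.0
    return v

# example: a feasible offset at edge z = 0 for a query threshold of 1e-3
print(offset_oracle_convex(logistic, logistic_prime_inv, 0.0, 1e-3))
```

Since the OBI shrinks to zero with $v$, the loop above always terminates for continuous convex losses.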
Figure 7: Computing the OBI $Q_F(z, z+v, z+v)$ for $F$ convex, $(z, v)$ being given and $v > 0$. We compute the line $\ell(t)$ crossing $F$ at any point $t$, with slope equal to that of the secant $[(z, F(z)), (z+v, F(z+v))]$, and then the difference between $F$ at $z+v$ and this line at $z+v$. We move $t$ so as to maximize this difference. The optimal $t$ (in green) gives the corresponding OBI. In (56) and (58), we are interested in finding $v$ given this difference, $r$. We also need to replicate this computation for $v < 0$.

B.8 Proof of Lemma B.5

$F$ being convex, we first want to compute the set

$$I_{z,v,r} \doteq \{v > 0 : Q_F(z, z+v, z+v) \le r\}, \tag{56}$$

where $r$ is supposed small enough for $I_{z,v,r}$ to be non-empty. There is a simple graphical solution to this which, as Figure 7 explains, consists in finding $v$ solution of

$$\sup_t\left\{F(z+v) - \left(F(t) + \frac{F(z+v)-F(z)}{v}\cdot(z+v-t)\right)\right\} = r. \tag{57}$$

The LHS simplifies: writing $s \doteq \frac{F(z+v)-F(z)}{v}$ for the slope of the secant, the supremum is attained at any $t^\star$ such that $F'(t^\star) = s$, i.e., $t^\star = F'^{-1}(s)$, and

$$\sup_t\left\{F(z+v) - F(t) - s(z+v-t)\right\} = F(z+v) - F(t^\star) - s(z+v-t^\star) = D_F\!\left(z+v\,\|\,t^\star\right),$$

so we end up with an equivalent but more readable definition for $I_{z,v,r}$:

$$I_{z,v,r} = \left\{v > 0 : D_F\!\left(z+v\,\Big\|\,F'^{-1}\!\left(\frac{F(z+v)-F(z)}{v}\right)\right) \le r\right\}, \tag{58}$$

which yields the statement of the Lemma.

B.9 Handling discontinuities in the offset oracle to prevent stopping in Step 2.5

Theorem 5.3 and Lemma 5.6 require to run SECBOOST for as many iterations as required. This implies no early stopping in Step 2.5. Lemma B.4 shows that early stopping can only be triggered by too large local discontinuities at the edges. This is a weak requirement on running SECBOOST, but there exists a weak assumption on the discontinuities of the loss itself that simply prevents any early stopping and does not degrade the boosting rates. The result exploits the freedom in choosing $\alpha_t$ in Step 2.3.

Lemma B.6. Suppose $F$ is any function defined over $\mathbb{R}$ with discontinuities of zero Lebesgue measure. Then Corollary 5.6 holds for boosting $F$ with its inequality strict, while never triggering early stopping in Step 2.5 of SECBOOST.

Proof. To show that we never trigger stopping in Step 2.5, it is sufficient to show that we can run SECBOOST while ensuring $F$ is continuous in an open neighborhood around all edges $y_iH_t(x_i)$, $\forall i \in [m]$, $\forall t \ge 0$ (by letting $H_0 \doteq h_0$). Recall that $\tilde{e}_{ti} \doteq \tilde{e}_{(t-1)i} + \alpha_ty_ih_t(x_i)$, so changing $\alpha_t$ changes all edges. We just have to show that either computing $\alpha_t$ ensures such a continuity, or that $\alpha_t$ can be slightly modified to do so. We have two ways to compute $\alpha_t$:

1. using a value for $W_{2,t}$ that represents an "absolute" upper bound in the sense of (8) (e.g. Lemma 5.7), and then computing $\alpha_t$ as in Step 2.3 of SECBOOST;
2. using algorithm SOLVEα.

Because of the assumption on $F$, we can always ensure that $F$ is continuous in an open neighborhood of all edges (the basis of the induction amounts to a straightforward choice for $h_0$). This proves the Lemma for [2.]. If we rely on [1.] and the $\alpha_t$ computed leads to some discontinuities, then we have complete control to change $\alpha_t$: any continuous change of $\varepsilon_t$ induces a continuous change in $\alpha_t$ and thus a continuous change of all edges as well. So, starting from the initial $\varepsilon_t$ chosen in Step 2.3, we increase it to a value $\varepsilon_t^+ > \varepsilon_t$, which we want to keep as small as possible. We can define for each $i \in [m]$ an open set $(a_i, b_i)$, which is the interval spanned by the new $\tilde{e}_{ti}(\varepsilon'_t)$ using $\varepsilon'_t \in (\varepsilon_t, \varepsilon_t^+)$. Since there are only finitely many discontinuities of $F$, there exists a small $\varepsilon_t^+ > \varepsilon_t$ such that $\forall i \in [m]$, $\forall z \in (a_i, b_i)$, $F$ is continuous at $z$. This means that for any $\varepsilon'_t \in (\varepsilon_t, \varepsilon_t^+)$, we end up with a loss without any discontinuities at the new edges.
Now comes the reason why we want $\varepsilon_t^+ - \varepsilon_t$ small: we can check that there always exists a small enough $\varepsilon_t^+ > \varepsilon_t$ such that for any $\varepsilon'_t$ we choose, the boosting rate in Corollary 5.6 is affected by at most 1 additional iteration. Indeed, while we slightly change parameter $\varepsilon_t$ to land all new edges outside of the discontinuities of $F$, we also increase the contribution of the boosting iteration in the RHS of (15) by a quantity $\delta > 0$ which can be made as small as required; hence we can just replace the inequality in (15) by a strict inequality. This proves the statement of the Lemma if we rely on [1.] above. This completes the proof of Lemma B.6.

B.10 A boosting pattern that can "survive" above differentiability

Suppose $F$ is strictly convex and strictly decreasing, as for classical convex surrogates (e.g. the logistic loss). Assuming wlog all $\alpha_. > 0$ and that example $i$ has both $y_ih_t(x_i) > 0$ and $y_ih_{t+1}(x_i) > 0$, as long as $z$ is small enough, we are guaranteed that any choice $v_{t+1} \in I_{(t+1)i}(z)$ and $v_t \in I_{ti}(z)$ results in $0 < w_{(t+1)i} < w_{ti}$, which follows the classical boosting pattern that examples receiving the right class by weak hypotheses have their weight decreased (see Figure 8). If the limit OBI changes from $z$ to a large enough $z'$, then this does not hold anymore, as seen from Figure 8.

Figure 8: Case $F$ strictly convex, with two cases of limit OBI $z$ and $z'$ in $I_{.i}(.)$. Example $i$ has $\tilde{e}_{ti} > 0$ and $\tilde{e}_{(t+1)i} > 0$ large enough (hence, edges with respect to weak classifiers $h_t$ and $h_{t+1}$ large enough) so that $I_{ti}(z) \cap I_{(t+1)i}(z) = I_{(t+1)i}(z) \cap I_{(t+2)i}(z) = I_{ti}(z) \cap I_{(t+2)i}(z) = \emptyset$. In this case, regardless of the offsets chosen by OO, we are guaranteed that the weights satisfy $w_{(t+2)i} < w_{(t+1)i} < w_{ti}$, which follows the boosting pattern that examples receiving the right classification by weak classifiers have their weights decreasing. If however the limit OBI changes from $z$ to a larger $z'$, this is not guaranteed anymore: it may then be the case that $w_{(t+1)i} > w_{ti}$.

B.11 The case of piecewise constant losses for SOLVEα

Figure 9: How our algorithm works with the 0/1 loss (in red); left panel: initialization, right panel: Step 2.6, iteration $t$. At the initialization stage, assuming we pick $h_0 = 0$ for simplicity and some $v_0 < 0$, all training examples get the same weight, given by negative the slope of the thick blue dashed line. All weights are thus $> 0$. At iteration $t$, when we update the weights (Step 2.6), one of two cases can happen on some training example $(x, y)$. In (A), the edge of the strong model keeps the same sign: either both are positive (blue) or both negative (olive green) (the ordering of edges is not important). In this case, regardless of the offset, the new weight will be 0. In (B), both edges have different signs (again, the ordering of edges is not important). In this case, the example will keep a non-zero weight over the next iteration. See text below for details.

Figure 9 schematizes a run of our algorithm when the training loss is the 0/1 loss. At the initialization, it is easy to get all examples to have a non-zero weight. The weight update for example $(x, y)$ of our algorithm in Step 2.3 is (negative) the slope of a secant that crosses the loss in two points, both being in between $yH_{t-1}(x)$ and $yH_t(x)$. Hence, if the predicted label does not change ($\mathrm{sign}(H_t(x)) = \mathrm{sign}(H_{t-1}(x))$), then the next weight ($w_{t+1}$) of the example will be zero (Figure 9, case (A)). However, if the predicted label does change ($\mathrm{sign}(H_t(x)) \ne \mathrm{sign}(H_{t-1}(x))$), then the example may get a non-zero weight depending on the offset chosen.
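A minimal sketch of this secant-slope weight update (the $[z \le 0]$ convention for the 0/1 loss and the helper names are ours; the update $w = -\delta_vF(\mathrm{edge})$ follows the description above):

```python
def zero_one(z):
    # 0/1 loss on the edge z = y * H(x), with the convention [z <= 0]
    return 1.0 if z <= 0 else 0.0

def next_weight(F, e_new, v):
    # w = -delta_v F(e_new): minus the slope of the secant of F through
    # (e_new, F(e_new)) and (e_new + v, F(e_new + v))
    return -(F(e_new + v) - F(e_new)) / v

# Case (A): previous and new edges share a sign; any offset keeping the
# secant inside [0.8, 1.2] stays on a flat piece of the loss -> weight 0
print(next_weight(zero_one, 0.8, 0.4))   # prints 0.0
# Case (B): edges of different signs (say -1.0 then 0.5); an offset
# crossing the jump at 0 yields a non-zero weight
print(next_weight(zero_one, 0.5, -1.0))  # prints 1.0
```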
Hence, our generic implementation of Algorithms 3 and 4 may completely fail at providing non-zero weights for the next iteration, which makes the algorithm stop in Step 2.7. And even when not all weights are zero, there may be just too small a subset of them, which would break the Weak Learning Assumption for boosting compliance of the next iteration (Assumption 5.5).

C Supplementary material on algorithms, implementation tricks and a toy experiment

C.1 Algorithm and implementation of SOLVEα and how to find parameters from Theorem 5.8

As Theorem 5.8 explains, SOLVEα can easily get to not just the leveraging coefficient $\alpha_t$, but also the other parameters that are necessary to implement SECBOOST: $W_{2,t}$ and $\varepsilon_t$ (both used in Step 2.5). We now provide a simple pseudo-code on how to implement SOLVEα and get, on top of it, the two other parameters. We do not seek $\pi_t$ since it is useful only in the convergence analysis. Also, our proposed implementation is optimized for complexity (because of the geometric updating of $\delta$ and $W$ in their respective loops) but much less so for accuracy. Algorithm SOLVE_extended explains the overall procedure.

Algorithm 3 SOLVE_extended(S, w, h, M)
Input: sample S = {(x_i, y_i), i = 1, 2, ..., m}, w ∈ ℝ^m, h : X → ℝ, M ≠ 0.
// in our case, w ← w_t; h ← h_t; M ← M_t (current weights, weak hypothesis and max confidence, see Step 2.3 in SECBOOST and Assumption 5.1)
Step 1: // all initializations
  η_init ← η(w, h);  (59)
  δ ← 1.0;  (60)
  W_init ← 1.0;  (61)
  found_alpha ← false;
Step 2: // computes the leveraging coefficient α_t
  do
    α ← δ · sign(η_init); η_new ← η(w̃(α), h);
    if |η_new − η_init| < |η_init| then found_alpha ← true else δ ← δ/2;
  while found_alpha = false;
Step 3: // computes W_{2,t}
  W ← left-hand side of (8) (main file); // we can use (8) because we know α
  if W = 0 (machine zero) then // the LHS of (8) is (machine) 0: just need to find W such that (9) holds
    W ← W_init;
    while |α| > |η_init|/(W · M²) do W ← W/2;
  endif
Step 4: // computes ε_t
  b_sup ← |η_init|/(W · M²);
  ε ← (b_sup/|α|) − 1;
Return (α, W, ε);

Figure 10: How to find some $v \in I_{ti}(z)$: parse the interval $[\tilde{e}_{ti}, \tilde{e}_{(t-1)i}]$ with a regular step $\delta$, seek the secant with minimal slope (because $\tilde{e}_{ti} < \tilde{e}_{(t-1)i}$; otherwise, we would seek the secant with maximal slope). It is necessarily the one minimizing the OBI among all regularly spaced choices. If the OBI is still too large, decrease the step $\delta$ and start the search again.

C.2 Algorithm and implementation of the offset oracle

There exists a very simple trick to get some adequate offset $v$ satisfying (11) (main file), explained in Figure 10. In short, we seek the optimally bended secant and check that the OBI is no more than a required $z$. This can be done by parsing the interval $[\tilde{e}_{ti}, \tilde{e}_{(t-1)i}]$ using regularly spaced values. If the OBI is too large, we can start again with a smaller step size. Algorithm OO_simple details the key part of the search.

Algorithm 4 OO_simple(F, e_t, e_{t-1}, z, Z)
Input: loss F, two last edges e_t, e_{t-1}, maximal OBI z, precision Z.
// in our case, e_t ← ẽ_{ti}; e_{t-1} ← ẽ_{(t-1)i} (for training example index i ∈ [m])
Step 1: // all initializations
  δ ← (e_{t-1} − e_t)/Z;  (62)
  z_c ← e_t + δ;  (63)
  i ← 0;  (64)
Step 2:
  do
    s_c ← SLOPE(F, e_t, z_c); // returns the slope of the secant passing through (e_t, F(e_t)) and (z_c, F(z_c))
    if (i = 0) ∨ ((δ > 0) ∧ (s_c < s*)) ∨ ((δ < 0) ∧ (s_c > s*)) then s* ← s_c; z* ← z_c endif
    z_c ← z_c + δ; i ← i + 1;
  while (z_c − e_t)·(z_c − e_{t-1}) < 0; // checks that z_c is still in the interval
Return z* − e_t; // this is the offset v

Figure 11: Crops of the two losses whose optimization has been experimentally tested with SECBOOST, in addition to the logistic loss (left panel: clipped logistic loss, $q = 2$; right panel: spring loss, $Q = 500$). See text for details.

C.3 A toy experiment

We provide here a few toy experiments using SECBOOST. These are just meant to show that a simple implementation of the algorithm, following the blueprints given above, can indeed manage to optimize various losses. They are not meant to explain how to pick the best hyperparameters (e.g. (60)) nor how to choose the best loss given a domain, a problem that is far beyond the scope of our paper. In this implementation, the weak learner learns decision trees, minimizing Matusita's loss at the leaves to learn fixed-size trees; see [33] for the criterion and induction scheme, which is standard for decision trees. SECBOOST is implemented as given in the paper, and so are SOLVEα and the offset oracle provided above. We have made no optimization whatsoever, with one exception: when numerical approximation errors lead to an offset that is machine zero, we replace it by a small random value to prevent the use of derivatives in SECBOOST.

We have investigated three losses. The first is the well known logistic loss:

$$F_{\mathrm{LOG}}(z) \doteq \log(1+\exp(-z)). \tag{65}$$

The other two are tweaks of the logistic loss. We have investigated a clipped version of the logistic loss,

$$F_{\mathrm{CL},q}(z) \doteq \min\{\log(1+\exp(-z)), \log(1+\exp(-q))\}, \tag{66}$$

with $q \in \mathbb{R}$, which clips the logistic loss above a certain value. This loss is non-convex and non-differentiable, but it is Lipschitz. We have also investigated a generalization of the spring loss (main file):

$$F_{\mathrm{SL},Q}(z) \doteq \log(1+\exp(-z)) + 1 - \sqrt{\frac{1}{4} - (z_Q - [z_Q])^2}, \quad \text{with } z_Q \doteq Qz - \frac{1}{2} \tag{67}$$

($[.]$ is the nearest integer), which adds to the logistic loss regularly spaced peaks of variable width. This loss is non-convex, non-differentiable and non-Lipschitz. Figure 11 provides a crop of the clipped logistic loss and the spring loss we have used in our tests. Notice the hardness that the spring loss intuitively represents for ML.

We provide an experiment on the public domain UCI tictactoe [23] (using a 10-fold stratified cross-validation to estimate test errors). In addition to the three losses, we have crossed them with several other variables: the size of the trees (either a single internal node = stumps, or at most 20 nodes) and, to give one example of how changing a (key) hyperparameter can change the result, we have tested a scale of changes on the initial value of $\delta$ in (60). Finally, we have crossed all these variables with the existence of symmetric label noise in the training data, following the setup of [37, 39]: we flip each label in the training sample with probability $\eta$. Figure 12 summarizes the results obtained. One can see that SECBOOST manages to optimize all losses in pretty much all settings, with an eventual early stopping required for the spring loss if $\delta$ is too large.
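For reference, a minimal sketch of the three losses (65)-(67) as we read them from the extracted formulas (in particular, the shift $z_Q = Qz - 1/2$ in (67) is our reconstruction):

```python
import numpy as np

def logistic(z):
    # (65), numerically stable: log(1 + exp(-z)) = max(-z, 0) + log1p(exp(-|z|))
    z = np.asarray(z, dtype=float)
    return np.maximum(-z, 0.0) + np.log1p(np.exp(-np.abs(z)))

def clipped_logistic(z, q=2.0):
    # (66): logistic loss capped at log(1 + exp(-q)); non-convex,
    # non-differentiable at z = q, but Lipschitz
    return np.minimum(logistic(z), np.log1p(np.exp(-q)))

def spring(z, Q=500.0):
    # (67): logistic loss plus semicircular peaks with period 1/Q in z
    zq = Q * np.asarray(z, dtype=float) - 0.5
    frac = zq - np.rint(zq)
    return logistic(z) + 1.0 - np.sqrt(np.maximum(0.25 - frac ** 2, 0.0))
```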
Note that the best initial value for $\delta$ depends on the loss optimized in these experiments: for $\delta = 0.1$, the test error of the spring loss decreases much faster than for the other losses, yet we remind that the spring loss is just the logistic loss plus regularly spaced peaks. This could signal interesting avenues for the best possible implementation of SECBOOST, or for a further understanding of the best formal ways to fix those parameters, all of which are out of the scope of this paper.

Figure 12: Experiments on UCI tictactoe showing estimated test errors after minimizing each of the three losses we consider, with varying training noise level $\eta$ (rows), max tree size (stumps vs. max size = 20) and initial hyperparameter $\delta$ value in (60) ($\delta \in \{0.1, 1.0\}$; columns). See text.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Our paper is a theory paper: all claims are properly formalized and used.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The discussion section is devoted to limitations and improvements of our results.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: Our paper is a theory paper: all assumptions, statements and proofs are provided.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Though our paper is a theory paper, we have included in the supplement a detailed statement of all related algorithms and a toy experiment of a simple implementation of these algorithms, showcasing a simple run on a public UCI domain.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [No]
Justification: Our paper is a theory paper. All algorithms we introduce are either in the main file or the appendix.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [NA]
Justification: Our paper is a theory paper.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [NA]
Justification: Our paper is a theory paper.
8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [NA]
Justification: Our paper is a theory paper.

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: The research of the paper follows the code of ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Our paper is a theory paper.
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: No release of data or models.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: No outside code, data or models used requiring licensing.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: No new assets provided.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: No crowdsourcing or research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: No research with human subjects.