Being Properly Improper

Tyler Sypherd¹, Richard Nock², Lalitha Sankar¹

¹School of Electrical, Computer and Energy Engineering, Arizona State University; ²Google Research. Correspondence to: Tyler Sypherd.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Properness for supervised losses stipulates that the loss function shapes the learning algorithm towards the true posterior of the data-generating distribution. Unfortunately, data in modern machine learning can be corrupted or twisted in many ways. Hence, optimizing a proper loss function on twisted data could perilously lead the learning algorithm towards the twisted posterior, rather than towards the desired clean posterior. Many papers cope with specific twists (e.g., label/feature/adversarial noise), but there is a growing need for a unified and actionable understanding atop properness. Our chief theoretical contribution is a generalization of the properness framework with a notion called twist-properness, which delineates loss functions with the ability to untwist the twisted posterior into the clean posterior. Notably, we show that a nontrivial extension of a loss function called α-loss, which was first introduced in information theory, is twist-proper. We study the twist-proper α-loss under a novel boosting algorithm, called PILBOOST, and provide formal and experimental results for this algorithm. Our overarching practical conclusion is that the twist-proper α-loss outperforms the proper log-loss on several variants of twisted data.

1. Introduction

The loss function is a cornerstone of machine learning (ML). The founding theory of properness for supervised losses stipulates that the loss function shapes the learning algorithm towards the true posterior (Reid & Williamson, 2011). Consequently, a model trained with a proper loss function will try to closely approximate the Bayes rule of the data-generating distribution. Historically, properness draws its roots from classical work in normative economics for class probability estimation (CPE) (Reid & Williamson, 2011; Savage, 1971; Shuford et al., 1966) and Fisher consistency (Fisher, 1922); some of the most famous losses in supervised learning are proper, e.g., log, square, and Matusita (Matusita, 1956), to name a few. Unfortunately, in many modern applications data can be corrupted or twisted in various ways (see Section 2); examples of twists include label noise, adversarial noise, and feature noise. Thus, optimizing a proper loss function on twisted data could perilously lead the learning algorithm towards the Bayes rule of the twisted posterior, rather than towards the desired clean posterior. To ensure that a model trained with a proper loss function on twisted data properly generalizes to the clean distribution, a generalization of properness is clearly required. To this end, we propose the notion of twist-properness. In words, a loss function is twist-proper if and only if (iff), for any twist, there exist hyperparameter(s) of the loss which allow its minimizer to untwist the twisted posterior into the clean posterior. Thus, twist-properness certifies loss functions that allow general posterior corrections, which is analogous to how PAC learning certifies computationally efficient and accurate learning algorithms (Valiant, 1984).
This generalization of properness with twist-properness would be less impactful without a solid contender loss, and we show that a nontrivial extension of α-loss, which itself is an information-theoretic hyperparameterization of the log-loss (Arimoto, 1971; Liao et al., 2018; Sypherd et al., 2019), is twist-proper and exhibits desirable properties for local and global (namely, fixed hyperparameter) twist corrections. Furthermore, twist-properness is not vacuous, as we provide a counterexample: another popular generalization of the log-loss, the focal loss (Lin et al., 2017), which was originally designed to solve a specific twist, i.e., class imbalance, is not twist-proper. In addition, we prove that a loss which acts as a general wrapper of a loss, the SuperLoss (Castells et al., 2020), is also not twist-proper. One of our key takeaways is that twist-properness necessitates a certain nontrivial symmetry of the loss, rather than merely a trivial extension of the hyperparameter(s).

Recently, α-loss was practically implemented in logistic regression and in deep neural networks (Sypherd et al., 2022). In both settings, it was shown to be more robust to symmetric label noise for fixed α > 1 than the proper log-loss (α = 1), thereby providing a hint at the twist-properness of α-loss. In order to complement our theory of twist-properness and these recent results regarding the robustness of α-loss, we also practically implement α-loss in boosting. Boosting is imbued with the computational constraint that strong learning happens from weak updates in polynomial time, thus inducing substantial convergence rates (Kearns & Vazirani, 1994). Furthermore, boosting algorithms are known to suffer under label noise, particularly for convex losses in low-capacity models (Long & Servedio, 2010; Mansour et al., 2022). Thus, boosting presents as an ideal choice to further practically investigate the twist-properness of α-loss. In order to implement α-loss in boosting, a popular route is to invert the canonical link of the loss, which computes the weighting of the examples (Friedman, 2001; Nock & Nielsen, 2008; Nock & Williamson, 2019). While this is feasible for the log-loss (one gets the popular sigmoid function), it turns out to be nontrivial for α-loss. We address this issue by providing the first (to the best of our knowledge) general boosting scheme (called PILBOOST) for any loss which requires only an approximation of the inverse canonical link, depending on a parameter $\zeta \in [0,1]$ (the closer to 0, the better the approximation), and gives boosting-compliant convergence, further meeting the general optimum number of calls to the weak learner. The cost of this approximation is only a factor $O(1/(1-\zeta)^2)$ in the number of iterations. In Section 6, we implement PILBOOST with the approximate inverse canonical link of α-loss on several tabular datasets, each suffering from various twists (label, feature, and adversarial noise), and compare against AdaBoost (Freund & Schapire, 1997) and XGBoost (Chen & Guestrin, 2016). In general, we find improved algorithmic robustness to all twists through simple (fixed) hyperparameter corrections via the α-loss, which aligns with our theoretical contributions (see Section 4).

2. Related Work

Studying data corruption in ML dates back to the 80s (Valiant, 1985). Remarkably, the first twist models assumed very strong corruption, possibly coming from an adversary with unbounded computational resources, but the data at hand was binary.
Thus, because the feature space was as complex as the class space, the twist models lacked the unparalleled data complexity that we now face. Handling such twist models at scale with real-world data has been a major problem in ML over the past decade for a number of reasons. Nevertheless, there have been several streams of recent research aimed at addressing specific twists.

Label noise is a twist which has recently drawn much attention and garnered many corrective attempts (Patrini et al., 2017; Zhang & Sabuncu, 2018; Zhang et al., 2021; Natarajan et al., 2013; Long & Servedio, 2010; Sypherd et al., 2022; Liu & Guo, 2020; Ghosh et al., 2017). Notably, Natarajan et al. (2013) theoretically study the presence of class-conditional noise in binary classification. Their approach consists of augmenting proper loss functions with re-weighting coefficients, which are strictly dependent on the class-conditional noise percentages, and hence requires knowledge of the noise proportions. As a byproduct of their analysis, they show that biased SVM and weighted logistic regression are provably noise-tolerant.

Setting label noise aside, there exists a zoo of other twists and corrective attempts. For instance, data augmentation techniques, with vicinal risk minimization standing as a pioneer (Chapelle et al., 2000), seek to induce general robustness (Zhang et al., 2018). In deep learning, adversarial robustness attempts to address the brittleness of neural networks to targeted adversarial noise (Szegedy et al., 2013; Madry et al., 2018; Andriushchenko & Hein, 2019). Data poisoning twists in computer vision can be very sophisticated and require further investigation (Truong et al., 2020). Invariant risk minimization aims at finding data representations yielding good classifiers that are also invariant to environment changes (Arjovsky et al., 2019); relatedly, covariate shift seeks to address changes between train and test distributions, stemming from non-stationarity or bias in the data (Zhang et al., 2020). A recent trend has also emerged with correcting losses due to model confidence issues (Guo et al., 2017; Mukhoti et al., 2020; Castells et al., 2020).

Viewed more broadly, the above-mentioned papers arguably study quite different problems, but they share a theme that goes substantially deeper than the superficial observation that they assume twisted data in some way: the core loss function is usually a proper loss. Therefore, they tend to start from the premise of a loss that inevitably fits the (unwanted) twist, and correct it mostly with a regularizer informed by some prior knowledge of the twist, on a twist-by-twist basis. There has been some positive work in this loss + regularizer direction (Amid et al., 2019; Ma et al., 2020; Zhang et al., 2021), but we note that this does not fully address the underlying issue of properness shaping the learning algorithm towards the twisted posterior.

Lastly, our generalization of properness with twist-properness is partly inspired by recent work by Charoenphakdee et al. (2021), who theoretically investigate the focal loss. Notably, they show that the focal loss is classification-calibrated, but not strictly proper. From their work, we also gather the implicit notion that hyperparameterized losses that generalize proper losses (e.g., the focal loss or α-loss generalizing the log-loss), which may represent a next step for loss functions in ML, need to be carefully understood from the standpoint of what their hyperparameterization trades off against properness.
3. Losses for Class-Probability Estimation

Our setting is that of losses for class probability estimation (CPE) and our notation follows (Reid & Williamson, 2010; 2011). Given a domain of observations $\mathcal{X}$, we wish to learn a classifier $h$ with $\mathrm{dom}(h) = \mathcal{X}$ that predicts the label $Y \in \mathcal{Y} := \{-1, 1\}$ (we assume two classes or labels) associated with every instance of data drawn from $\mathcal{X}$. Traditionally, there are two kinds of outputs sought: one requires $\mathrm{Im}(h) = [0,1]$, in which case $h$ provides an estimate of $\mathbb{P}[Y = 1 \mid X]$, which is often called the Bayes posterior. This is the framework of class probability estimation. The other kind of output requires $\mathrm{Im}(h) = \mathbb{R}$, but is usually completed by a mapping to $[0,1]$, e.g., via the softmax in deep learning. A loss for class probability estimation, $\ell : \mathcal{Y} \times [0,1] \to \mathbb{R}$, has the general definition

$$\ell(y, u) := \llbracket y = 1 \rrbracket \, \ell_1(u) + \llbracket y = -1 \rrbracket \, \ell_{-1}(u), \qquad (1)$$

where $\llbracket \cdot \rrbracket$ is Iverson's bracket (Knuth, 1992). The functions $\ell_1, \ell_{-1}$ are called partial losses, minimally assumed to satisfy $\mathrm{dom}(\ell_1) = \mathrm{dom}(\ell_{-1}) = [0,1]$ and $|\ell_1(u)|, |\ell_{-1}(u)| < \infty$, $\forall u \in (0,1)$, to be useful for ML. Key additional properties of partial losses are:

(M) Monotonicity: $\ell_1, \ell_{-1}$ are non-increasing and non-decreasing, respectively;
(D) Differentiability: $\ell_1$ and $\ell_{-1}$ are differentiable;
(S) Symmetry: $\ell_1(u) = \ell_{-1}(1-u)$, $\forall u \in [0,1]$.

Commonly used proper losses such as log, square, and Matusita all satisfy the above three assumptions. The pointwise conditional risk of the local guess $u \in [0,1]$ with respect to a ground truth $v \in [0,1]$ is

$$L(u, v) := \mathbb{E}_{Y \sim B(v)}[\ell(Y, u)] = v \, \ell_1(u) + (1-v) \, \ell_{-1}(u), \qquad (2)$$

where $B(v)$ denotes the Bernoulli distribution over $\mathcal{Y}$ with $\mathbb{P}[Y = 1] = v$.

Properness. The pointwise conditional risk $L(u,v)$ is the fundamental quantity that allows us to distinguish proper losses: a loss is proper iff for any ground truth $v \in [0,1]$, $L(v,v) = \inf_u L(u,v)$, and strictly proper iff $u = v$ is the sole minimizer (Reid & Williamson, 2011). The (pointwise) Bayes risk is $\underline{L}(v) := \inf_u L(u,v)$.

Surrogate loss. Oftentimes, minimization occurs over the reals (e.g., boosting), hence it is useful to employ a surrogate to the 0-1 loss (Bartlett et al., 2006). Nock & Nielsen (2008) showed that the outputs in $[0,1]$ and $\mathbb{R}$ can be related via convex duality of the losses. Let $g^\star(z) := \sup_t \{zt - g(t)\}$ denote the convex conjugate of $g$ (Boyd & Vandenberghe, 2004). The surrogate $F$ of $\underline{L}$ is thus given by

$$F(z) := (-\underline{L})^\star(-z), \qquad \forall z \in \mathbb{R}. \qquad (3)$$

For example, picking the log-loss as $\ell$ gives the binary entropy for $\underline{L}$ and the logistic loss for $F$ (see Appendix A.1 for a derivation). Convex duality implies that predictions in $[0,1]$ and $\mathbb{R}$ are related via the (canonical) link of the loss, $(-\underline{L})'$ (Nock & Williamson, 2019), where we use the notation $f'$ to denote the derivative of a function $f$ with respect to its argument. In the sequel, we will see that boosting requires inverting the link of the loss, which we show is nontrivial for hyperparameterized losses such as α-loss. Lastly, we provide summary properties of a CPE loss (not necessarily proper) and its surrogate, monotonicity being of primary importance. Some parts of the following lemma are known in the literature (e.g., concavity in Agarwal (2014, Lemma 1)), or are folklore.

Lemma 1. For any CPE loss $\ell$, $\underline{L}$ is concave and continuous; $F$ is convex, continuous, and non-increasing.

A proof is provided for completeness in Appendix A.2.
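To make the CPE machinery above concrete, the following minimal Python sketch (our illustration, not part of the paper) instantiates the partial losses of the log-loss, evaluates the pointwise conditional risk (2), and checks numerically that the minimizer is $u = v$ (properness), with Bayes risk equal to the binary entropy, as derived in Appendix A.1.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Partial losses of the log-loss, a proper CPE loss.
ell_1  = lambda u: -np.log(u)        # loss when y = +1
ell_m1 = lambda u: -np.log(1.0 - u)  # loss when y = -1

# Pointwise conditional risk, Eq. (2): L(u, v) = v*l_1(u) + (1-v)*l_{-1}(u).
L = lambda u, v: v * ell_1(u) + (1.0 - v) * ell_m1(u)

for v in [0.1, 0.3, 0.75]:
    res = minimize_scalar(lambda u: L(u, v), bounds=(1e-9, 1 - 1e-9),
                          method="bounded")
    H_b = -v * np.log(v) - (1 - v) * np.log(1 - v)   # binary entropy
    print(v, res.x, res.fun, H_b)  # minimizer ~= v; Bayes risk ~= H_b(v)
```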
4. Twist-Proper Losses

With the classical setting of properness in hand, we now provide fundamental definitions of twists and twist-properness, and study the twist-properness of several hyperparameterized loss functions. When it comes to correcting (or untwisting) twists, one needs a loss with the property that its minimizer in (2) is different from the now twisted value $\tilde{v}$ and recovers the hidden ground truth $v$.

Bayes tilted estimates. We first characterize the minimizers of (2) when the CPE loss is not necessarily proper. We define the set-valued (pointwise) Bayes tilted estimate $t_\ell$ as

$$t_\ell(\tilde{v}) := \arg\inf_{u \in [0,1]} L(u, \tilde{v}). \qquad (4)$$

Ideally, we would like $v \in t_\ell(\tilde{v})$, i.e., the Bayes tilted estimate $t_\ell(\tilde{v})$ untwists (with hyperparameter(s)) the twisted value $\tilde{v}$ and recovers the ground truth $v$. However, it follows that if the loss is proper, $\tilde{v} \in t_\ell(\tilde{v})$ and, if strictly proper, $t_\ell(\tilde{v}) = \{\tilde{v}\}$. This formally highlights the limitation of proper loss functions in twisted settings, namely, the inability of a proper loss to untwist the twisted value, because the minimization of the loss is centered on what it perceives to be the ground truth. The following result stipulates the cardinality of Bayes tilted estimates.

Lemma 2. If the partial losses $\ell_1$ and $\ell_{-1}$ of a given CPE loss $\ell : \mathcal{Y} \times [0,1] \to \mathbb{R}$ satisfy (M), (D), and (S) and are also strictly convex, then $|t_\ell(\tilde{v})| = 1$ for every $\tilde{v} \in [0,1]$.

In Appendix A.3, we provide an extended version of Lemma 2, denoted Lemma 12, where we prove properties of Bayes tilted estimates for when $t_\ell$ is set-valued (e.g., set-valued monotonicity and symmetry, and analysis of extreme values). As a consequence, we show that strict monotonicity of the partial losses is not sufficient to guarantee that $t_\ell$ is a singleton; in fact, strict convexity is required, as in Lemma 2.

An important class of twists. We now adopt more conventional ML notation and, instead of a hidden ground truth $v$ and twisted ground truth $\tilde{v}$, we use $\eta_c$ and $\eta_t$ to denote the clean and twisted posterior probabilities that $Y = 1$ given $X = x$, respectively. Further, a twist refers to a general mapping $\eta_c \mapsto \eta_t$, which could be a consequence of label/feature/adversarial noise. The following delineates a fundamental class of twists, important in the sequel.

Definition 3. A twist $\eta_c \mapsto \eta_t$ is said to be Bayes blunting iff $(\eta_c \leq \eta_t \leq 1/2) \vee (\eta_c \geq \eta_t \geq 1/2)$.

The term blunting is inspired by adversarial training (Cranko et al., 2019). Intuitively, a Bayes blunting twist does not change the maximum a posteriori guess for the label given the observation, but it does reduce algorithmic confidence in learned posterior estimates, which is particularly damaging in practice, where the learning algorithm only has a finite number of (twisted) training examples. Furthermore, Bayes blunting twists capture a very important twist (see Section 2): symmetric label noise (SLN). Under symmetric label noise with flip probability $p \in [0,1]$, the twisted posterior $\eta_t$ is given by $\eta_t = \eta_c(1-p) + (1-\eta_c)p$ (Reid & Williamson, 2010). The following result readily follows via Definition 3 from consideration of $p$ for fixed $\eta_c$ (a numerical check is sketched below).

Lemma 4. SLN is Bayes blunting for $p < 1/2$.

Historically, Reid & Williamson (2010) showed that proper loss functions are not robust to this twist, which further motivates our consideration of twist-proper losses.
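As a quick numerical sanity check of Lemma 4 (our illustration; the quantities are exactly those defined above), the sketch below draws random clean posteriors and flip probabilities $p < 1/2$ and verifies that the SLN twist satisfies Definition 3 in every case.

```python
import numpy as np

def sln_twist(eta_c, p):
    """Twisted posterior under symmetric label noise with flip probability p."""
    return eta_c * (1.0 - p) + (1.0 - eta_c) * p

def is_bayes_blunting(eta_c, eta_t):
    """Definition 3: (eta_c <= eta_t <= 1/2) or (eta_c >= eta_t >= 1/2)."""
    return (eta_c <= eta_t <= 0.5) or (eta_c >= eta_t >= 0.5)

rng = np.random.default_rng(0)
for _ in range(10_000):
    eta_c = rng.uniform(0.0, 1.0)
    p = rng.uniform(0.0, 0.5)                 # p < 1/2
    assert is_bayes_blunting(eta_c, sln_twist(eta_c, p))  # Lemma 4
print("SLN with p < 1/2 was Bayes blunting on all sampled cases.")
```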
Twist-proper losses. To overcome these limitations of properness, we propose a generalized notion, called twist-properness, which utilizes hyperparameterization of the loss to untwist twisted posteriors into clean posteriors.

Definition 5. A loss $\ell$ is twist-proper (respectively, strictly twist-proper) iff for any twist, there exist hyperparameter(s) such that $\eta_c \in t_\ell(\eta_t)$ (respectively, $\{\eta_c\} = t_\ell(\eta_t)$).

Where a proper loss could perilously lead the learning algorithm to estimate $\eta_t$, a twist-proper loss employs hyperparameters so that its Bayes tilted estimate recovers $\eta_c$, hence guiding the algorithm to untwist the twisted posterior. We emphasize the need for hyperparameters, as otherwise twist-properness would trivially enforce $t_\ell(\cdot) = [0,1]$. Recently, hyperparameterized loss functions have garnered much interest in ML, to name a few (Barron, 2019; Lin et al., 2017; Amid et al., 2019; Li et al., 2021; Sypherd et al., 2022), possibly because such losses allow practitioners to induce variegated models. Indeed, hyperparameterized loss functions could be efficiently implemented via meta-algorithms, such as AutoML (He et al., 2021), or practically utilized in the burgeoning field of federated learning (Kairouz et al., 2019), where the hyperparameter(s) might yield more fine-grained ML model customization for edge devices. Ostensibly, obtaining optimal hyperparameters requires explicit knowledge of the distribution and twist, and each example in the training set requires a different hyperparameter to untwist its twisted posterior. However, in the sequel, we show that a twist-proper loss, namely α-loss, with a fixed hyperparameter α can untwist a large class of twists, i.e., Bayes blunting twists (such as SLN), better than the log-loss. Thus, we posit through our experimental results in Section 6 that the practitioner only needs peripheral, rather than explicit, knowledge of a Bayes blunting twist in the data.

Twist-(im)proper losses. Lin et al. (2017) introduced the focal loss to improve class imbalance issues associated with dense object detection. It generalizes the log-loss and has become popular due to its success in such domains. Recently, the focal loss has received increased scrutiny (Charoenphakdee et al., 2021), where it was shown to be classification-calibrated but not strictly proper. Here, we determine the twist-properness of the focal loss.

Lemma 6. Define the focal loss via the following partial losses: $\ell^{\mathrm{FL}}_1(u) := -(1-u)^\gamma \log u$ and $\ell^{\mathrm{FL}}_{-1}(u) := \ell^{\mathrm{FL}}_1(1-u)$, with $\gamma \geq 0$. Then the focal loss is not twist-proper.

In the proof (see Appendix A.4), we also show that a loss which acts as a general wrapper of a loss, the SuperLoss (Castells et al., 2020), is not twist-proper. Concerning the focal loss, Lemma 6 is not necessarily an impediment for this loss function, which was designed to deal with a specific twist, class imbalance, and it does not prevent generalizations of the focal loss that would be twist-proper. However, our proof suggests that the Bayes tilted estimate (4) of such generalizations risks not being in a simple analytical form. Intuitively, twist-properness requires more than a trivial extension of the hyperparameter of the loss; it also seems to require a certain symmetry, which we observe with the following twist-proper loss, the α-loss.

A twist-proper loss. The α-loss was first introduced in information theory in the early 70s (Arimoto, 1971) for $\alpha \in \mathbb{R}_+$ and recently received increased scrutiny in privacy and ML (Liao et al., 2018; Sypherd et al., 2019) for $\alpha \geq 1$. Most recently, Sypherd et al. (2022) studied the calibration, optimization, and generalization characteristics of α-loss in ML for $\alpha \in \mathbb{R}_+$. In particular, they experimentally found that α-loss is robust to noisy labels under logistic regression and convolutional neural networks for $\alpha > 1$. We now provide our (extended) definition of the α-loss in CPE.
Definition 7. For $\alpha \geq 0$, the α-loss has the following partial losses: $\forall u \in [0,1]$, $\ell^\alpha_{-1}(u) := \ell^\alpha_1(1-u)$, where

$$\ell^\alpha_1(u) := \frac{\alpha}{\alpha - 1}\left(1 - u^{\frac{\alpha-1}{\alpha}}\right), \qquad (5)$$

and by continuity we have $\ell^0_1(u) := \infty$, $\ell^1_1(u) := -\log u$, and $\ell^\infty_1(u) := 1 - u$. For $\alpha < 0$, we let, $\forall u \in [0,1]$,

$$\ell^\alpha_1(u) := \ell^{-\alpha}_{-1}(u) = \ell^{-\alpha}_1(1-u). \qquad (6)$$

For a plot of (5), see Figure 4 in the appendix. Note that the α-loss is (S)ymmetric by construction, and that it continuously interpolates the log-loss (α = 1), which is proper (Reid & Williamson, 2010). Our definition extends the previous definitions with (6), which induces a fundamental symmetry that is required for twist-properness and is utilized in the following result. For any $u \in [0,1]$, we let $\iota(u) := \log(u/(1-u))$ denote the logit of $u$.

Lemma 8. The following four properties, labeled (a)-(d), all hold for α-loss:
(a) (M), (D), (S) all hold, $\forall \alpha \in \mathbb{R} \setminus \{0\}$;
(b) if $(\alpha = 0) \vee (\alpha = \pm\infty \wedge \eta_t = 1/2)$, then $t_{\ell^\alpha}(\eta_t) = [0,1]$; if $\alpha \in \mathbb{R} \setminus \{0, \pm\infty\}$, then $t_{\ell^\alpha}(\eta_t) = \left\{ \frac{\eta_t^\alpha}{\eta_t^\alpha + (1-\eta_t)^\alpha} \right\}$; and if $\alpha = \pm\infty$, then $t_{\ell^{\pm\infty}}(\eta_t) = 1$ or $0$, depending on the sign of $\eta_t - 1/2$;
(c) hence, α-loss is twist-proper with $\alpha = \iota(\eta_c)/\iota(\eta_t)$;
(d) for any Bayes blunting twist, $\alpha \geq 1$.

The proof of Lemma 8 can be found in Appendix A.5. Note that (a) readily follows from Definition 7. The Bayes tilted estimate in (b), i.e., $t_{\ell^\alpha}$ for $\alpha \in \mathbb{R} \setminus \{0, \pm\infty\}$, is known in the literature as the α-tilted distribution (Arimoto, 1971; Liao et al., 2018; Sypherd et al., 2022). We observe that the α-tilted distribution is symmetric upon permuting $(\eta_t, \alpha)$ and $(1 - \eta_t, -\alpha)$. Hence, our nontrivial extension of the α-loss induces a symmetry, particularly useful for untwisting malevolent twists, which thereby yields twist-properness, (c). Lastly, (d) indicates that $\alpha \geq 1$ for any Bayes blunting twist (e.g., SLN with $p < 1/2$); however, note that this holds merely for a given $x$, not over the whole domain $\mathcal{X}$.
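To illustrate Lemma 8(b)-(d) (our sketch, not from the paper), the snippet below applies the α-Bayes tilted estimate to an SLN-twisted posterior and verifies that the choice $\alpha = \iota(\eta_c)/\iota(\eta_t)$ of Lemma 8(c), which exceeds 1 for this Bayes blunting twist, exactly recovers $\eta_c$.

```python
import numpy as np

def logit(u):
    return np.log(u / (1.0 - u))

def tilted(eta_t, alpha):
    """Bayes tilted estimate of alpha-loss, Lemma 8(b):
    eta_t^alpha / (eta_t^alpha + (1 - eta_t)^alpha)."""
    return eta_t**alpha / (eta_t**alpha + (1.0 - eta_t)**alpha)

# SLN twist: eta_c = 0.9 is blunted towards 1/2.
eta_c, p = 0.9, 0.3
eta_t = eta_c * (1 - p) + (1 - eta_c) * p    # = 0.66

alpha = logit(eta_c) / logit(eta_t)          # Lemma 8(c); > 1 since blunting
print(alpha)                                 # ~3.31
print(tilted(eta_t, alpha), eta_c)           # both 0.9: the twist is undone
```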
Untwisting over the whole domain $\mathcal{X}$. Just as classification-calibration is a pointwise form of consistency (Bartlett et al., 2006), twist-properness is a pointwise form of correction. Extending twist-properness to the entire domain $\mathcal{X}$ seems to require learning a mapping $\alpha : \mathcal{X} \to [-\infty, \infty]$, which is infeasible under standard ML assumptions, since one would need explicit knowledge of the distribution and twist. Nevertheless, we show here that for a large class of twists, namely Bayes blunting twists, a fixed $\alpha_0 > 1$, obtained nonconstructively, is strictly better than the proper choice, the log-loss ($\alpha = 1$). We also provide a general constructive formula for a fixed $\alpha \in \mathbb{R}$, calculated from distributional and twist information. In order to represent population quantities, we assume a marginal distribution $M$ over $\mathcal{X}$ (following the notation of Reid & Williamson (2011)), from which the expected value of a loss $\ell$ quantifies the true risk of a given classifier $h$. With a slight abuse of notation, we also let $\eta_c, \eta_t : \mathcal{X} \to [0,1]$ denote the clean and twisted posterior mappings, respectively. To evaluate the efficacy of the Bayes tilted estimate of α-loss at untwisting the twisted posterior mapping and recovering the clean posterior mapping, we define the following averaged cross-entropy, given by

$$\mathrm{CE}(\eta_c, \eta_t; \alpha) := \mathbb{E}_{X \sim M}\left[-\eta_c(X) \log t_{\ell^\alpha}(\eta_t(X)) - (1 - \eta_c(X)) \log t_{\ell^\alpha}(1 - \eta_t(X))\right], \qquad (7)$$

where for convenience we used the symmetry property of $t_\ell$ from Appendix A.3, i.e., $t_{\ell^\alpha}(1 - \eta_t(X)) = 1 - t_{\ell^\alpha}(\eta_t(X))$. Following (Schapire & Freund, 2012), we denote the binary entropy as $H_b(u) := -u \log u - (1-u)\log(1-u)$, for $u \in [0,1]$. We let $H(\eta_c)$ represent an averaged binary entropy of the $\eta_c$-mapping, given by

$$H(\eta_c) := \mathbb{E}_{X \sim M}[H_b(\eta_c(X))]. \qquad (8)$$

With (7) and (8), we obtain (cf. Thomas & Joy (2006))

$$D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha) := \mathrm{CE}(\eta_c, \eta_t; \alpha) - H(\eta_c), \qquad (9)$$

that is, a KL divergence between the α-Bayes tilted estimate of the twisted posterior and the clean posterior mappings. Intuitively, $D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha)$ aggregates a series of information-trajectories, strictly dependent on α (either fixed or a mapping), each tracing a path on the probability simplex between the two posterior mappings for every $x \in \mathcal{X}$. Slightly more restrictive than Definition 3, we define a strictly Bayes blunting twist as a Bayes blunting twist where $(\eta_c < \eta_t \leq 1/2) \vee (\eta_c > \eta_t \geq 1/2)$; we state one of our main results, whose proof is in Appendix A.6.

Theorem 9. For any strictly Bayes blunting twist $\eta_c \mapsto \eta_t$, there exists a fixed $\alpha_0 > 1$ and an optimal $\alpha^\star$-mapping, $\alpha^\star : \mathcal{X} \to \mathbb{R}_{>1}$, which induce the following ordering:

$$D_{\mathrm{KL}}(\eta_c, \eta_t; 1) > D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha_0) \geq D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha^\star). \qquad (10)$$

This result answers in the affirmative that untwisting $\mathcal{X}$ for a large class of twists with a fixed hyperparameter $\alpha_0 > 1$ is strictly better than simply using the proper choice, i.e., $\alpha = 1$ (log-loss). Specifically, Theorem 9 holds for SLN, which is a strictly Bayes blunting twist for flip probability $0 < p < 1/2$ (Lemma 4). The result also states that there exists a mapping $\alpha^\star : \mathcal{X} \to \mathbb{R}_{>1}$ which optimally untwists the strictly Bayes blunting twist; indeed, $\alpha^\star$ can be recovered from Lemma 8(c), i.e., $\alpha^\star(x) := \iota(\eta_c(x))/\iota(\eta_t(x))$, for every $x \in \mathcal{X}$. Thus, by the twist-properness of α-loss, $D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha^\star) = 0$ (more details in Appendix A.6). Regarding the search for a fixed $\alpha_0 > 1$ in practice, Sypherd et al. (2022) showed via optimization landscape analysis and experiments on SLN for logistic regression and neural networks that the search space for $\alpha_0$ is bounded (due to saturation), typically $\alpha_0 \in [1.1, 8]$. In Section 6, we report experimental results for several α; we also incorporate a method inspired by (Menon et al., 2015) to estimate the amount of SLN in training data and thus estimate $\alpha_0$ using Lemma 8(c), as motivated by Theorem 9.

Theorem 9 gave a nonconstructive indication of the optimal regime of α for strictly Bayes blunting twists. Our next result gives a constructive formula for a fixed α for any twist. Given $B > 0$, let $M(B)$ denote the distribution restricted to the support over $\mathcal{X}$ for which we have, almost surely,

$$(1 + \exp(B))^{-1} \leq \eta_t(x) \leq (1 + \exp(-B))^{-1}, \qquad (11)$$

and let $p(B) \in [0,1]$ be the weight of this support in $M$. We let $D(B)$ denote the product distribution on examples ($\mathcal{X} \times \mathcal{Y}$) induced by the marginal $M(B)$ and posterior $\eta_c$ (see Reid & Williamson (2011, Section 4)). We define the logit-edge as

$$e := (1/B) \, \mathbb{E}_{(X,Y) \sim D(B)}\left[Y \, \iota(\eta_t(X))\right], \qquad (12)$$

where we note that $e \in [-1, 1]$ due to the assumption in (11). Finally, we let $q := (1 + e)/2 \in [0,1]$.

Theorem 10. Let $B > 0$. If $p(B) = 1$ and we fix $\alpha = \alpha^\dagger$ with $\alpha^\dagger := \iota(q)/B$, then the following bound holds:

$$D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha^\dagger) \leq H_b(q) - H(\eta_c). \qquad (13)$$

The proof of Theorem 10 is in Appendix A.7, where we also prove an extended version of the result for when $p(B) < 1$. In addition, we provide a simple example where $D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha^\dagger)$ can vanish with respect to $D_{\mathrm{KL}}(\eta_c, \eta_t; 1)$ (the proper choice). Intuitively, the difference on the right-hand side of (13) in Theorem 10 is reminiscent of a Jensen's gap. Also in the proof of Theorem 10, we find that if $|\alpha^\dagger|$ is large, there is more flatness in the bounded terms near $\alpha^\dagger$. Hence, this suggests that a choice of $\alpha_0$ close enough to $\alpha^\dagger$ could yield similar performance.
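The following toy computation (ours; the marginal $M$ is a synthetic uniform draw, an assumption purely for illustration) estimates $D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha)$ of (9) under SLN for several fixed α and exhibits the ordering of Theorem 9.

```python
import numpy as np

rng = np.random.default_rng(1)
eta_c = rng.uniform(0.05, 0.95, size=100_000)  # toy clean posterior over X
p = 0.2                                        # SLN flip probability
eta_t = eta_c * (1 - p) + (1 - eta_c) * p      # strictly Bayes blunting twist

def tilted(eta, alpha):
    return eta**alpha / (eta**alpha + (1 - eta)**alpha)

def dkl(alpha):
    t = tilted(eta_t, alpha)
    ce = np.mean(-eta_c * np.log(t) - (1 - eta_c) * np.log(1 - t))  # Eq. (7)
    h = np.mean(-eta_c * np.log(eta_c) - (1 - eta_c) * np.log(1 - eta_c))
    return ce - h                              # Eq. (9)

for alpha in [1.0, 1.5, 2.0, 3.0]:
    print(alpha, dkl(alpha))
# Some fixed alpha_0 > 1 attains strictly lower D_KL than the proper
# choice alpha = 1, as Theorem 9 guarantees.
```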
5. Sideways Boosting a Surrogate Loss

With the theory of twist-properness and the twist-proper α-loss in hand, we now turn towards the algorithmic contribution of this work. As stated in the introduction, α-loss was recently implemented in logistic regression and in deep neural networks (Sypherd et al., 2022), and was found to be more robust to symmetric label noise for fixed α > 1 than the proper log-loss (α = 1). Thus, in order to complement our theory of twist-properness and these recent results on α-loss, we also practically implement α-loss in boosting. Formally, we have a training sample $S := \{(x_i, y_i) : i \in [m]\} \subset \mathcal{X} \times \mathcal{Y}$ of $m$ examples, where $[m] := \{1, 2, \ldots, m\}$, and we note that $\mathcal{Y} = \{-1, +1\}$. We write $i \sim S$ to indicate sampling example $(x_i, y_i)$ according to $S$. Following (Schapire & Singer, 1999; Collins et al., 2000; Nock & Nielsen, 2008), the boosting algorithm minimizes an expected surrogate loss with respect to $S$ in order to learn a real-valued classifier $H_\beta : \mathcal{X} \to \mathbb{R}$ given by

$$H_\beta := \sum_j \beta_j h_j, \qquad (14)$$

where the $h_j : \mathcal{X} \to \mathbb{R}$ are WL (weak learning) classifiers with slightly better than random classification accuracy. The oracle WL returns an index $j \in \mathbb{N}$, and the task for the boosting algorithm is to learn the coordinates of $\beta$, initialized to the null vector. In our general framework, the losses we consider are the surrogates $F$ of Lemma 1, essentially convex and non-increasing functions, with the added condition that they are twice differentiable. We compute weights using the blueprint of (Friedman, 2001), which uses the full $H_\beta$:

$$w_i := -F'(y_i H_\beta(x_i)), \qquad i \in [m]. \qquad (15)$$

Algorithm 1 PILBOOST
Input: sample $S$, number of iterations $T$, $a_f > 0$, PIL $\tilde{e}_f$;
Step 1: let $\beta \leftarrow 0$; // first classifier, $H_0 = 0$
Step 2: for $t = 1, 2, \ldots, T$:
  Step 2.1: let $w_i \leftarrow \tilde{e}_f(-y_i H_\beta(x_i))$, $\forall i \in [m]$
  Step 2.2: let $j \leftarrow \mathrm{WL}(S, w)$
  Step 2.3: let $e_j \leftarrow (1/m) \sum_i w_i y_i h_j(x_i)$
  Step 2.4: let $\beta_j \leftarrow \beta_j + a_f e_j$
Output: $H_\beta$.

Via Lemma 1, the weights $w_i$ are non-negative and tend to increase for an example given the wrong class by the current weak classifier $h_j$; thus, the weighting puts emphasis on hard examples. For an underlying CPE loss $\ell$, we have (see Appendix A.8 for a derivation)

$$-F'(z) = \left(\ell_{-1} \circ t_\ell - \ell_1 \circ t_\ell\right)^{-1}(-z). \qquad (16)$$

We thus need to invert the difference of the partial losses to recover $F'$. The inversion is easy for the log-loss because of properties of the log function, and for the square loss because its partial losses are quadratic functions. However, for hyperparameterized losses, such as the α-loss, the inversion in (16) is nontrivial. We circumvent this difficulty by proposing a novel boosting algorithm, PILBOOST, given in Algorithm 1. Rather than using $F'$ as in (15) for the weight update in Step 2.1, PILBOOST uses an approximation function $\tilde{e}_f$, which is non-negative and increasing, that we dub the pseudo-inverse link (PIL), and which is studied in general in Appendix A.8. Specifically, in Lemma 18, we provide $\tilde{e}_f$ for α-loss, given in (160). Furthermore, in Lemma 19, we show that there exists $K > 0$ such that, for almost all $z \in \mathbb{R}$, $\left|\left(\tilde{e}_f - \left((-\underline{L})'\right)^{-1}\right)(z)\right| \leq K/\alpha$.
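For concreteness, here is a compact Python sketch of Algorithm 1 (our reading of it, with the usual greedy convention that each call to WL may return a fresh hypothesis; `weak_learner` and the exact PIL are assumptions of the sketch, since the paper's $\tilde{e}_f$ for α-loss is given in (160) of the appendix).

```python
import numpy as np

def pilboost(X, y, T, a_f, pil, weak_learner):
    """Sketch of Algorithm 1 (PILBOOST). y in {-1, +1}; `pil` is the
    pseudo-inverse link, a non-negative increasing function applied to
    the negative margin; `weak_learner(X, y, w)` returns h: X -> R."""
    m = len(y)
    hs, betas = [], []
    H = np.zeros(m)                            # Step 1: H_0 = 0
    for _ in range(T):                         # Step 2
        w = pil(-y * H)                        # Step 2.1: weights
        h = weak_learner(X, y, w)              # Step 2.2: oracle call
        e = np.mean(w * y * h(X))              # Step 2.3: (unnormalized) edge
        hs.append(h); betas.append(a_f * e)    # Step 2.4: leveraging update
        H += a_f * e * h(X)
    return lambda Xn: sum(b * h(Xn) for b, h in zip(betas, hs))

# Example PIL: with the log-loss, the inverse canonical link is the
# sigmoid, so this choice recovers the classical logistic weighting.
sigmoid_pil = lambda z: 1.0 / (1.0 + np.exp(-z))
```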
We now theoretically analyze PILBOOST, making two classical assumptions on WL (Schapire & Singer, 1999; Nock & Williamson, 2019).

Assumption 1 (R). The weak classifiers have bounded range: $\exists M > 0$ such that $|h_j(x_i)| \leq M$, $\forall i, j$.

Let $\tilde{e}_j := m e_j / (\mathbf{1}^\top w) \in [-M, M]$ be the normalized edge of the $j$-th weak classifier, where, with a slight abuse of the notation of (12), $e_j$ is the (unnormalized) edge (Step 2.3).

Assumption 2 (WLA). The weak classifiers are not random: $\exists \gamma > 0$ such that $|\tilde{e}_j| \geq \gamma M$, $\forall j$.

Note that WLA denotes the Weak Learning Assumption, which is a pillar of boosting theory (cf. (Freund et al., 1999)). Since we employ $\tilde{e}_f$ instead of $F'$ in PILBOOST, we need two more functional assumptions, on the first- and second-order derivatives of $F$. The edge discrepancy of a function $F$ on weak classifier $h_j$ at iteration $t$ is given by

$$\Delta_j(F) := \left| \mathbb{E}_{i \sim S}\left[-y_i h_j(x_i)\, F'(y_i H_\beta(x_i))\right] - e_j \right|, \qquad (17)$$

which is the absolute difference between the edge using (the derivative of) $F$ and the edge using PILBOOST's $\tilde{e}_f$ (implicit in $e_j$).

Figure 1: Box-and-whisker plots reporting the classification accuracy of AdaBoost, PILBOOST (for α ∈ {1.1, 2, 4}), and XGBoost on the cancer dataset affected by the class noise twister with 0%, 15%, and 30% twist. The orange line is the median, the green triangle is the mean, the box is the interquartile range, and the circles outside the whiskers are outliers. All three algorithms were trained with decision stumps (depth-1 regression trees). For α = 1.1, 2, and 4, we set $a_f$ = 7, 2, and 4, respectively. Numeric values corresponding to the box-and-whisker plots are provided in Table 2 in Section B.3. We find that PILBOOST has gains over AdaBoost and XGBoost when there is twist present, and that α (within our set) increases as the amount of twist increases, which follows theoretical intuition (Lemma 8).

Assumption 3 (E, C). $\exists \zeta, \pi \in [0,1)$ such that: (E) the edge discrepancy is bounded at each iteration $t$: $\Delta_j(F) \leq \zeta |e_j|$, where $j$ is returned by WL at iteration $t$; (C) the curvature of $F$ is bounded: $\overline{F''} := \sup_z F''(z) \leq (1-\zeta)(1+\pi)/(a_f M^2)$.

Note that (C) is quite mild for specific sets of functions; e.g., proper canonical losses are Lipschitz (Reid & Williamson, 2010), so (C) can in general be ensured by a simple renormalization of the loss. On the other hand, (E) can become progressively harder to ensure as the number of iterations increases, because the choices of the WL become restricted; nevertheless, it is not prohibitive in practice, as our experiments in Section 6 suggest (also see the remark in Appendix A.10 for further commentary on this assumption). Let $\overline{w}_t := \mathbf{1}^\top w_t$, the total weight at iteration $t$ in PILBOOST.

Theorem 11. Suppose (R, WLA) hold on WL and (E, C) hold on $F$, for each iteration of PILBOOST. Denote $Q(F) := 2\overline{F''}/(\gamma^2 (1-\zeta)^2 (1-\pi^2))$. The following holds:
- On the risk defined by $F$: $\forall z^\star \in \mathbb{R}$, $\forall T > 0$, if we observe $\sum_{t=0}^{T} \overline{w}_t^2 \geq Q(F) \cdot (F(0) - F(z^\star))$, then

$$\mathbb{E}_{i \sim S}\left[F(y_i H_\beta(x_i))\right] \leq F(z^\star). \qquad (18)$$

- On the edge distribution: $\forall \theta \geq 0$, $\forall \varepsilon \in [0,1]$, $\forall T > 0$, letting $F_{\varepsilon,\theta} := (1-\varepsilon)\inf F + \varepsilon F(\theta)$, if the number of iterations satisfies

$$T \geq \frac{1}{\varepsilon^2} \cdot \frac{Q(F) \cdot (F(0) - F_{\varepsilon,\theta})}{\tilde{e}_f^2(-\theta)}, \quad \text{then} \quad \mathbb{P}_{i \sim S}\left[y_i H_\beta(x_i) \leq \theta\right] \leq \varepsilon. \qquad (19)$$

Thus, Theorem 11 gives boosting-compliant convergence on training, and the synthesis of (18) and (19) provides a very strong convergence guarantee. When classical assumptions about the loss of interest are satisfied, such as it being Lipschitz (ensured for proper canonical losses (Reid & Williamson, 2010)), there is a natural extension to generalization following standard approaches (Bartlett & Mendelson, 2002; Schapire et al., 1998). See Appendix A.10 for the proof of Theorem 11, and for additional remarks regarding its optimality and further application to addressing discrepancies due to machine-type approximations.
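As a back-of-the-envelope illustration of the rate in (19) (ours; the constants are arbitrary, and the logistic surrogate with $\overline{F''} \leq 1/4$ and $\inf F = 0$ is an assumption made for the sake of the example), one can evaluate the sufficient number of iterations directly:

```python
import numpy as np

# Evaluate the iteration bound in (19) for illustrative constants,
# assuming the logistic surrogate F(z) = log(1 + exp(-z)).
gamma, zeta, pi = 0.1, 0.1, 0.0        # WLA edge, PIL approximation, slack
F_second_sup = 0.25                    # sup F'' for the logistic surrogate
Q = 2 * F_second_sup / (gamma**2 * (1 - zeta)**2 * (1 - pi**2))

theta, eps = 0.0, 0.05                 # target margin and residual edge mass
F = lambda z: np.log1p(np.exp(-z))
F_eps_theta = (1 - eps) * 0.0 + eps * F(theta)
pil_at = 1.0 / (1.0 + np.exp(theta))   # sigmoid PIL evaluated at -theta

T = (1 / eps**2) * Q * (F(0.0) - F_eps_theta) / pil_at**2
print(int(np.ceil(T)))  # iterations sufficient for P[y H(x) <= theta] <= eps
```

Note how the PIL approximation quality ζ enters through the $O(1/(1-\zeta)^2)$ factor inside $Q(F)$, matching the cost announced in the introduction.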
6. Experiments

We provide experimental results on PILBOOST (for α ∈ {1.1, 2, 4}) and compare with AdaBoost (Freund & Schapire, 1997) and XGBoost (Chen & Guestrin, 2016) on four binary classification datasets, namely, cancer (Wolberg et al., 1995), xd6 (Buntine & Niblett, 1992), diabetes (Smith et al., 1988), and online shoppers intention (Sakar et al., 2019). We performed 10 runs per algorithm, with randomization over the train/test split and the twisters. All experiments use regression decision trees (of varying depths 1-3) in order to align with XGBoost. Hyperparameters of XGBoost were kept at their defaults to maintain the fairest comparison between the three algorithms; for more of these experimental details, please refer to Appendix B.5. In order to demonstrate the twist-properness of α-loss as implemented in PILBOOST, we corrupt the training examples of these datasets according to three different (malicious) twisters.

Class Noise Twister (all datasets): This twister is equivalent to SLN in the training sample. Results on this twister for the cancer dataset are presented in Figure 1; see Appendix B.3 for further results. In general, we find that PILBOOST is more robust to the class noise twister than AdaBoost and XGBoost, and we find that α increases as the amount of twist increases, which complies with our theory (Lemma 8 and Theorem 9). We also present an adaptive α experiment in Figure 2. We denote the adaptive method Menon PILBOOST, since we take inspiration from (Menon et al., 2015), where they show that one can estimate the level of label noise (see their Appendix D.1) from the minimum and maximum posterior values.

Table 1: Accuracies of AdaBoost, PILBOOST (for α ∈ {1.1, 2, 4}), and XGBoost on the xd6 dataset affected by the feature noise twister with flipping probability p ∈ {0, 0.15, 0.25, 0.5}. All three algorithms were trained with depth-3 regression trees. For each value of α, we set $a_f$ = 8. Note that the xd6 dataset is perfectly classified (when there is no twist) by a Boolean formula on the features, given in (Buntine & Niblett, 1992), which explains the performance when p = 0.

Algorithm      | p = 0         | p = 0.15      | p = 0.25      | p = 0.5
AdaBoost       | 1.000 ± 0.000 | 0.988 ± 0.013 | 0.966 ± 0.013 | 0.884 ± 0.019
us (α = 1.1)   | 1.000 ± 0.000 | 0.998 ± 0.004 | 0.994 ± 0.006 | 0.905 ± 0.020
us (α = 2.0)   | 1.000 ± 0.000 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.910 ± 0.026
us (α = 4.0)   | 1.000 ± 0.000 | 0.997 ± 0.006 | 0.999 ± 0.002 | 0.958 ± 0.017
XGBoost        | 1.000 ± 0.000 | 0.970 ± 0.016 | 0.962 ± 0.009 | 0.833 ± 0.027

Figure 2: Adaptive α experiment on the xd6 dataset with depth-3 regression trees. Solid curves correspond to mean classification accuracy and shaded areas are the associated 95% confidence intervals obtained from a t-test. For each label noise value, we train three algorithms: 1) vanilla XGBoost; 2) PILBOOST with fixed α = 1.1; and 3) an adaptive-α PILBOOST (which we refer to as Menon PILBOOST). For details regarding Menon PILBOOST, refer to Class Noise Twister in the main body. The result suggests that a fixed value of α = 1.1 in PILBOOST is good, but approximating $\alpha_0$ does induce slightly better model performance. For general twists, we suggest that this heuristic (or some variant), as inspired by (Menon et al., 2015), could be used to learn $\alpha_0$. Further experimental consideration is given in Appendix B.6.
Using a single decision tree classifier with $O(\log(m))$ leaves and $O(\sqrt{m})$ samples per leaf ($m \approx 681$ examples for the xd6 dataset with a 70/30 train/test split), and information gain as the splitting criterion, we estimate the minimum and maximum posterior values directly from the training data with local counts of the number of samples classified such that $Y = 1$ at each leaf. Once we obtain $\hat{\eta}_{\min}$ and $\hat{\eta}_{\max}$ in this way, we estimate the symmetric noise value $p \in [0,1]$ with the geometric mean $\hat{p} = \sqrt{\hat{\eta}_{\min}(1 - \hat{\eta}_{\max})}$. Finally, to estimate $\alpha_0$ for each noise level, we apply the formula in Lemma 8(c) and the SLN formula given just before Lemma 4, where we estimate $\eta_c$ with the average posterior from the decision tree classifier (a sketch follows below). Further experimental consideration is given in Appendix B.6.
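A sketch of this estimator is given below (our reconstruction for illustration: the tree settings and the averaged-posterior proxy for $\eta_c$ follow the text above, while the library calls and exact constants are our choices).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def estimate_alpha0(X, y, m):
    """Menon-inspired alpha_0 estimator, as described in the text."""
    tree = DecisionTreeClassifier(
        criterion="entropy",                      # information gain splits
        max_leaf_nodes=max(2, int(np.log2(m))),   # O(log m) leaves
        min_samples_leaf=int(np.sqrt(m)),         # O(sqrt(m)) samples/leaf
    ).fit(X, y)
    post = tree.predict_proba(X)[:, 1]            # leaf-local P[Y = 1]
    eta_min, eta_max = post.min(), post.max()
    p_hat = np.sqrt(eta_min * (1.0 - eta_max))    # SLN level estimate
    eta_c = post.mean()                           # crude clean-posterior proxy
    eta_t = eta_c * (1 - p_hat) + (1 - eta_c) * p_hat
    logit = lambda u: np.log(u / (1.0 - u))
    return logit(eta_c) / logit(eta_t)            # Lemma 8(c)
```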
Feature Noise Twister (xd6 dataset): This twister perturbs the training sample by randomly flipping features. More precisely, each training example is selected if $\mathrm{Ber}(p_1)$ returns 1. Then, for each selected training example, and for each feature independently, the feature is flipped (the features of xd6 are Booleans) to the other symbol if $\mathrm{Ber}(p_2)$ also returns 1. Results on this twister are presented in Table 1, where $p_1 = p_2 = p$. In general, we find that PILBOOST is more robust to the feature noise twister than AdaBoost and XGBoost, and we find that α increases as the amount of twist increases.

Insider Twister (online shoppers intention dataset): This twister assumes more knowledge about the model than the previous two twisters. In essence, the insider twister adds noise to a few of the most informative features for predicting the class. Specifically, for the online shoppers intention dataset, the insider twister adds noise to feature 8 (page values: numeric type with range in [−250, 435]), feature 10 (month), and feature 15 (visitor type: ternary alphabet). For page values, the insider twister adds i.i.d. $\mathcal{N}(0, 60)$ noise to the entries; for both month and visitor type, the insider twister independently increments (with probability 1/2) the symbol according to their respective alphabets, such that about 50% of each of these features are perturbed. Results on this twister are presented in Figure 3, with further discussion in Appendix B.4 (Figure 13); post-twister, the feature importance profile of XGBoost is almost uniform, displaying damage to the algorithm's discriminative abilities (Figure 3, right), while the feature importance profile of PILBOOST is much less perturbed overall.

Figure 3: Normalized feature importance profiles for PILBOOST with α = 1.1 and $a_f$ = 7 (left) and for XGBoost (right) on the online shoppers intention dataset (both for depth-3 trees), with and without the insider twister. We find that the insider twister significantly perturbs the feature importance of XGBoost, as evidenced in the plot (far right), and hence significantly reduces the inferential capacity of the learned model. More details can be found in Insider Twister (main body) and Appendix B.4.

7. Conclusion

In this work, we have introduced a generalization of the properness framework via the notion of twist-properness, which allows delineating loss functions with the ability to untwist the twisted posterior into the clean posterior. Notably, we have shown in Lemma 8 that a nontrivial extension of a loss function called α-loss, first introduced in information theory, is twist-proper. For a large class of twists (Definition 3), we have shown in Theorem 9 that while a pointwise estimation of α is optimal, a fixed choice of $\alpha_0 > 1$ readily outperforms the proper choice (log-loss). We then studied the twist-proper α-loss under a novel boosting algorithm, called PILBOOST, which uses the pseudo-inverse link for weight updates, and for which we proved convergence guarantees in Theorem 11. Lastly, we have presented experimental results for PILBOOST optimizing α-loss on several twists, as well as a method inspired by Menon et al. (2015) to estimate $\alpha_0$ using only training data (Figure 2). A possible next step for this line of work would be to design losses that are both twist-proper and algorithmically convenient.

8. Acknowledgments

We thank the anonymous reviewers for their comments and suggestions. This work is supported in part by NSF grants CIF-1901243, CIF-1815361, CIF-2007688, CIF-2134256, SaTC-2031799, and a Google AI for Social Good grant.

References

Agarwal, S. Surrogate regret bounds for bipartite ranking via strongly proper losses. JMLR, 15(1):1653-1674, 2014.

Alon, N., Gonen, A., Hazan, E., and Moran, S. Boosting simple learners. In STOC'21, 2021.

Amid, E., Warmuth, M.-K., Anil, R., and Koren, T. Robust bi-tempered logistic loss based on Bregman divergences. In NeurIPS*32, pp. 14987-14996, 2019.

Andriushchenko, M. and Hein, M. Provably robust boosted decision stumps and trees against adversarial attacks. In NeurIPS*32, 2019.

Arimoto, S. Information-theoretical considerations on estimation problems. Information and Control, 19:181-194, 1971.

Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. Invariant risk minimization. CoRR, abs/1907.02893, 2019.

Barron, J. T. A general and adaptive robust loss function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4331-4339, 2019.

Bartlett, P., Jordan, M., and McAuliffe, J. D. Convexity, classification, and risk bounds. J. of the Am. Stat. Assoc., 101:138-156, 2006.

Bartlett, P.-L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Bun, M., Carmosino, M.-L., and Sorrell, J. Efficient, noise-tolerant, and private learning via boosting. In 33rd COLT, Proceedings of Machine Learning Research, pp. 1031-1077. PMLR, 2020.

Buntine, W. and Niblett, T. A further comparison of splitting rules for decision-tree induction. Machine Learning, 8(1):75-85, 1992. URL https://www.openml.org/d/40693.

Castells, T., Weinzaepfel, P., and Revaud, J. SuperLoss: A generic loss for robust curriculum learning. In NeurIPS*33, 2020.

Chapelle, O., Weston, J., Bottou, L., and Vapnik, V. Vicinal risk minimization. In Advances in Neural Information Processing Systems*13, 2000.

Charoenphakdee, N., Vongkulbhisal, J., Chairatanakul, N., and Sugiyama, M. On focal loss for class-posterior probability estimation: A theoretical perspective. In 34th IEEE CVPR, pp. 5202-5211, 2021.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785-794, 2016.

Collins, M., Schapire, R., and Singer, Y. Logistic regression, AdaBoost and Bregman distances. In Proc. of the 13th International Conference on Computational Learning Theory, pp. 158-169, 2000.

Cranko, Z., Menon, A.-K., Nock, R., Ong, C. S., Shi, Z., and Walder, C.-J.
Monge blunts Bayes: Hardness results for adversarial training. In 36th ICML, pp. 1406-1415, 2019.

Fisher, R. A. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, 222(594-604):309-368, 1922.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Freund, Y., Schapire, R., and Abe, N. A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence, 14(771-780):1612, 1999.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. Ann. of Stat., 28:337-374, 2000.

Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. of Stat., 29:1189-1232, 2001.

Ghosh, A., Kumar, H., and Sastry, P. S. Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K.-Q. On calibration of modern neural networks. In 34th ICML, pp. 1321-1330, 2017.

He, X., Zhao, K., and Chu, X. AutoML: A survey of the state-of-the-art. Knowledge-Based Systems, 212:106622, 2021.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.

Kearns, M. J. and Vazirani, U. V. An Introduction to Computational Learning Theory. M.I.T. Press, 1994.

Knuth, D.-E. Two notes on notation. The American Mathematical Monthly, 99(5):403-422, 1992.

Li, T., Beirami, A., Sanjabi, M., and Smith, V. On tilted losses in machine learning: Theory and applications. arXiv preprint arXiv:2109.06141, 2021.

Liao, J., Kosut, O., Sankar, L., and du Pin Calmon, F. A tunable measure for information leakage. In 2018 IEEE International Symposium on Information Theory, ISIT 2018, pp. 701-705. IEEE, 2018.

Lin, T., Goyal, P., Girshick, R.-B., He, K., and Dollár, P. Focal loss for dense object detection. In Proc. of the 23rd IEEE ICCV, pp. 2999-3007, 2017.

Liu, Y. and Guo, H. Peer loss functions: Learning from noisy labels without knowing noise rates. In International Conference on Machine Learning, pp. 6226-6236. PMLR, 2020.

Long, P. M. and Servedio, R. A. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287-304, 2010.

Ma, X., Huang, H., Wang, Y., Romano, S., Erfani, S., and Bailey, J. Normalized loss functions for deep learning with noisy labels. In International Conference on Machine Learning, pp. 6543-6553. PMLR, 2020.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In 6th ICLR, 2018.

Mansour, Y., Nock, R., and Williamson, R. C. What killed the convex booster? 2022. URL https://arxiv.org/abs/2205.09628.

Matusita, K. Decision rule, based on the distance, for the classification problem. Annals of the Institute of Statistical Mathematics, 8(2):67-77, 1956.

Menon, A. K., van Rooyen, B., Ong, C.-S., and Williamson, R.-C. Learning from corrupted labels via class probability estimation. In 32nd ICML, pp. 125-134, 2015.

Mukhoti, J., Kulharia, V., Sanyal, A., Golodetz, S., Torr, P.-H.-S., and Dokania, P.-K. Calibrating deep neural networks using focal loss. In NeurIPS*33, 2020.
Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. Advances in Neural Information Processing Systems, 26:1196-1204, 2013.

Nock, R. and Nielsen, F. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pp. 1201-1208, 2008.

Nock, R. and Williamson, R.-C. Lossless or quantized boosting with integer arithmetic. In 36th ICML, pp. 4829-4838, 2019.

Patrini, G., Rozza, A., Menon, A.-K., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In CVPR'17, pp. 2233-2241, 2017.

Reid, M.-D. and Williamson, R.-C. Composite binary losses. JMLR, 11:2387-2422, 2010.

Reid, M.-D. and Williamson, R.-C. Information, divergence and risk for binary experiments. JMLR, 12:731-817, 2011.

Sakar, C. O., Polat, S. O., Katircioglu, M., and Kastro, Y. Real-time prediction of online shoppers' purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Computing and Applications, 31(10):6893-6908, 2019.

Savage, L.-J. Elicitation of personal probabilities and expectations. J. of the Am. Stat. Assoc., pp. 783-801, 1971.

Schapire, R.-E. and Freund, Y. Boosting: Foundations and Algorithms. MIT Press, 2012.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. MLJ, 37:297-336, 1999.

Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, 26:1651-1686, 1998.

Shuford, E., Albert, A., and Massengil, H.-E. Admissible probability measurement procedures. Psychometrika, pp. 125-145, 1966.

Smith, J. W., Everhart, J. E., Dickson, W., Knowler, W. C., and Johannes, R. S. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care, pp. 261. American Medical Informatics Association, 1988. URL https://www.kaggle.com/uciml/pima-indians-diabetes-database.

Sypherd, T., Diaz, M., Sankar, L., and Kairouz, P. A tunable loss function for binary classification. In IEEE International Symposium on Information Theory, ISIT 2019, pp. 2479-2483. IEEE, 2019.

Sypherd, T., Diaz, M., Cava, J. K., Dasarathy, G., Kairouz, P., and Sankar, L. A tunable loss function for robust classification: Calibration, landscape, and generalization. IEEE Transactions on Information Theory, pp. 1-1, 2022. doi: 10.1109/TIT.2022.3169440.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.

Thekumparampil, K.-K., Jain, P., Netrapalli, P., and Oh, S. Projection efficient subgradient method and optimal nonsmooth Frank-Wolfe method. In NeurIPS*33, 2020.

Thomas, M. and Joy, A. T. Elements of Information Theory. Wiley-Interscience, 2006.

Truong, L., Jones, C., Hutchinson, B., August, A., Praggastis, B., Jasper, R., Nichols, N., and Tuor, A. Systematic evaluation of backdoor data poisoning attacks on image classifiers. In CVPR'20, pp. 3422-3431, 2020.

Valiant, L. G. A theory of the learnable. Communications of the ACM, 27:1134-1142, 1984.

Valiant, L. G. Learning disjunctions of conjunctions. In Proc. of the 9th International Joint Conference on Artificial Intelligence, pp. 560-566, 1985.

Wolberg, D. W., Street, N., and Mangasarian, O. Breast Cancer Wisconsin (Diagnostic) Data Set, 1995.
URL https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

Zhang, H., Cisse, M., Dauphin, Y.-D., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. In 6th ICLR, 2018.

Zhang, T., Yamane, I., Lu, N., and Sugiyama, M. A one-step approach to covariate shift adaptation. CoRR, abs/2007.04043, 2020.

Zhang, Y., Niu, G., and Sugiyama, M. Learning noise transition matrix from only noisy labels via total variation regularization. In 38th ICML, pp. 12501-12512, 2021.

Zhang, Z. and Sabuncu, M.-R. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS*31, 2018.

Appendices for Being Properly Improper

A. Proofs, Further Theoretical Results, and Additional Commentary

A.1. Illustration of Proper Loss to Surrogate through the Convex Conjugate

In this subsection, we provide a worked-out example of how picking the log-loss as $\ell$ gives the binary entropy for $\underline{L}$ and the logistic loss for $F$. From (Reid & Williamson, 2010), we have that the log-loss has partial losses $\ell_1(u) = -\log u$, $\ell_{-1}(u) = -\log(1-u)$ and is a proper loss. In order to compute the (pointwise) Bayes risk $\underline{L}$ for the log-loss, we first obtain from (2)

$$L(u, v) = v\,\ell_1(u) + (1-v)\,\ell_{-1}(u) = -v \log u - (1-v)\log(1-u). \qquad (20)$$

Recall that $\underline{L}(v) := \inf_u L(u,v)$. In (20), taking the derivative with respect to $u$, setting the expression equal to zero, i.e., $\frac{d}{du} L(u,v) = 0$, and solving for $u$, we obtain $u = v$; in other words, the log-loss is indeed proper. Plugging $u = v$ back into (20), we find that the pointwise Bayes risk of the log-loss is

$$\underline{L}(v) = -v \log v - (1-v)\log(1-v), \qquad (21)$$

which is indeed the binary (Shannon) entropy (Thomas & Joy, 2006). Finally, to obtain the logistic loss as the surrogate, we compute the convex conjugate of (21). Formally, we have from (3) that, $\forall z \in \mathbb{R}$,

$$F(z) = (-\underline{L})^\star(-z) = \left(v \log v + (1-v)\log(1-v)\right)^\star(-z). \qquad (22)$$

Indeed, we have that, $\forall z \in \mathbb{R}$,

$$\left(v \log v + (1-v)\log(1-v)\right)^\star(z) = \sup_v \{zv - v\log v - (1-v)\log(1-v)\}, \qquad (23)$$

which is similarly obtained by setting the derivative equal to zero and solving, i.e.,

$$\frac{d}{dv}\left[zv - v\log v - (1-v)\log(1-v)\right] = 0 \qquad (24)$$
$$\Leftrightarrow \quad z + \log(1-v) - \log v = 0 \qquad (25)$$
$$\Leftrightarrow \quad z = \log\frac{v}{1-v} \qquad (26)$$
$$\Leftrightarrow \quad v = \frac{1}{1 + e^{-z}}, \qquad (27)$$

which is obtained after a few steps of algebra. Plugging (27) back into (23), we obtain

$$\sup_v\{zv - v\log v - (1-v)\log(1-v)\} = \frac{z}{1 + e^{-z}} + \frac{\log(1 + e^{-z})}{1 + e^{-z}} + \frac{\log(1 + e^{z})}{1 + e^{z}}. \qquad (28)$$

Noticing that $\log(1 + e^{-z}) = \log(e^{-z}(1 + e^{z})) = -z + \log(1+e^z)$, we have from (28) that

$$\sup_v\{zv - v\log v - (1-v)\log(1-v)\} = \log(1 + e^{z}). \qquad (29)$$

Plugging this back into (22), we have that for the log-loss the surrogate is given by

$$F(z) = (-\underline{L})^\star(-z) = \log(1 + e^{-z}), \qquad \forall z \in \mathbb{R}, \qquad (30)$$

which is indeed the logistic loss (cf. (Bartlett et al., 2006)), useful in the margin setting.
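The derivation (22)-(30) can be checked numerically; the grid-based sketch below (ours) evaluates the conjugate by brute force and compares it against the closed form $\log(1 + e^{-z})$, also confirming the maximizer (27).

```python
import numpy as np

# Grid check of (22)-(30): the flipped conjugate of -H_b is the logistic loss.
v = np.linspace(1e-9, 1 - 1e-9, 200_001)
neg_entropy = v * np.log(v) + (1 - v) * np.log(1 - v)  # (-L_bayes)(v), Eq. (21)

def F(z):
    # F(z) = (-L_bayes)^*(-z) = sup_v { -z*v - (v log v + (1-v) log(1-v)) }
    return np.max(-z * v - neg_entropy)

for z in [-2.0, 0.0, 1.5]:
    print(F(z), np.log1p(np.exp(-z)))   # agree up to grid error, Eq. (30)

# The maximizer matches (27) evaluated at -z: v* = sigmoid(-z).
z = 1.5
print(v[np.argmax(-z * v - neg_entropy)], 1 / (1 + np.exp(z)))
```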
A.2. Proof of Lemma 1

We study $U := (-\underline{L})^\star$, which is convex by definition, and show that it is non-decreasing. Monotonicity follows from the non-negativity of the argument of the partial losses and the definition of the convex conjugate: suppose $z' \geq z$ and let $u^\star \in \arg\sup_u \{zu + \underline{L}(u)\}$. We have

$$U(z') := \sup_{u \in [0,1]} \{z'u + \underline{L}(u)\} \qquad (31)$$
$$= \sup_{u \in [0,1]} \{(z'-z)u + zu + \underline{L}(u)\} \qquad (32)$$
$$\geq (z'-z)u^\star + zu^\star + \underline{L}(u^\star) \qquad (33)$$
$$= (z'-z)u^\star + U(z) \geq U(z), \qquad (34)$$

since $u^\star \geq 0$ and $z' \geq z$, which completes the proof that $U$ is non-decreasing and therefore $F(z) := U(-z)$ is non-increasing. Concavity of $\underline{L}$ follows from its definition. We show continuity of $\underline{L}$, the continuity of $F$ then following from the definition of the convex conjugate (Boyd & Vandenberghe, 2004). Let $a, u \in (0,1)$, and let $u^\star \in t_\ell(u)$, $a^\star \in t_\ell(a)$. We get

$$\underline{L}(u) := u\ell_1(u^\star) + (1-u)\ell_{-1}(u^\star) \qquad (36)$$
$$\leq u\ell_1(a^\star) + (1-u)\ell_{-1}(a^\star) \qquad (37)$$
$$= \underline{L}(a) + (u-a)\left(\ell_1(a^\star) - \ell_{-1}(a^\star)\right) \qquad (38)$$

(the inequality holds since otherwise $u^\star \notin t_\ell(u)$). Permuting the roles of $u$ and $a$, we also get

$$\underline{L}(a) \leq \underline{L}(u) + (a-u)\left(\ell_1(u^\star) - \ell_{-1}(u^\star)\right), \qquad (39)$$

from which we get

$$|\underline{L}(a) - \underline{L}(u)| \leq Z\,|a - u|, \qquad (40)$$

with $Z := \max_{v \in \{a,u\}} \sup |\ell_1(t_\ell(v)) - \ell_{-1}(t_\ell(v))|$ (where we use set differences if the $t_\ell$ are not singletons). Since $Z < \infty$, (40) is enough to show the continuity of $\underline{L}$ (we have by assumption $\mathrm{dom}(\underline{L}) = [0,1]$).

A.3. Bayes Tilted Estimates

The proof of Lemma 2 readily follows from the definition in (4) and standard properties of convex functions, see, e.g., (Boyd & Vandenberghe, 2004). Below, we also analyze the properties of Bayes tilted estimates for more general losses, which induce set-valued functions. Following convention, we denote the set-valued inequality $A \leq B$ to mean $\forall a \in A, \forall b \in B, a \leq b$, and the set-valued (Minkowski) difference $A \ominus B := \{a - b : a \in A, b \in B\}$.

Lemma 12. The following properties of $t_\ell$ follow from assumptions (M), (D), or (S) on the partial losses:
(M) implies set-valued monotonicity: $\forall u_1 < u_3 \in [0,1]$, we have $t_\ell(u_1) \leq t_\ell(u_3)$ and $t_\ell(u_1) \cap t_\ell(u_3) \subseteq t_\ell(u_2)$, $\forall u_2 \in (u_1, u_3)$;
(D) and $t_\ell$ differentiable imply monotonicity: $\forall u \in [0,1]$, $\left(\ell_1'(t_\ell(u)) - \ell_{-1}'(t_\ell(u))\right)t_\ell'(u) \leq 0$;
(S) implies set-valued symmetry: $t_\ell(1-u) = \{1\} \ominus t_\ell(u)$, $\forall u \in [0,1]$;
(E) Extreme values: if $\ell_1(1) = \ell_{-1}(0) = 0$, $\ell_1([0,1]) \subseteq \mathbb{R}_+$, and $\ell_{-1}([0,1]) \subseteq \mathbb{R}_+$, then properness holds on extreme values, as $0 \in t_\ell(0)$, $1 \in t_\ell(1)$.

Case (M). Suppose $t_\ell(a) \cap t_\ell(a') \neq \emptyset$ for some $a \neq a'$, and let $v^\star \in t_\ell(a) \cap t_\ell(a')$. This means, $\forall v \in [0,1]$,

$$a\ell_1(v^\star) + (1-a)\ell_{-1}(v^\star) \leq a\ell_1(v) + (1-a)\ell_{-1}(v), \qquad (41)$$
$$a'\ell_1(v^\star) + (1-a')\ell_{-1}(v^\star) \leq a'\ell_1(v) + (1-a')\ell_{-1}(v), \qquad (42)$$

and so, $\forall \delta \in [0,1]$, if we let $a_\delta := a + \delta(a' - a)$, a $(1-\delta, \delta)$ convex combination of both inequalities yields

$$a_\delta\ell_1(v^\star) + (1-a_\delta)\ell_{-1}(v^\star) \leq a_\delta\ell_1(v) + (1-a_\delta)\ell_{-1}(v), \quad \forall v \in [0,1], \qquad (43)$$

which implies $v^\star \in t_\ell(a_\delta)$ and shows the right part of Case (M). To show the left part of Case (M), we add to (41) and (42) the inequality

$$a\ell_1(v') + (1-a)\ell_{-1}(v') \leq a\ell_1(v) + (1-a)\ell_{-1}(v), \qquad (44)$$

with therefore $v' \in t_\ell(a)$, implying $a\ell_1(v') + (1-a)\ell_{-1}(v') = a\ell_1(v^\star) + (1-a)\ell_{-1}(v^\star)$, as otherwise one of $v', v^\star$ would not be in $t_\ell(a)$. We then get

$$a'\ell_1(v') + (1-a')\ell_{-1}(v') = a\ell_1(v') + (1-a)\ell_{-1}(v') + (a'-a)\left(\ell_1(v') - \ell_{-1}(v')\right)$$
$$= a\ell_1(v^\star) + (1-a)\ell_{-1}(v^\star) + (a'-a)\left(\ell_1(v') - \ell_{-1}(v')\right)$$
$$= a'\ell_1(v^\star) + (1-a')\ell_{-1}(v^\star) + (a'-a)\Delta, \qquad (45)$$

with $\Delta := \left(\ell_1(v') - \ell_{-1}(v')\right) - \left(\ell_1(v^\star) - \ell_{-1}(v^\star)\right)$. Considering (45), we deduce from (42) that to have $v' \in t_\ell(a')$, we equivalently need $(a'-a)\Delta \leq 0$. We also know by assumption that $\ell_1$ is non-increasing and $\ell_{-1}$ is non-decreasing, so $g(u) := \ell_1(u) - \ell_{-1}(u)$ is non-increasing. We thus have $(a'-a)\Delta \leq 0$ iff one of the two possibilities holds: $a' \geq a$ and $v' \geq v^\star$, or $a' \leq a$ and $v' \leq v^\star$, which shows the left part of Case (M).

Case (E). We have $\underline{L}(0) = \inf_{v \in [0,1]} \ell_{-1}(v) = 0$, attained at $v = 0$; hence $0 \in t_\ell(0)$. Similarly, $\underline{L}(1) = \inf_{v \in [0,1]} \ell_1(v) = 0$, attained at $v = 1$; hence $1 \in t_\ell(1)$.

Case (D). We have

$$\frac{d}{du}\underline{L}(u) = \ell_1(t_\ell(u)) + u\ell_1'(t_\ell(u))t_\ell'(u) - \ell_{-1}(t_\ell(u)) + (1-u)\ell_{-1}'(t_\ell(u))t_\ell'(u)$$
$$= \ell_1(t_\ell(u)) - \ell_{-1}(t_\ell(u)) + t_\ell'(u)\left(u\ell_1'(t_\ell(u)) + (1-u)\ell_{-1}'(t_\ell(u))\right), \qquad (46)$$

but since $v = t_\ell(u)$ is the solution to (41), it satisfies the first-order condition $u\ell_1'(t_\ell(u)) + (1-u)\ell_{-1}'(t_\ell(u)) = 0$, so that (46) simplifies to

$$\frac{d}{du}\underline{L}(u) = \ell_1(t_\ell(u)) - \ell_{-1}(t_\ell(u)), \qquad (47)$$

and since $\underline{L}$ is concave and the partial losses are differentiable,

$$\frac{d^2}{du^2}\underline{L}(u) = t_\ell'(u)\left(\ell_1'(t_\ell(u)) - \ell_{-1}'(t_\ell(u))\right) \leq 0, \quad \forall u, \qquad (48)$$

which proves the statement of the lemma.

Case (S). Suppose $v^\star \in t_\ell(a)$, which implies

$$a\ell_1(v^\star) + (1-a)\ell_{-1}(v^\star) \leq a\ell_1(v) + (1-a)\ell_{-1}(v), \quad \forall v \in [0,1]. \qquad (49)$$
Case (S) Suppose $v^* \in t_\ell(a)$, which implies

$$a\,\ell_1(v^*) + (1-a)\,\ell_{-1}(v^*) \leq a\,\ell_1(v) + (1-a)\,\ell_{-1}(v), \quad \forall v \in [0,1]. \quad (49)$$

We also note that, since symmetry holds, $a\,\ell_1(v^*) + (1-a)\,\ell_{-1}(v^*) = (1-a)\,\ell_1(1-v^*) + a\,\ell_{-1}(1-v^*)$, which implies because of (49) that $1-v^* \in t_\ell(1-a)$.

Remark: even if we assume the partial losses to be strictly monotonic, the tilted estimate can still be set-valued. To see this, craft the partial losses such that $v \in t_\ell(u)$, and then, for some $w > v$, replace the partial losses in the interval $[v, w]$ by affine parts with slope $a < 0$ for $\ell_1$ and $b > 0$ for $\ell_{-1}$, such that $b/a = -u/(1-u)$; this guarantees $L(u, v) = L(u, w)$ and thus $w \in t_\ell(u)$.

A.4. Proof of Lemma 6

We recall the focal loss' corresponding pointwise conditional risk in lieu of (2):

$$L_\gamma(u, v) := -v\,(1-u)^\gamma \log u - (1-v)\,u^\gamma \log(1-u), \quad (50)$$

and if it is twist-proper, then for any $\eta_t, \eta_c \in [0,1]$, there exists $\gamma \geq 0$ such that

$$\left.\frac{\partial}{\partial u} L_\gamma(u, \eta_t)\right|_{u = \eta_c} = 0. \quad (51)$$

Equivalently, we must find $\gamma \geq 0$ such that (keeping notations $u := \eta_c$, $v := \eta_t$ for clarity):

$$(1-v)\,u^\gamma\left(\gamma(1-u)\log(1-u) - u\right) = v\,(1-u)^\gamma\left(\gamma u\log u + u - 1\right), \quad (52)$$

and we see that twist-properness implies the statement that for any $K \geq 0$ (note that $K = v/(1-v)$) and any $u \in [0,1)$, there exists $\gamma \geq 0$ such that

$$\frac{f(u, \gamma)}{f(1-u, \gamma)} = K, \quad (53)$$
$$f(u, \gamma) := u^\gamma\left(\gamma(1-u)\log(1-u) - u\right). \quad (54)$$

We study the ratio for $u \in [0, 1/2]$. We have $f(u, \gamma) \leq 0$, $\forall u \in [0,1], \gamma \geq 0$, and

$$\frac{\partial}{\partial\gamma} f(u, \gamma) = u^\gamma\left(\gamma\,a(u) - b(u)\right), \quad (55)$$

with $a(u) := (1-u)\log(1-u)\log(u) \geq 0$ and $b(u) := u\log u - (1-u)\log(1-u)$, satisfying $b(1-u) = -b(u)$ and $u\,a(u) = (1-u)\,a(1-u)$. Hence, we arrive after some derivations at

$$\frac{\partial}{\partial\gamma}\,\frac{f(u, \gamma)}{f(1-u, \gamma)} = \frac{(u(1-u))^\gamma}{f^2(1-u, \gamma)}\left(A(u)\gamma^2 + B(u)\gamma + C(u)\right), \quad (56)$$

$$A(u) := u(1-u)\log(u)\log(1-u)\log(u(1-u)), \quad (57)$$
$$B(u) := -\left(u^2\log^2 u + (1-u)^2\log^2(1-u) + (1-2u)^2\log u\log(1-u)\right), \quad (58)$$
$$C(u) := (1-2u)\,b(u). \quad (59)$$

All functions $A, B, C$ are non-positive for any fixed $u \in [0,1/2]$, so the ratio in (53) is non-increasing over $\gamma \geq 0$ and, as a consequence, for any fixed $u \in [0,1/2]$,

$$\frac{f(u, \gamma)}{f(1-u, \gamma)} \leq \frac{f(u, 0)}{f(1-u, 0)} = \frac{u}{1-u}, \quad \forall\gamma \geq 0, \quad (60, 61)$$

so we see that (53) cannot be satisfied when $K > u/(1-u)$ and, as a consequence, the focal loss is not twist-proper.
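The monotonicity argument above can be checked numerically. The following sketch (our own helper `f`, mirroring (54)) verifies that the ratio in (53) never exceeds $u/(1-u)$, the value it takes at $\gamma = 0$:

```python
import numpy as np

def f(u, gamma):
    # f(u, gamma) from (54): u^gamma * (gamma*(1-u)*log(1-u) - u).
    return u**gamma * (gamma * (1.0 - u) * np.log(1.0 - u) - u)

# Per (56)-(61), the ratio in (53) is maximized at gamma = 0, where it equals u/(1-u):
for u in np.linspace(0.05, 0.5, 10):
    ratios = [f(u, g) / f(1.0 - u, g) for g in np.linspace(0.0, 50.0, 501)]
    assert max(ratios) <= u / (1.0 - u) + 1e-9
print("ratio (53) never exceeds u/(1-u): the focal loss cannot untwist K > u/(1-u)")
```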
Twist-improperness of the SuperLoss. The SuperLoss (Castells et al., 2020) works as a wrapper of a loss, its partial losses being defined as

$$L_{b,\lambda}(\ell, \sigma_i) = (\ell_b - \tau)\,\sigma_i + \lambda(\log\sigma_i)^2, \quad (62)$$

where $b \in \{-1, 1\}$ indicates the partial loss of a loss of interest, and $\tau \in \mathrm{Im}\,\ell_b$, $\lambda > 0$ are user-defined parameters. $\sigma_i$ is a functional computed to minimize the partial losses, and we get the optimal expression:

$$\sigma^*(\ell_b) = \exp\left(-W\left(\tfrac{1}{2}\max(-2/e, \beta)\right)\right), \quad (63)$$

with $\beta = (\ell_b - \tau)/\lambda$ (notice this is also a function via the partial loss). $W$ is called Lambert's function; it does not have an analytical form.

Lemma 13 Suppose the loss $\ell$ in the SuperLoss is such that its partial loss $\ell_1$ is strictly decreasing and $\ell_{-1}$ is strictly increasing. Then the corresponding SuperLoss with partial losses $L_{b,\lambda}(\ell, \sigma^*(\ell_b))$ ($b \in \{-1,1\}$) is not twist-proper.

Remark: the assumptions about the partial losses are very weak and would be satisfied by all popular losses (e.g., log, square, Matusita, etc.).

Proof The notable facts about $W$ useful for our proof are:

$$e^{W(z)} = \frac{z}{W(z)}, \quad \frac{d}{dz}W(z) = \frac{1}{z + e^{W(z)}}, \quad \sup_z\,\exp(-W(z)) = e. \quad (64)$$

Simplifying notations above, we end up studying a loss with partial losses defined as

$$L_\lambda(\ell_b) = (\ell_b - \tau)\,\sigma^*(\ell_b) + \lambda(\log\sigma^*(\ell_b))^2. \quad (65)$$

Recall that a loss $\ell$ is twist-proper iff for any twist there exist hyperparameters such that $\eta_c \in t_\ell(\eta_t)$. Examining this for the SuperLoss, we obtain

$$t_L(v) = \arg\inf_{u \in [0,1]} L(u, v) \quad (66)$$
$$= \arg\inf_{u \in [0,1]} v\,L_\lambda(\ell_1(u)) + (1-v)\,L_\lambda(\ell_{-1}(u)) \quad (67)$$
$$= \arg\inf_{u \in [0,1]} v\left[(\ell_1(u) - \tau)\sigma^*(\ell_1) + \lambda(\log\sigma^*(\ell_1))^2\right] + (1-v)\left[(\ell_{-1}(u) - \tau)\sigma^*(\ell_{-1}) + \lambda(\log\sigma^*(\ell_{-1}))^2\right]. \quad (68)$$

We note that if $\ell$ is proper, then $v \in t_\ell(v)$. Computing the minimum in (68), we set

$$\frac{d}{du}\left\{v\left[(\ell_1(u) - \tau)\sigma^*(\ell_1) + \lambda(\log\sigma^*(\ell_1))^2\right] + (1-v)\left[(\ell_{-1}(u) - \tau)\sigma^*(\ell_{-1}) + \lambda(\log\sigma^*(\ell_{-1}))^2\right]\right\} = 0. \quad (69)$$

Similarly to the computation for the focal loss, we need

$$\frac{\frac{d}{du}\left[(\ell_1(u) - \tau)\sigma^*(\ell_1) + \lambda(\log\sigma^*(\ell_1))^2\right]}{\frac{d}{du}\left[(\ell_{-1}(u) - \tau)\sigma^*(\ell_{-1}) + \lambda(\log\sigma^*(\ell_{-1}))^2\right]} = -\frac{1-v}{v} = -K, \quad (70)$$

and this needs to hold (via the choice of parameters $\tau, \lambda$) for any $u \in [0,1)$ and $K > 0$. To save notations, define $\beta_b(u) := (\ell_b(u) - \tau)/(2\lambda)$. Remark that if $\ell_b(u) > \tau - (2\lambda)/e$, we have $\sigma^*(\ell_b(u)) = \exp(-W(\beta_b(u)))$, and so

$$\frac{d}{du}\left[(\ell_b(u) - \tau)\,\sigma^*(\ell_b(u)) + \lambda(\log\sigma^*(\ell_b(u)))^2\right] = \frac{d}{du}\left[2\lambda\,\beta_b(u)\,e^{-W(\beta_b(u))} + \lambda\,W^2(\beta_b(u))\right]$$
$$= 2\lambda\,\beta'_b(u)\left[e^{-W(\beta_b(u))} - \frac{\beta_b(u)\,e^{-W(\beta_b(u))}}{\beta_b(u) + e^{W(\beta_b(u))}} + \frac{W(\beta_b(u))}{\beta_b(u) + e^{W(\beta_b(u))}}\right]$$
$$= 2\lambda\,\beta'_b(u)\,\frac{1 + W(\beta_b(u))}{\beta_b(u) + e^{W(\beta_b(u))}} \quad (71)$$
$$= \ell'_b(u)\,\exp(-W(\beta_b(u))), \quad (72)$$

since indeed it comes from (64) that

$$\frac{1 + W(z)}{z + e^{W(z)}} = \exp(-W(z)); \quad (73)$$

also, if $\ell_b(u) \leq \tau - (2\lambda)/e$, we have $\sigma^*(\ell_b(u)) = \exp(-W(-1/e)) = e$ and so

$$\frac{d}{du}\left[(\ell_b(u) - \tau)\,\sigma^*(\ell_b(u)) + \lambda(\log\sigma^*(\ell_b(u)))^2\right] = \ell'_b(u)\,e. \quad (74)$$

Since $\lim_{z\to -1/e^+}\exp(-W(z)) = e$, we can summarize both (72) and (74) as

$$\frac{d}{du}\left[(\ell_b(u) - \tau)\,\sigma^*(\ell_b(u)) + \lambda(\log\sigma^*(\ell_b(u)))^2\right] = \ell'_b(u)\,\exp\left(-W(\max\{-1/e, \beta_b(u)\})\right). \quad (75)$$

Now consider a loss $\ell$ with $\ell_{-1}$ strictly increasing and $\ell_1$ strictly decreasing. Pick $u$ so that we have simultaneously

$$\ell_1(u) > \tau - \frac{2\lambda}{e}, \quad (76)$$
$$\ell_{-1}(u) \leq \tau - \frac{2\lambda}{e}, \quad (77)$$

which, assuming both inequalities fit in the range of the respective partial loss, means $u \in [0, \gamma]$ for some $\gamma > 0$. Rewriting (70), we need to show that for any such $\gamma > 0$, $u \in [0,\gamma]$ and $K > 0$, there exists a choice of $\tau, \lambda$ such that

$$\frac{\ell'_1(u)\,\exp(-W(\beta_1(u)))}{\ell'_{-1}(u)\,e} = -K, \quad (78)$$

which rewrites conveniently as

$$\exp(-W(\beta_1(u))) = -\frac{K\,e\,\ell'_{-1}(u)}{\ell'_1(u)} \quad (79)$$
$$\Longleftrightarrow \exp(-W(\beta_1(u))) = K', \quad (80)$$

for any $K' > 0$ (the RHS of (79) is indeed always strictly positive). But $\sup\exp(-W(z)) = e$ by (64), so (80) cannot hold for all $K' > 0$, and the SuperLoss is not twist-proper.

A.5. Proof of Lemma 8

For part (a): we cite (Sypherd et al., 2022), which demonstrates (M), (D), (S) for $\alpha > 0$. With our extension of $\alpha$-loss, these can also be readily shown for $\alpha < 0$, since those losses are mapped back to the $\alpha > 0$ losses.

For part (b): we know from Lemma 2 that $\alpha$-loss, for $\alpha \in \mathbb{R}\setminus\{0, \pm\infty\}$, due to strict convexity, returns a singleton, i.e., $|t_{\ell^\alpha}(\eta_t)| = 1$. With regards to that singleton, we know from (Sypherd et al., 2022; Liao et al., 2018) for $\alpha > 0$ that $t_{\ell^\alpha}(\eta_t) = \frac{\eta_t^\alpha}{\eta_t^\alpha + (1-\eta_t)^\alpha}$. A very similar calculation recovers $t_{\ell^\alpha}(\eta_t) = \frac{(1-\eta_t)^{-\alpha}}{\eta_t^{-\alpha} + (1-\eta_t)^{-\alpha}}$ for $\alpha < 0$. Multiplying the numerator and denominator of this expression by $(\eta_t(1-\eta_t))^\alpha$, we can simply write both expressions as

$$t_{\ell^\alpha}(\eta_t) = \frac{\eta_t^\alpha}{\eta_t^\alpha + (1-\eta_t)^\alpha}.$$

Regarding the limits as $\alpha \to \pm\infty$, the $+\infty$ case was already shown by (Sypherd et al., 2022) and is similarly (readily) extended to the $\alpha \to -\infty$ case.
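The closed form in part (b) is easy to cross-check by direct grid minimization of the pointwise conditional risk. The sketch below assumes the $\alpha$-loss partial losses $\ell^\alpha_1(u) = \frac{\alpha}{\alpha-1}(1 - u^{(\alpha-1)/\alpha})$ and $\ell^\alpha_{-1}(u) = \ell^\alpha_1(1-u)$ plotted in Figure 4; all function names are ours:

```python
import numpy as np

def alpha_partial(u, alpha):
    # Partial loss of the alpha-loss for the positive class (cf. Definition 7):
    # l_1(u) = alpha/(alpha-1) * (1 - u^((alpha-1)/alpha)).
    return alpha / (alpha - 1.0) * (1.0 - u ** ((alpha - 1.0) / alpha))

def tilted_grid(eta, alpha, grid=np.linspace(1e-6, 1 - 1e-6, 100001)):
    # argmin_u of the pointwise conditional risk (2) for the alpha-loss.
    risk = eta * alpha_partial(grid, alpha) + (1 - eta) * alpha_partial(1 - grid, alpha)
    return grid[np.argmin(risk)]

def tilted_closed(eta, alpha):
    # Closed form of Lemma 8(b): eta^alpha / (eta^alpha + (1-eta)^alpha).
    return eta**alpha / (eta**alpha + (1.0 - eta) ** alpha)

for alpha in [0.5, 1.8, 4.0]:          # alpha = 1 (log-loss) excluded: it is a limit
    for eta in [0.1, 0.3, 0.6, 0.9]:
        assert abs(tilted_grid(eta, alpha) - tilted_closed(eta, alpha)) < 1e-3
print("grid minimization matches the closed form of Lemma 8(b)")
```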
For part (c): here, we break entirely new ground. Let $\alpha > 0$. To obtain twist-properness as stipulated in Definition 5, we seek to know for what $\alpha$ the following holds:

$$\eta_c = \frac{\eta_t^\alpha}{\eta_t^\alpha + (1-\eta_t)^\alpha}. \quad (81)$$

Figure 4: A plot of $\ell^\alpha_1(u)$ for $\alpha > 0$ as given in Definition 7, for $\alpha \in \{0.35, 0.5, 0.7, 1\ \text{[log-loss]}, 1.8, \infty\ \text{[0-1 loss]}\}$.

Solving for $\alpha$, we obtain

$$\eta_c = \frac{\eta_t^\alpha}{\eta_t^\alpha + (1-\eta_t)^\alpha} \quad (82)$$
$$\Longleftrightarrow \frac{1}{\eta_c} = 1 + \left(\frac{1-\eta_t}{\eta_t}\right)^\alpha \quad (83)$$
$$\Longleftrightarrow \left(\frac{1-\eta_t}{\eta_t}\right)^\alpha = \frac{1}{\eta_c} - 1 \quad (84)$$
$$\Longleftrightarrow \alpha\log\frac{1-\eta_t}{\eta_t} = \log\frac{1-\eta_c}{\eta_c} \quad (85)$$
$$\Longleftrightarrow \alpha = \frac{\log\frac{1-\eta_c}{\eta_c}}{\log\frac{1-\eta_t}{\eta_t}}. \quad (86)$$

After multiplying the numerator and denominator by $-1$, we obtain the desired result. Namely, $\alpha$-loss is twist-proper for

$$\alpha^* = \frac{\iota(\eta_c)}{\iota(\eta_t)}. \quad (87)$$

Interestingly, (87) is the ratio of the logits (the logit being the link function of the log-loss, $\alpha = 1$) evaluated at the clean posterior and at the twisted posterior; in essence, a kind of ratio test.

For part (d): we recall the definition of a Bayes blunting twist from Definition 3: a twist $\eta_c \mapsto \eta_t$ is Bayes blunting iff $(\eta_c \geq \eta_t \geq 1/2) \vee (\eta_c \leq \eta_t \leq 1/2)$. Also, recall that $\alpha^*$ is given in (87), and see Figure 5 for a plot of the logit function. Let $\eta_c \geq 1/2$. The Bayes blunting twist can take $\eta_t$ anywhere in $\eta_c \geq \eta_t \geq 1/2$. If $\eta_t = \eta_c$, then $\alpha^* = 1$. If $\eta_t \to 1/2$, as can be seen in the figure, the sign crossover point of the logit is $1/2$, so $\alpha^* \to \infty$. Thus, by continuity, we have that $\alpha^* \geq 1$. Finally, the case where $\eta_c < 1/2$ follows, mutatis mutandis.

A.6. Proof of Theorem 9

We want to show that for any strictly Bayes blunting twist $\eta_c \mapsto \eta_t$, there exists a fixed $\alpha_0 > 1$ and an optimal $\alpha^*$-mapping, $\alpha^* : \mathcal{X} \to \mathbb{R}_{>1}$, which induces the following ordering:

$$D_{\mathrm{KL}}(\eta_c, \eta_t; 1) > D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha_0) \geq D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha^*). \quad (88)$$

Figure 5: A plot of the logit $\iota(u) := \log(u/(1-u))$.

Recalling (9), which is the identity

$$D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha) = \mathrm{CE}(\eta_c, \eta_t; \alpha) - H(\eta_c) \quad (89)$$
$$= \mathbb{E}_{X\sim M}\left[\eta_c(X)\log\frac{\eta_c(X)}{t_{\ell^\alpha}(\eta_t(X))} + (1-\eta_c(X))\log\frac{1-\eta_c(X)}{1 - t_{\ell^\alpha}(\eta_t(X))}\right], \quad (90)$$

by subtracting $H(\eta_c)$ from both sides, we rewrite the desired statement (10) (also given here in (88)) as

$$\mathrm{CE}(\eta_c, \eta_t; 1) > \mathrm{CE}(\eta_c, \eta_t; \alpha_0) \geq \mathrm{CE}(\eta_c, \eta_t; \alpha^*). \quad (91)$$

In essence, we want to show that

$$\mathrm{CE}(\eta_c, \eta_t; \alpha) < \mathrm{CE}(\eta_c, \eta_t; 1), \quad \text{i.e.,} \quad 0 < \mathrm{CE}(\eta_c, \eta_t; 1) - \mathrm{CE}(\eta_c, \eta_t; \alpha), \quad (92)$$

for some $\alpha_0 > 1$. Continuing with the right-hand side of (92), we have

$$\mathrm{CE}(\eta_c, \eta_t; 1) - \mathrm{CE}(\eta_c, \eta_t; \alpha) \quad (93)$$
$$= \mathbb{E}_{X\sim M}\left[-\eta_c(X)\log\eta_t(X) - (1-\eta_c(X))\log(1-\eta_t(X))\right] - \mathbb{E}_{X\sim M}\left[-\eta_c(X)\log t_{\ell^\alpha}(\eta_t(X)) - (1-\eta_c(X))\log(1 - t_{\ell^\alpha}(\eta_t(X)))\right] \quad (94)$$
$$= \mathbb{E}_{X\sim M}\left[\eta_c(X)\log\frac{t_{\ell^\alpha}(\eta_t(X))}{\eta_t(X)} + (1-\eta_c(X))\log\frac{1 - t_{\ell^\alpha}(\eta_t(X))}{1-\eta_t(X)}\right] \quad (95)$$
$$= \mathbb{E}_{X\sim M}\left[\eta_c(X)\log\frac{\eta_t(X)^{\alpha-1}}{\eta_t(X)^\alpha + (1-\eta_t(X))^\alpha} + (1-\eta_c(X))\log\frac{(1-\eta_t(X))^{\alpha-1}}{\eta_t(X)^\alpha + (1-\eta_t(X))^\alpha}\right], \quad (96)$$

where we used the linearity of the expectation and some algebra to combine the expressions. We want to show that the expression in brackets in (96) is strictly positive, as this implies that $0 < \mathrm{CE}(\eta_c, \eta_t; 1) - \mathrm{CE}(\eta_c, \eta_t; \alpha)$; in words, that the $\alpha$-Bayes tilted estimate untwists the Bayes blunting twist. Continuing, we examine the expression in brackets in (96):

$$f_{\alpha,\eta_c}(\eta_t) = \eta_c\log\frac{\eta_t^{\alpha-1}}{\eta_t^\alpha + (1-\eta_t)^\alpha} + (1-\eta_c)\log\frac{(1-\eta_t)^{\alpha-1}}{\eta_t^\alpha + (1-\eta_t)^\alpha} \quad (97)$$
$$= (\alpha-1)\,\eta_c\log\eta_t + (\alpha-1)(1-\eta_c)\log(1-\eta_t) - \log\left(\eta_t^\alpha + (1-\eta_t)^\alpha\right), \quad (98)$$

where we implicitly fix $X = x$ and consider scalar-valued $\eta_c, \eta_t \in [0,1]$ and $\alpha \in \mathbb{R}_+\setminus\{1\}$. We note that $\alpha < 0$ does not need to be considered in this analysis, since that regime of $\alpha$ is primarily useful for very strong twists due to its symmetry property (recall from Lemma 8 that $t_{\ell^\alpha}$ is symmetric upon permuting $(\eta_t, \alpha)$ and $(1-\eta_t, -\alpha)$), i.e., it is not useful for Bayes blunting twists, which reduce confidence in the posterior but do not flip its sign across $\eta_t \geq 1/2$.

Figure 6: Characteristic plot of the non-negative part of $f_{\alpha,\eta_c}(\eta_t)$, where $\eta_c = 0.9$, as a function of $\eta_t$ for several values of $\alpha$. Recall that the non-negative region of $f$ indicates where using the Bayes tilted $\alpha$-estimate, as measured with the cross-entropy for $\alpha$ given in (7), is strictly less than the $\alpha = 1$ cross-entropy. Also recall that a Bayes blunting twist has the capability to shift $\eta_t$ anywhere in $[.5, \eta_c = 0.9]$. We see that for small $\alpha$, more twisted probabilities are "covered", whereas for large $\alpha$, fewer twisted probabilities are "covered"; however, the large $\alpha$'s induce a large positive-magnitude increase (ultimately measured by the KL-divergence) over the proper $\alpha = 1$. A key takeaway is that a fixed $\alpha$ (small enough) can correct a Bayes blunting twist for almost all $x \in \mathcal{X}$. However, it is not necessarily optimal, as a perfectly tuned $\alpha$-mapping will use larger $\alpha$'s to optimally correct strongly twisted posteriors, inducing more gains over the $\alpha = 1$ cross-entropy.
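Before detailing the properties of $f$, here is a minimal numerical check of (87) and of the bracketed gain (97) (helper names ours; the specific $(\eta_c, \eta_t)$ pair is an illustrative Bayes blunting twist, not one from the paper):

```python
import numpy as np

iota = lambda u: np.log(u / (1.0 - u))                    # the logit (Figure 5)
t_alpha = lambda eta, a: eta**a / (eta**a + (1 - eta)**a)  # Lemma 8(b)

# Lemma 8(c): alpha* = iota(eta_c)/iota(eta_t) maps the twisted posterior back.
eta_c, eta_t = 0.9, 0.6                                    # a Bayes blunting twist
a_star = iota(eta_c) / iota(eta_t)
assert a_star > 1.0                                        # Lemma 8(d)
assert abs(t_alpha(eta_t, a_star) - eta_c) < 1e-12

def f(eta_t, eta_c, a):
    # The bracketed expression (97)-(98).
    return ((a - 1) * eta_c * np.log(eta_t)
            + (a - 1) * (1 - eta_c) * np.log(1 - eta_t)
            - np.log(eta_t**a + (1 - eta_t)**a))

for a in [1.5, 3.0, 10.0]:
    assert abs(f(0.5, 0.9, a)) < 1e-12   # f vanishes at eta_t = 1/2 (ZEROES below)
    assert f(0.6, 0.9, a) > 0            # gain over alpha = 1 (cf. Figure 6)
```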
To build intuition for $f$, see Figure 6 for a plot of this function. Formally, we take note of the following observations/properties of $f$:

1. CONTINUITY. From (98), it can be readily shown that for any fixed $\eta_c, \eta_t \in [0,1]$, $f_{\alpha,\eta_c}(\eta_t)$ is continuous in $\alpha \geq 1$.

2. CONCAVITY. For arbitrarily fixed $\eta_c$ and for any $\alpha > 1$, $f_{\alpha,\eta_c}(\cdot)$ is concave in $\eta_t$, since (from (97)) the composition of a concave function with a non-decreasing concave function yields a concave function. As a side note, observe that $f_{\alpha,\eta_c}(\cdot)$ is convex for $0 < \alpha < 1$; thus this regime of $\alpha$ does not untwist Bayes blunting twists. Regarding the (increasing/decreasing) concavity of $f_{\alpha,\eta_c}(\cdot)$ for any fixed $\eta_c \in [0,1]$ as a function of $\alpha$: traditionally, a second-derivative argument could indicate whether concavity is increasing or decreasing as a function of $\alpha$. Unfortunately, $\frac{d^2}{d\eta_t^2}f_{\alpha,\eta_c}(\eta_t)$ is an unwieldy analytical expression. However, using a Taylor series approximation of $\frac{d^2}{d\eta_t^2}f_{\alpha,\eta_c}(\eta_t)$ near $\eta_t = 1/2$, we find that the dominating term is $\alpha^2$. Thus, while not a proof, this indicates that the concavity of $f_{\alpha,\eta_c}(\cdot)$ increases as $\alpha$ increases beyond 1, which is sufficient for our purposes in the sequel.

3. ZEROES. It can be readily shown that for every $\eta_c \in [0,1]$, $f_{\alpha,\eta_c}(1/2) = 0$ for any $\alpha > 1$. Further, it can be shown that for any $\eta_c \in [0,1]$, $\lim_{\alpha\to 1^+}f_{\alpha,\eta_c}(\eta_c) \to 0$. Thus, the exact values of $\eta_t$ for the other zero of $f_{\alpha,\eta_c}(\cdot)$ (not $\eta_t = 1/2$), for each $\alpha > 1$, are given by the solution to the following transcendental equation:

$$\eta_t = (1-\eta_t)^{1 - \frac{1}{\eta_c}}\left(\eta_t^\alpha + (1-\eta_t)^\alpha\right)^{\frac{1}{\eta_c(\alpha-1)}}, \quad (99)$$

which can be rewritten as

$$\log\left(\eta_t(1-\eta_t)^{\frac{1}{\eta_c}-1}\right) = \frac{\alpha}{\eta_c(\alpha-1)}\log\eta_t + \frac{1}{\eta_c(\alpha-1)}\log\left(1 + \left(\frac{1}{\eta_t}-1\right)^\alpha\right), \quad (100)$$

and can also be rewritten as

$$\left(\frac{\eta_t}{1-\eta_t}\right)^{\eta_c(\alpha-1)} = \left(\frac{\eta_t}{1-\eta_t}\right)^{\alpha}(1-\eta_t) + (1-\eta_t). \quad (101)$$

Figure 7: Symmetric image of Figure 6 for $\eta_c = 0.1$.

Suppose $\eta_c > 1/2$; then, since we have a Bayes blunting twist, $\eta_c \geq \eta_t \geq 1/2$. Letting $\alpha \to \infty$, note that the second term on the right-hand side of (100) goes to 0 (since $1/\eta_t - 1 \leq 1$). Thus, we can solve for the zeroes from (100) when $\alpha = \infty$ by examining

$$\log\left(\eta_t(1-\eta_t)^{\frac{1}{\eta_c}-1}\right) = \frac{1}{\eta_c}\log\eta_t. \quad (102)$$

After some manipulations, we obtain $\log(1/\eta_t - 1) = 0$, which is only satisfied when $\eta_t = 1/2$. Thus, for $\eta_c > 1/2$ and $\alpha \to \infty$, both zeroes of (97) converge at $\eta_t = 1/2$. For $\eta_c < 1/2$, the same argument holds, mutatis mutandis. Lastly, from (101), it can be shown that, given fixed $\eta_c$ and $\eta_t$ under a Bayes blunting twist, a solution $\alpha > 1$ must exist, through reasoning about the rate of increase of $(\eta_t/(1-\eta_t))^{\alpha-1}$ as a function of $\alpha$, which is common to both sides.

4. MAXIMUM. It can also be shown that the maximum of $f_{\alpha,\eta_c}(\cdot)$ for each $\alpha > 1$ as a function of $\eta_t$ is given by the following transcendental equation:

$$\frac{\alpha(1-\eta_c) + \eta_c - \eta_t}{\alpha\eta_c - \eta_c + \eta_t} = \left(\frac{1}{\eta_t}-1\right)^\alpha. \quad (103)$$

One key observation we can make from (103) is that, as $\alpha$ increases, the term on the right-hand side grows (or decays) exponentially with $\alpha$, whereas the term on the left-hand side is linear in $\alpha$. With case-by-case analysis, i.e., for $\eta_c > 1/2$ or $\eta_c < 1/2$, it can be reasoned that as $\alpha$ increases, the solution $\eta_t$ to (103) approaches $1/2$.
A second key observation we can make from (103) is that, as $\alpha \to 1^+$, the solution $\eta_t$ to (103) approaches $\eta_c/2 + 1/4$. This is readily observed by setting $\alpha = 1 + \epsilon$, for some $\epsilon > 0$, and $\eta_t = \frac{\eta_c - 1/2}{2} + \frac{1}{2} = \eta_c/2 + 1/4$, along with a Taylor series approximation of $(1/\eta_t - 1)^{1+\epsilon}$ for $\epsilon$ near 0.

Intuitively, the remainder of the proof consists of a covering argument. In words, we choose the least twisted $\eta_c$, i.e., $\eta_c^*$, under $\eta_c \mapsto \eta_t$, via its associated $\alpha_0 > 1$ (as given in Lemma 8(d)); then we notice that this induces non-negativity of $f_{\alpha_0,\eta_c^*}(\eta_t)$ given in (97). Next, we argue that this choice of $\alpha_0 > 1$ implies that all $\eta_c$ are covered, in the sense of inducing non-negativity of (97), i.e., $f_{\alpha_0,\eta_c}(\eta_t) > 0$ for all $\eta_c$ under $\eta_c \mapsto \eta_t$. Finally, we use the non-negativity of the expectation to achieve the desired result, i.e., the left-hand side of (88).

Continuing, let $\eta_c \mapsto \eta_t$ be a strictly Bayes blunting twist. Thus, we have that either $(\eta_c < \eta_t \leq 1/2)$ or $(\eta_c > \eta_t \geq 1/2)$ for all $\eta_c$. By ZEROES of $f$, we have that for every $\eta_c \in [0,1]$, $f_{\alpha,\eta_c}(1/2) = 0$ for any $\alpha > 1$ and $\lim_{\alpha\to 1^+}f_{\alpha,\eta_c}(\eta_c) \to 0$. We also have that as $\alpha \to \infty$, both zeroes of (97) converge at $\eta_t = 1/2$. Thus, for every $\eta_c$, the second zero (the first one is at $\eta_t = 1/2$) continuously shifts (CONTINUITY) from being located at $\eta_t = \eta_c$ (as $\alpha \to 1$) to being located at $\eta_t = 1/2$ (as $\alpha \to \infty$). In order to identify the least twisted $\eta_c$ under $\eta_c \mapsto \eta_t$, let

$$\alpha^*(\eta_c, \eta_t) := \frac{\iota(\eta_c)}{\iota(\eta_t)}, \quad (104)$$

as stated in Lemma 8(c). By Lemma 8(d), we know that $\alpha^*(\eta_c, \eta_t) \geq 1$; furthermore, due to the strictness of the Bayes blunting twist $\eta_c \mapsto \eta_t$, we indeed have a strict inequality, i.e., $\alpha^*(\eta_c, \eta_t) > 1$ for all $\eta_c$ under $\eta_c \mapsto \eta_t$. Choose $\alpha_0 := \inf\{\alpha^*(\eta_c, \eta_t) : \eta_c\ \text{under}\ \eta_c \mapsto \eta_t\}$, where we break ties arbitrarily, and note that $\alpha_0 > 1$, again by strictness. Also note that there exists $\eta_c^*$ associated with $\alpha_0$ such that $\alpha_0 = \iota(\eta_c^*)/\iota(\eta_t)$ under $\eta_c \mapsto \eta_t$; in other words, $f_{\alpha_0,\eta_c^*}(\eta_t) > 0$ (due to $t_{\ell^\alpha}(\eta_t)$ reversing the effects of the Bayes blunting twist and tuning back to $\eta_c^*$). Thus, by CONTINUITY, CONCAVITY, and ZEROES of $f$ above and this choice of $\alpha_0 > 1$, we have that for all $\eta_c$, $f_{\alpha_0,\eta_c}(\eta_t) > 0$ under $\eta_c \mapsto \eta_t$, i.e., that all $\eta_c$ are covered, due to the ordering of the relative zeroes of $f$ induced by identifying $\eta_c^*$ through $\alpha_0$. Therefore, we obtain from (96) and (97) (i.e., the non-negativity of the expectation) that for the chosen $\alpha_0 > 1$ we have $0 < \mathrm{CE}(\eta_c, \eta_t; 1) - \mathrm{CE}(\eta_c, \eta_t; \alpha_0)$, i.e.,

$$\mathrm{CE}(\eta_c, \eta_t; 1) > \mathrm{CE}(\eta_c, \eta_t; \alpha_0), \quad (105)$$

as desired.

We now show that $\mathrm{CE}(\eta_c, \eta_t; \alpha_0) \geq \mathrm{CE}(\eta_c, \eta_t; \alpha^*)$, where $\alpha^*$ is a mapping $\alpha^* : \mathcal{X} \to \mathbb{R}_{>1}$, i.e., returning an $\alpha > 1$ for every $x \in \mathcal{X}$. By MAXIMUM above, we note that for a given $\eta_c$, the maximum of $f_{\alpha,\eta_c}(\cdot)$ moves from being achieved at $\eta_t = \eta_c/2 + 1/4$ to being achieved at $\eta_t = 1/2$ in the limit as $\alpha$ increases beyond 1. By CONCAVITY above, we also observe that $f_{\alpha,\eta_c}(\cdot)$ for a fixed $\eta_c$ appears (which is sufficient for the inequality) to become more strongly concave in general as $\alpha$ increases beyond 1. We also note, by CONTINUITY of $f$ above, that the maxima are continuous in $\alpha > 1$. Thus, under the strictly Bayes blunting twist $\eta_c \mapsto \eta_t$, for every $\eta_c$ there may exist an $\alpha > 1$ which induces a larger (positive) magnitude of $f_{\alpha,\eta_c}(\eta_t)$ than the fixed $\alpha_0 > 1$ we found previously (for (105)). Thus, there exists an optimal mapping $\alpha^* : \mathcal{X} \to \mathbb{R}_{>1}$ such that

$$\mathrm{CE}(\eta_c, \eta_t; \alpha_0) \geq \mathrm{CE}(\eta_c, \eta_t; \alpha^*). \quad (106)$$

Note that in the degenerate case, $\alpha^* = \alpha_0$ for every $x \in \mathcal{X}$. Therefore, combining (105) and (106), we obtain

$$\mathrm{CE}(\eta_c, \eta_t; 1) > \mathrm{CE}(\eta_c, \eta_t; \alpha_0) \geq \mathrm{CE}(\eta_c, \eta_t; \alpha^*), \quad (107)$$

which is the desired result.
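The covering argument can be illustrated end-to-end on a toy twist. The sketch below (our own 3-point example; $\alpha_0$ is taken as the infimum of the pointwise $\alpha^*$ values) verifies the ordering (91)/(107), also previewing the remark that follows:

```python
import numpy as np

t_a = lambda eta, a: eta**a / (eta**a + (1 - eta)**a)
iota = lambda u: np.log(u / (1 - u))

# A toy strictly Bayes blunting twist on a 3-point support (our own example).
eta_c = np.array([0.9, 0.8, 0.7])
eta_t = np.array([0.6, 0.65, 0.6])       # blunted toward 1/2, same side

def CE(alpha):
    q = t_a(eta_t, alpha)                 # tilted estimates (scalar or pointwise alpha)
    return np.mean(-eta_c * np.log(q) - (1 - eta_c) * np.log(1 - q))

a_map = iota(eta_c) / iota(eta_t)         # pointwise alpha*-mapping (104)
a0 = a_map.min()                          # fixed alpha_0 from the covering argument
assert a0 > 1.0
assert CE(1.0) > CE(a0) >= CE(a_map)      # the ordering (91)/(107)
# At alpha*, CE collapses to the clean conditional entropy H(eta_c):
H = np.mean(-eta_c * np.log(eta_c) - (1 - eta_c) * np.log(1 - eta_c))
assert abs(CE(a_map) - H) < 1e-12
```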
Lastly, note from Lemma 8(c) and (90) that $\alpha^* : \mathcal{X} \to \mathbb{R}_{>1}$ is indeed given by $\alpha^*(x) := \iota(\eta_c(x))/\iota(\eta_t(x))$ for every $x \in \mathcal{X}$, and hence note that $\mathrm{CE}(\eta_c, \eta_t; \alpha^*) = H(\eta_c)$, i.e., from (9), that $D_{\mathrm{KL}}(\eta_c, \eta_t; \alpha^*) = 0$.

A.7. Proof of Theorem 10

As explained in the main body, we prove a result more general than Theorem 10. However, briefly note that (13), like the statement provided in Theorem 9 in (10), is proved for CE, but the statement provided in the main body as KL is readily obtained by subtracting $H(\eta_c)$ from both sides of the inequality. First, we need a simple technical Lemma.

Lemma 14 For any $B > 0$, $|z| \leq B$, $\alpha \in \mathbb{R}$,

$$\log(1 + \exp(\alpha z)) \leq \log(1 + \exp(\alpha B)) - \frac{\alpha}{2}(B - z). \quad (108)$$

Proof We first note that, for $|z| \leq 1$, $\alpha \in \mathbb{R}$,

$$\log(1 + \exp(\alpha z)) \leq \frac{1+z}{2}\log(1 + \exp(\alpha)) + \frac{1-z}{2}\log(1 + \exp(-\alpha)) \quad (109)$$
$$= \log(1 + \exp(\alpha)) - \frac{\alpha(1-z)}{2}, \quad (110)$$

which indeed holds, as the LHS of (109) is convex in $z$ and the RHS is the equation of a line passing through the points $(-1, \log(1+\exp(-\alpha)))$ and $(1, \log(1+\exp(\alpha)))$. In (110), we use $\log(1+\exp(-\alpha)) = \log(\exp(-\alpha)(1+\exp(\alpha)))$ on the second term of the RHS. Hence, if instead $|z| \leq B$, then

$$\log(1 + \exp(\alpha z)) = \log\left(1 + \exp\left(\alpha B\cdot\frac{z}{B}\right)\right) \leq \log(1 + \exp(\alpha B)) - \frac{\alpha B}{2}\left(1 - \frac{z}{B}\right), \quad (111, 112)$$

as claimed.

We now show another Lemma, which bounds the log quantities appearing in the cross-entropy in (7), recalling that $\iota(u) := \log(u/(1-u))$.

Lemma 15 Fix $B > 0$. For any $x \in \mathcal{X}$ such that

$$\frac{1}{1 + \exp B} \leq \eta_t(x) \leq \frac{\exp B}{1 + \exp B}, \quad (113)$$

the following properties hold for the Bayes tilted estimate $t_{\ell^\alpha}$ of $\alpha$-loss:

$$-\log t_{\ell^\alpha}(\eta_t(x)) \leq \log(1 + \exp(\alpha B)) - \frac{\alpha}{2}\left(B + \iota(\eta_t(x))\right), \quad (114)$$
$$-\log(1 - t_{\ell^\alpha}(\eta_t(x))) \leq \log(1 + \exp(\alpha B)) - \frac{\alpha}{2}\left(B - \iota(\eta_t(x))\right). \quad (115)$$

Proof We note, using $z := \log\frac{1-\eta_t(x)}{\eta_t(x)}$, which satisfies $|z| \leq B$ from (113), and Lemma 14,

$$-\log t_{\ell^\alpha}(\eta_t(x)) = \log\left(1 + \left(\frac{1-\eta_t(x)}{\eta_t(x)}\right)^\alpha\right) = \log\left(1 + \exp\left(\alpha\log\frac{1-\eta_t(x)}{\eta_t(x)}\right)\right)$$
$$\leq \log(1 + \exp(\alpha B)) - \frac{\alpha}{2}\left(B - \log\frac{1-\eta_t(x)}{\eta_t(x)}\right) = \log(1 + \exp(\alpha B)) - \frac{\alpha}{2}\left(B + \iota(\eta_t(x))\right), \quad (116)$$

and similarly,

$$-\log(1 - t_{\ell^\alpha}(\eta_t(x))) = \log\left(1 + \exp\left(-\alpha\log\frac{1-\eta_t(x)}{\eta_t(x)}\right)\right) \leq \log(1 + \exp(-\alpha B)) + \frac{\alpha}{2}\left(B + \iota(\eta_t(x))\right)$$
$$= \log(1 + \exp(\alpha B)) - \alpha B + \frac{\alpha}{2}\left(B + \iota(\eta_t(x))\right) = \log(1 + \exp(\alpha B)) - \frac{\alpha}{2}\left(B - \iota(\eta_t(x))\right), \quad (117)$$

as claimed.

Denote by $M(B)$ the distribution $M$ restricted to the support for which (113) holds a.s., and let $p(B)$ be the weight of this support in $M$. Let $\overline{M}(B)$ denote the restriction of $M$ to the complement of this support. We let $D(B)$ be the product distribution on examples ($\mathcal{X}\times\mathcal{Y}$) over the support of $M(B)$, induced by the marginal $M(B)$ and the posterior $\eta_c$ (see Reid & Williamson (2011, Section 4)). We are now in a position to show our generalization of Theorem 10.

Theorem 16 For any fixed $B > 0$, let

$$e(B) := \frac{\mathbb{E}_{(X,Y)\sim D(B)}\left[Y\,\iota(\eta_t(X))\right]}{B} \in [-1, 1], \quad (119)$$

and suppose we fix the scalar $\alpha := \alpha^*$ with $\alpha^* := \frac{1}{B}\,\iota\!\left(\frac{1+e(B)}{2}\right)$; then the following bound holds on the cross-entropy of the Bayes tilted estimate of the $\alpha$-loss:

$$\mathrm{CE}(\eta_c, \eta_t; \alpha^*) \leq p(B)\,H\!\left(\frac{1+e(B)}{2}\right) + (1-p(B))\left(\overline{e}(B)\log\frac{1+|e(B)|}{1-|e(B)|} + \frac{1-|e(B)|}{1+|e(B)|}\right), \quad (120)$$

where $\overline{e}(B) := \mathbb{E}_{(X,Y)\sim\overline{D}(B)}\left[\max\{0, -\mathrm{sign}(\alpha^*)\,Y\,\iota(\eta_t(X))\}\right]/B$ and $\overline{D}(B)$ is defined analogously to $D(B)$, with respect to $\overline{M}(B)$.

Remark: we notice this is indeed a generalization of Theorem 10, which corresponds to the case $p(B) = 1$. We also note $|e(B)| \leq 1$.
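Both technical Lemmas can be verified numerically. The following sketch (our choices of $B$ and $\alpha$ values) checks the chord bound (108) and the tilted-estimate bound (114) on a grid saturating (113):

```python
import numpy as np

B = 2.0
softplus = lambda x: np.log1p(np.exp(x))
z = np.linspace(-B, B, 2001)
for alpha in [-3.0, -1.0, 0.5, 2.0, 5.0]:
    lhs = softplus(alpha * z)
    rhs = softplus(alpha * B) - 0.5 * alpha * (B - z)      # Lemma 14, (108)/(112)
    assert np.all(lhs <= rhs + 1e-12)

# Lemma 15: for eta_t satisfying (113), i.e. |iota(eta_t)| <= B,
# -log t_alpha(eta_t) <= softplus(alpha*B) - (alpha/2)*(B + iota(eta_t)).
iota = lambda u: np.log(u / (1 - u))
t_a = lambda eta, a: eta**a / (eta**a + (1 - eta)**a)
eta = 1 / (1 + np.exp(-np.linspace(-B, B, 101)))           # exactly spans (113)
for alpha in [0.5, 2.0, 5.0]:
    lhs = -np.log(t_a(eta, alpha))
    rhs = softplus(alpha * B) - 0.5 * alpha * (B + iota(eta))
    assert np.all(lhs <= rhs + 1e-12)
print("Lemma 14 and Lemma 15 bounds hold on the sampled grid")
```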
Proof We remark that the cross-entropy (7) can be split as:

$$\mathrm{CE}(\eta_c, \eta_t; \alpha) := \mathbb{E}_{X\sim M}\left[-\eta_c(X)\log t_{\ell^\alpha}(\eta_t(X)) - (1-\eta_c(X))\log(1 - t_{\ell^\alpha}(\eta_t(X)))\right]$$
$$= p(B)\,K(\alpha) + (1-p(B))\,J(B), \quad (121)$$

$$K(\alpha) := \mathbb{E}_{X\sim M(B)}\left[-\eta_c(X)\log t_{\ell^\alpha}(\eta_t(X)) - (1-\eta_c(X))\log(1 - t_{\ell^\alpha}(\eta_t(X)))\right], \quad (122)$$
$$J(B) := \mathbb{E}_{X\sim\overline{M}(B)}\left[-\eta_c(X)\log t_{\ell^\alpha}(\eta_t(X)) - (1-\eta_c(X))\log(1 - t_{\ell^\alpha}(\eta_t(X)))\right]. \quad (123)$$

We now focus on a bound on $K(\alpha)$, which we achieve via Lemma 15:

$$K(\alpha) \leq \mathbb{E}_{X\sim M(B)}\left[\eta_c(X)\left(\log(1+\exp(\alpha B)) - \tfrac{\alpha}{2}(B + \iota(\eta_t(X)))\right) + (1-\eta_c(X))\left(\log(1+\exp(\alpha B)) - \tfrac{\alpha}{2}(B - \iota(\eta_t(X)))\right)\right]$$
$$= \log(1+\exp(\alpha B)) - \frac{\alpha}{2}\left(B + \mathbb{E}_{(X,Y)\sim D(B)}\left[Y\,\iota(\eta_t(X))\right]\right)$$
$$= \underbrace{\log(1+\exp(\alpha B)) - \frac{\alpha B}{2}\left(1 + e(B)\right)}_{:=\,L(\alpha)}, \quad (124)$$

where we recall

$$e(B) := \frac{\mathbb{E}_{(X,Y)\sim D(B)}\left[Y\,\iota(\eta_t(X))\right]}{B} \in [-1, 1]. \quad (125)$$

Notice the change in distribution in (124), where $D(B)$ is the product distribution on examples ($\mathcal{X}\times\mathcal{Y}$) over the support of $M(B)$, induced by the marginal $M(B)$ and the posterior $\eta_c$ (see Reid & Williamson (2011, Section 4)). We have

$$L'(\alpha) = B\left(\frac{\exp(B\alpha)}{1 + \exp(B\alpha)} - \frac{1 + e(B)}{2}\right), \quad (126)$$

which zeroes for

$$\alpha^* = \frac{1}{B}\log\frac{1 + e(B)}{1 - e(B)}. \quad (127)$$

Further, we have that

$$L''(\alpha) = \frac{B^2\exp(\alpha B)}{(\exp(\alpha B) + 1)^2}, \quad (128)$$

and plugging in (127) yields

$$L''(\alpha^*) = \frac{B^2\left(1 - e^2(B)\right)}{4}. \quad (129)$$

Note that for fixed $B > 0$, as $|\alpha|$ increases in (128), $L''(\alpha)$ decreases. Thus, when the magnitude of $\alpha^*$ is large (due to the distribution and twist), this implies that there is more flatness near the choice of $\alpha^*$. Hence, in these regimes, a choice of $\alpha_0$ close enough to $\alpha^*$ should have similar performance in practice. Continuing with the main line, plugging (127) into (124) yields

$$K(\alpha^*) \leq \log(1 + \exp(B\alpha^*)) - \frac{B\alpha^*}{2}(1 + e(B)) \quad (130)$$
$$= \log\frac{2}{1 - e(B)} - \frac{1 + e(B)}{2}\log\frac{1 + e(B)}{1 - e(B)} \quad (131, 132)$$
$$= H\!\left(\frac{1 + e(B)}{2}\right), \quad (133)$$

which is the statement of Theorem 10.

Figure 8: A plot illustrating the closeness of (124), where $K(\alpha)$ is given in blue and $L(\alpha)$ is given in red, for a toy distribution: $x \sim \mathrm{Uniform}([-B, B])$ (recall $B > 0$ is the clipping threshold, and we set $B = 2$ here), where $\eta_c(x) = (1+\exp(-x/a))^{-1}$ and $\eta_t(x) = (1+\exp(-x/b))^{-1}$ such that $a = 10$ and $b = 0.6$.

We now focus on $J(B)$. Since $\log(1 + \exp(-z)) \approx \exp(-z)$ via an order-1 Taylor expansion, it follows that if $z \geq C$ for some $C > 0$, then $\log(1 + \exp(-z)) \leq \exp(-C)$. Equivalently, we get

$$z \geq C \implies \log(1 + \exp(z)) \leq z + \exp(-C). \quad (134)$$

By symmetry, we have

$$z \leq -C \implies \log(1 + \exp(z)) \leq \exp(-C), \quad (135)$$
$$|z| \geq C \implies \log(1 + \exp(z)) \leq \max\{0, z\} + \exp(-C). \quad (136)$$

By definition, we have for any $x$ in the support of $\overline{M}(B)$,

$$\left|\log\frac{1 - \eta_t(x)}{\eta_t(x)}\right| > B, \quad (137)$$

so, considering $C := B|\alpha^*|$, we have from (114) and (117):

$$J(B) = \mathbb{E}_{(X,Y)\sim\overline{D}(B)}\left[\log\left(1 + \exp\left(Y\alpha^*\log\frac{1-\eta_t(X)}{\eta_t(X)}\right)\right)\right] \quad (138)$$
$$\leq \mathbb{E}_{(X,Y)\sim\overline{D}(B)}\left[\max\left\{0, -Y\alpha^*\,\iota(\eta_t(X))\right\}\right] + \exp(-B|\alpha^*|)$$
$$= |\alpha^*|\,\mathbb{E}_{(X,Y)\sim\overline{D}(B)}\left[\max\{0, -\mathrm{sign}(\alpha^*)\,Y\,\iota(\eta_t(X))\}\right] + \frac{1 - |e(B)|}{1 + |e(B)|}$$
$$= \overline{e}(B)\log\frac{1 + |e(B)|}{1 - |e(B)|} + \frac{1 - |e(B)|}{1 + |e(B)|}, \quad (139)$$

which completes the proof of Theorem 16 after replacing the expression of $\alpha^*$.

Remarks: Theorem 16 calls for several remarks.

Gains with respect to the "proper" choice $\alpha = 1$: the case we develop is simplistic, but allows a graphical comparison of the gains that the Theorem allows us to get compared to the choice $\alpha = 1$, which we recall corresponds to the (proper) logistic loss. Suppose $p(B) = 1$, so that the cross-entropy $\mathrm{CE}(\eta_c, \eta_t; \alpha)$ in (121) reduces to $K(\cdot)$; suppose $B = 1$ and that all logits take $\pm 1$ values a.e.,

$$z(x) := \log\frac{\eta_t(x)}{1 - \eta_t(x)} \in \{-1, 1\},$$

which can be achieved by clamping, and let $p := \mathbb{P}_{(X,Y)\sim M(1)}[Y z(X) = 1]$, which gives $e(1) = 2p - 1$ and

$$\alpha^* = \log\frac{p}{1-p}, \quad (140)$$
$$\mathrm{CE}(\eta_c, \eta_t; \alpha^*) \leq H(p), \quad (141)$$

from Theorem 16.

Figure 9: Comparison between the cross-entropy of the logistic loss ($\alpha = 1$) and that of the $\alpha$-loss for the scalar correction in (141) in Theorem 16.
The properness choice $\alpha = 1$, however, gives

$$\mathrm{CE}(\eta_c, \eta_t; 1) = K(1) = \mathbb{E}_{(X,Y)\sim M(1)}\left[\log(1 + \exp(-Y z(X)))\right] \quad (142)$$
$$= p\log(1 + \exp(-1)) + (1-p)\log(1 + \exp(1)) \quad (143)$$
$$= \log(1 + e) - p. \quad (144)$$

Figure 9 plots $\mathrm{CE}(\eta_c, \eta_t; \alpha^*)$ (141) vs. $\mathrm{CE}(\eta_c, \eta_t; 1)$ (144). We remark that $\mathrm{CE}(\eta_c, \eta_t; \alpha^*) \leq \mathrm{CE}(\eta_c, \eta_t; 1)$, and the difference is especially large as $p \to \{0, 1\}$, for which $\mathrm{CE}(\eta_c, \eta_t; \alpha^*) \to 0$, while we always have $\mathrm{CE}(\eta_c, \eta_t; 1) > 0.3$, $\forall p$.

Incidence of computing $\alpha^*$ on an estimate of $e(B)$: Theorem 16 can be refined if, instead of the true value $e(B)$, we have access to an estimate $\hat{e}(B)$. In this case, we can refine the proof of the Theorem from the series of equations ending in (133). We remark that

$$\frac{d}{dz}\,H\!\left(\frac{1+z}{2}\right) = \frac{1}{2}\log\frac{1-z}{1+z}, \quad (145)$$

so, since $H$ is concave, we have for any $e(B), \hat{e}(B)$,

$$H\!\left(\frac{1 + e(B)}{2}\right) \leq H\!\left(\frac{1 + \hat{e}(B)}{2}\right) + \frac{e(B) - \hat{e}(B)}{4}\log\frac{1 - \hat{e}(B)}{1 + \hat{e}(B)}$$
$$\leq H\!\left(\frac{1 + \hat{e}(B)}{2}\right) + \frac{|e(B) - \hat{e}(B)|}{4}\log\frac{1 + |\hat{e}(B)|}{1 - |\hat{e}(B)|}$$
$$= H\!\left(\frac{1 + \hat{e}(B)}{2}\right) + \frac{|e(B) - \hat{e}(B)|}{4}\log\left(1 + \frac{2|\hat{e}(B)|}{1 - |\hat{e}(B)|}\right)$$
$$\leq H\!\left(\frac{1 + \hat{e}(B)}{2}\right) + \frac{|e(B) - \hat{e}(B)|\,|\hat{e}(B)|}{2(1 - |\hat{e}(B)|)}, \quad (146)$$

where we have used $\log(1 + z) \leq z$ for the last inequality.

Polarity of $\alpha^*$: as presented in the main body, the state of the art defines the $\alpha$-loss only for $\alpha \geq 0$. Theorem 16, and more specifically its proof, hints at why alleviating this constraint is important and corresponds to especially difficult cases. We have the general rule $\alpha^* \leq 0$ iff $e(B) \leq 0$, which indicates that the twisted posterior tends to be small when the clean posterior tends to be large. Since the Bayes tilted estimate is symmetric if we switch the couple $(\alpha, \eta_t)$ for $(-\alpha, 1-\eta_t)$, $\alpha \leq 0$ provokes a change of polarity in the Bayes tilted estimate compared to the twisted posterior. It thus corrects the twisted posterior. We emphasize that such a situation happens for especially damaging twists (in particular, not Bayes blunting ones).

A.8. Pseudo-Inverse Link

Derivation of (16): From the definition of $F$ in (3), we have that

$$F(z) := (-\underline{L})^\star(-z), \quad z \in \mathbb{R}. \quad (147)$$

Given $\underline{L}(u)$ associated with an underlying CPE loss, we have that

$$(-\underline{L})^\star(z) = \sup_{u \in [0,1]}\left[zu + \underline{L}(u)\right]. \quad (148)$$

Taking the derivative of the expression in brackets and solving for $u$, we obtain (assuming strictly concave $\underline{L}$)

$$z + \underline{L}'(u) = 0 \quad (149)$$
$$\Longleftrightarrow u = (\underline{L}')^{-1}(-z). \quad (150)$$

Plugging (150) back into the expression in brackets in (148), we obtain

$$(-\underline{L})^\star(z) = z\,(\underline{L}')^{-1}(-z) + \underline{L}\left((\underline{L}')^{-1}(-z)\right), \quad (151)$$
$$F(z) = (-\underline{L})^\star(-z) = -z\,(\underline{L}')^{-1}(z) + \underline{L}\left((\underline{L}')^{-1}(z)\right). \quad (152)$$

Then, we have by the chain rule that

$$\frac{d}{dz}F(z) = \frac{d}{dz}\left[-z\,(\underline{L}')^{-1}(z) + \underline{L}\left((\underline{L}')^{-1}(z)\right)\right] \quad (153)$$
$$= -(\underline{L}')^{-1}(z) - z\,\frac{d}{dz}(\underline{L}')^{-1}(z) + \underline{L}'\left((\underline{L}')^{-1}(z)\right)\frac{d}{dz}(\underline{L}')^{-1}(z) \quad (154)$$
$$= -(\underline{L}')^{-1}(z), \quad (155)$$

where the last step is obtained since $\underline{L}'((\underline{L}')^{-1}(z))\,\frac{d}{dz}(\underline{L}')^{-1}(z) = z\,\frac{d}{dz}(\underline{L}')^{-1}(z)$. Thus, we have that

$$-F'(z) = (\underline{L}')^{-1}(z) = (-\underline{L}')^{-1}(-z). \quad (156)$$

From property (D) of Lemma 12 in Appendix A.3, in (47), namely that $\underline{L}'(u) = \ell_1(t_\ell(u)) - \ell_{-1}(t_\ell(u))$, and from (156), we get that

$$-F'(z) = (\ell_{-1}\circ t_\ell - \ell_1\circ t_\ell)^{-1}(-z), \quad (157)$$

as desired.

PIL Approximation: Let $\alpha \in [-\infty, \infty]$, and define the conjugate $\alpha_c$ such that $1/\alpha_c + 1/\alpha = 1$, using by extension $\alpha_c(\pm\infty) = 1$, $\alpha_c(1) = \pm\infty$. If we were to exactly implement a boosting algorithm for the $\alpha$-loss, we would have to find the exact inverse of (16), which would require inverting

$$-\underline{L}'(v) = \alpha_c\,t_\ell(v)^{1/\alpha_c} - \alpha_c\,t_\ell(1-v)^{1/\alpha_c}.$$

Owing to the difficulty of carrying out this step, we choose a sidestep that makes inversion straightforward and can fall within the conditions to apply Theorem 11, thus making PILBOOST a boosting algorithm for the $\alpha$-loss of interest.
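As a sanity check on the derivation of (156), for the log-loss everything is computable in closed form: $-F'$ should be the sigmoid, i.e., the inverse of $\underline{L}'(u) = \log((1-u)/u)$. A short sketch (names ours):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
z = np.linspace(-6, 6, 121)

# F(z) = log(1 + exp(-z)) from (30), so -F'(z) = sigmoid(-z).
minus_Fprime = sigmoid(-z)

# L'(u) = log((1-u)/u) for the binary entropy (21); its inverse is sigmoid(-z).
Lprime = lambda u: np.log((1 - u) / u)
u = sigmoid(-z)
assert np.allclose(Lprime(u), z)           # confirms u = (L')^{-1}(z)
assert np.allclose(minus_Fprime, u)        # confirms (156)
print("-F' = (L')^{-1}: the inverse canonical link of the log-loss is the sigmoid")
```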
The trick does not just hold for the $\alpha$-loss, so we describe it for a general loss $\ell$, assuming for simplicity that $\ell_1(1) = \ell_{-1}(0) = 0$ and that $t_\ell, \ell_1, \ell_{-1}$ are invertible with $\ell_1, \ell_{-1}$ non-negative -- conditions that would hold for many popular losses (log, square, Matusita, etc.), and for the $\alpha$-loss.

Figure 10: CIL link $\tilde{f}_\ell$ vs. inverse link $-\underline{L}'$ for the ($\alpha = 5$)-loss. Notice the quality of the approximation.

We then approximate the link $-\underline{L}'$ by using just one of $\ell_{-1}$ or $\ell_1$ depending on their argument, while ensuring the functions match at $0, 1/2, 1$. We name $\tilde{f}_\ell$ the clipped inverse link, CIL. Letting $a^-_\ell := \ell_1(0)/(\ell_1(0) - \ell_1(1/2))$ and $a^+_\ell := \ell_{-1}(1)/(\ell_{-1}(1) - \ell_{-1}(1/2))$, our link approximation makes use of the following function: $f_\ell(u) := f^-_\ell(u)$ if $u \leq 1/2$ and $f_\ell(u) := f^+_\ell(u)$ otherwise, with the shorthands

$$f^-_\ell(u) := a^-_\ell\left(\ell_1(1/2) - \ell_1(t_\ell(u))\right), \quad f^+_\ell(u) := a^+_\ell\left(\ell_{-1}(t_\ell(u)) - \ell_{-1}(1/2)\right).$$

The following Lemma shows, in addition to properties of $f_\ell$, the expression obtained for the clipped inverse link for a general CPE loss.

Lemma 17 $f_\ell(u) = -\underline{L}'(u)$, $\forall u \in \{0, 1/2, 1\}$; furthermore, the clipped inverse link $\tilde{f}_\ell := f^{-1}_\ell$ is:

(i) $\tilde{f}_\ell(z) = 0$ if $z < -\ell_1(0)$;
(ii) $\tilde{f}_\ell(z) = t^{-1}_\ell\circ\ell^{-1}_1\left(\frac{\ell_1(1/2) - \ell_1(0)}{\ell_1(0)}\,z + \ell_1(1/2)\right)$ if $-\ell_1(0) \leq z < 0$;
(iii) $\tilde{f}_\ell(z) = t^{-1}_\ell\circ\ell^{-1}_{-1}\left(\frac{\ell_{-1}(1) - \ell_{-1}(1/2)}{\ell_{-1}(1)}\,z + \ell_{-1}(1/2)\right)$ if $0 \leq z < \ell_{-1}(1)$;
(iv) $\tilde{f}_\ell(z) = 1$ if $z \geq \ell_{-1}(1)$.

Furthermore, $\tilde{f}_\ell$ is continuous, and if (S) and (D) hold, then $\tilde{f}_\ell$ is differentiable on $\mathbb{R}$ (with the only possible exceptions of $\{-\ell_1(0), \ell_{-1}(1)\}$).

The proof is immediate once we remark that $\ell_1(1) = \ell_{-1}(0) = 0$ brings "properness for the extremes", i.e., $0 \in t_\ell(0)$, $1 \in t_\ell(1)$. We now give the expressions of the formulas of interest regarding Lemma 17 for the $\alpha$-loss.

Lemma 18 We have for the $\alpha$-loss:

$$f_\ell(u) = \begin{cases} \alpha_c\left(\left(\dfrac{2u^\alpha}{u^\alpha + (1-u)^\alpha}\right)^{1/\alpha_c} - 1\right) & \text{if}\ u \leq 1/2, \\[2ex] \alpha_c\left(1 - \left(\dfrac{2(1-u)^\alpha}{u^\alpha + (1-u)^\alpha}\right)^{1/\alpha_c}\right) & \text{if}\ u \geq 1/2, \end{cases} \quad (158)$$

$$\tilde{f}_\ell(z) = \begin{cases} \dfrac{(\alpha_c + z)^{\alpha_c/\alpha}}{(\alpha_c + z)^{\alpha_c/\alpha} + \left(2\alpha_c^{\alpha_c} - (\alpha_c + z)^{\alpha_c}\right)^{1/\alpha}} & \text{if}\ -\alpha_c \leq z \leq 0, \\[2ex] \dfrac{\left(2\alpha_c^{\alpha_c} - (\alpha_c - z)^{\alpha_c}\right)^{1/\alpha}}{(\alpha_c - z)^{\alpha_c/\alpha} + \left(2\alpha_c^{\alpha_c} - (\alpha_c - z)^{\alpha_c}\right)^{1/\alpha}} & \text{if}\ 0 \leq z \leq \alpha_c. \end{cases} \quad (159)$$

Rewritten with $\alpha_c = \alpha/(\alpha-1)$, we have that

$$\tilde{f}_\ell(z) = \begin{cases} \dfrac{\left(\frac{\alpha}{\alpha-1} + z\right)^{\frac{1}{\alpha-1}}}{\left(\frac{\alpha}{\alpha-1} + z\right)^{\frac{1}{\alpha-1}} + \left(2\left(\frac{\alpha}{\alpha-1}\right)^{\frac{\alpha}{\alpha-1}} - \left(\frac{\alpha}{\alpha-1} + z\right)^{\frac{\alpha}{\alpha-1}}\right)^{\frac{1}{\alpha}}} & \text{if}\ -\frac{\alpha}{\alpha-1} \leq z \leq 0, \\[3ex] \dfrac{\left(2\left(\frac{\alpha}{\alpha-1}\right)^{\frac{\alpha}{\alpha-1}} - \left(\frac{\alpha}{\alpha-1} - z\right)^{\frac{\alpha}{\alpha-1}}\right)^{\frac{1}{\alpha}}}{\left(\frac{\alpha}{\alpha-1} - z\right)^{\frac{1}{\alpha-1}} + \left(2\left(\frac{\alpha}{\alpha-1}\right)^{\frac{\alpha}{\alpha-1}} - \left(\frac{\alpha}{\alpha-1} - z\right)^{\frac{\alpha}{\alpha-1}}\right)^{\frac{1}{\alpha}}} & \text{if}\ 0 \leq z \leq \frac{\alpha}{\alpha-1}. \end{cases} \quad (160)$$

Figure 11 plots (160) for several values of $\alpha$.

Figure 11: A plot of $\tilde{f}_\ell(z)$ as a function of $\alpha$, as given in (160).

Remark. It could be tempting to think that the clipped inverse link trivially comes from clipping the partial losses themselves, such as replacing $\ell_1(u)$ by 0 if $u \geq 1/2$ and symmetrically for $\ell_{-1}(u)$. This is not the case, as it would lead to $\underline{L}$ being piecewise constant and therefore $\underline{L}' = 0$ when defined.

We turn to a result that authorizes us to use Theorem 11 while virtually not needing (E) and (C) for the $\alpha$-loss. Denote $I_\alpha := \alpha_c\cdot[1 - (1/\alpha^4),\ 1]$ (see Fig. 10).

Lemma 19 Suppose $\alpha \geq 1.2$. For $\tilde{f}_\ell$ defined as in Lemma 17, there exists $K \leq 0.133$ such that the $\alpha$-loss satisfies:

$$\forall z \notin I_\alpha, \quad \left|\left(\tilde{f}_\ell - (-\underline{L}')^{-1}\right)(z)\right| \leq K/\alpha. \quad (161)$$

Remark the necessity of a trick, as we do not compute $(-\underline{L}')^{-1}$ in (161). The proof, in Section A.9, bypasses the difficulty by bounding the horizontal distance between the inverses. The Lemma can be read as: with the exception of an interval vanishing rapidly with $\alpha$, the difference between $\tilde{f}_\ell$ (which we can easily compute for the $\alpha$-loss) and $(-\underline{L}')^{-1}$ (which we do not compute for the $\alpha$-loss), in order or just pointwise (typically for $\alpha < 10$), is at most $0.14/\alpha$.
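The quality of approximation claimed in Lemma 19 (and visible in Figure 10) can be probed numerically: below we invert both (158) and the exact link by root-finding, rather than via the closed form (159)-(160) (a sketch for $\alpha = 5$; the brackets and tolerances passed to `brentq` are our choices):

```python
import numpy as np
from scipy.optimize import brentq

alpha = 5.0
ac = alpha / (alpha - 1.0)                      # conjugate exponent alpha_c
t = lambda u: u**alpha / (u**alpha + (1 - u)**alpha)

exact_link = lambda u: ac * (t(u)**(1 / ac) - (1 - t(u))**(1 / ac))      # -L'(u)
cil = lambda u: (ac * ((2 * t(u))**(1 / ac) - 1) if u <= 0.5             # (158)
                 else ac * (1 - (2 * (1 - t(u)))**(1 / ac)))

def invert(fn, z):
    # Both links are increasing from -alpha_c to alpha_c on (0, 1).
    return brentq(lambda u: fn(u) - z, 1e-9, 1 - 1e-9)

# Compare the two inverses outside the vanishing interval I_alpha of Lemma 19.
zs = np.linspace(-0.95 * ac, ac * (1 - 1 / alpha**4) - 1e-6, 50)
gap = max(abs(invert(cil, z) - invert(exact_link, z)) for z in zs)
print(f"max inverse-link gap: {gap:.4f}  (Lemma 19 bound: {0.133 / alpha:.4f})")
```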
We now show how we can virtually "get rid of" (E) and (C) in such a context to apply Theorem 11. Consider the following assumptions: (i) no edge falls in $I_\alpha$; (ii) the weak learner guarantees $\gamma = 0.14$; (iii) the average weight, $\overline{w}_j := \frac{1}{m}\sum_i w_{ij}$, satisfies $\overline{w}_j \geq 0.4$. Looking at Figure 10, we see that (i) is virtually not limiting at all; (ii) is a reasonable assumption on WL; remembering that a weight has the form $w = \tilde{f}_\ell(-yH(x))$, we see that (iii) requires $H$ to be not "too good" -- see for example Figure 10, in which case $\overline{w} = 0.4$ implies an edge $yH \leq 0.8$. We now observe that, given (i), it is trivial to find $a_f$ to satisfy (C), since we focus only on one $\alpha$-loss. Suppose $\alpha \geq 2.7$, which approaches the average value of the $\alpha$'s in our experiments, and finally let $\zeta := 2.5/2.7 \approx 0.926$. Then we get the chain of inequalities:

$$\Delta(-F', \tilde{f}_s) \overset{\text{Lem. 19}}{\leq} M\cdot\frac{0.14}{\alpha} \overset{\text{(ii)}}{=} \frac{\gamma M}{\alpha} \overset{\text{(WLA)}}{\leq} \frac{1}{\alpha}\,|\tilde{e}_j| = \frac{1}{\alpha\,\overline{w}_j}\,|e_j| \overset{\text{(iii)}}{\leq} \frac{2.5}{2.7}\,|e_j| =: \zeta\,|e_j|, \quad (162)$$

and so (E) is implied by the weak learning assumption. To summarise, PILBOOST boosts the convex surrogate of the $\alpha$-loss without computing either it or its derivative, and achieves boosting-compliant convergence using only the classical assumptions of boosting (R, WLA). The proof of Lemma 19 being very conservative, we can expect that the smallest value of $K$ of interest is smaller than the one we use, indicating that (162) should hold for substantially smaller limit values in (ii, iii).
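For concreteness, here is one schematic reading of the resulting boosting loop (a sketch only: `weak_learner`, `f_tilde` and the calling convention are hypothetical stand-ins, and the PILBOOST pseudo-code in the main body is authoritative); it shows where the approximate inverse link enters (the weights) and where $a_f$ enters (the leveraging):

```python
import numpy as np

def pilboost(X, y, weak_learner, f_tilde, a_f, T):
    """Schematic PILBOOST-style loop (our reading, not the paper's pseudo-code).
    weak_learner(X, y, w) -> h, a callable with h(X) in [-1, 1]^m.
    f_tilde is the (approximate) pseudo-inverse link, e.g. (160) for alpha-loss."""
    m = X.shape[0]
    H = np.zeros(m)                      # current real-valued classifier scores
    ensemble = []
    for _ in range(T):
        w = f_tilde(-y * H)              # example weights w_i = f_tilde(-y_i H(x_i))
        h = weak_learner(X, y, w)
        margins = y * h(X)
        edge = np.mean(margins * w)      # expected edge e(j, f_tilde_s, H_t), cf. (189)
        beta = a_f * edge                # leveraging coefficient, cf. (195)
        H += beta * h(X)
        ensemble.append((beta, h))
    return ensemble, H
```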
A.9. Proof of Lemma 19

Define for short

$$F(u) := \left(\frac{u^\alpha}{u^\alpha + (1-u)^\alpha}\right)^{1/\alpha_c} - \left(\frac{(1-u)^\alpha}{u^\alpha + (1-u)^\alpha}\right)^{1/\alpha_c}, \quad (163)$$
$$G(u) := 1 - \left(\frac{2(1-u)^\alpha}{u^\alpha + (1-u)^\alpha}\right)^{1/\alpha_c}, \quad (164)$$

that we study for $u \geq 1/2$ (the bound also holds by construction for $u < 1/2$). Define the following functions:

$$g(u) := 1 - (2u)^{1/\alpha_c}, \quad (165)$$
$$h(z) := \frac{1}{1+z}, \quad (166)$$
$$i_\alpha(u) := u^\alpha, \quad (167)$$

and $u_\alpha := h\circ i_\alpha\circ h^{-1}(u)$, $f(u) := i_{\alpha_c}(1-u) - i_{\alpha_c}(u)$. We remark that $g$ is convex if $\alpha \geq 1$, while $f$ is concave. Both derivatives match at $1/2$ if

$$(\alpha_c)^2\,2^{1-\alpha_c} = 1, \quad (168)$$

whose roots satisfy $\alpha_c \leq 6$. It means that if $\alpha \geq 6/5 = 1.2$, then $(g - f) \geq 0$, and so if we measure

$$k^* := \arg\sup_k\ \sup_{x,x' : g(x) = f(x') = k} |x - x'|, \quad (169)$$

then $k^*$ is obtained for $x = 1$, for which $g(x) = 1 - 2^{1/\alpha_c} = k^*$. We then need to lower-bound $x'$ such that $f(x') = 1 - 2^{1/\alpha_c}$, which amounts to finding $x'$ such that $f(x') \leq 1 - 2^{1/\alpha_c}$, since $f$ is strictly decreasing. A series expansion reveals that for $x' = x$ and $K = \log 2$, $f(x') = g(x') + O(1/\alpha^2)$, and we thus get that there exists $K \leq \log 2$ such that

$$\sup_k\ \sup_{x,x' : g(x) = f(x') = k} |x - x'| \leq \frac{K}{\alpha}, \quad (170, 171)$$

or, similarly: for any ordinate value, the abscissae giving the value for $f$ and $g$ are distant by at most $K/\alpha$. The exact value of the constant is not so important compared to the dependence in $1/\alpha$. We now plug this into the $u_\alpha$ notation and ask the following question: suppose $f(u_\alpha) = g(v_\alpha) = k$. Since $|u_\alpha - v_\alpha| \leq K/\alpha$, what is the maximum difference $|u - v|$ as a function of $\alpha$? We observe

$$\frac{\partial}{\partial u}u_\alpha = \frac{\alpha\,(u(1-u))^{\alpha-1}}{(u^\alpha + (1-u)^\alpha)^2}, \quad (172)$$
$$\frac{\partial^2}{\partial u^2}u_\alpha = \frac{\alpha\,(u(1-u))^{\alpha-2}\left((\alpha - 2u + 1)\,u^\alpha - (\alpha + 2u - 1)(1-u)^\alpha\right)}{(u^\alpha + (1-u)^\alpha)^3}, \quad (173)$$

which shows the convexity of $u_\alpha$ as long as

$$\left(\frac{u}{1-u}\right)^\alpha \geq \frac{\alpha + 2u - 1}{\alpha - 2u + 1}, \quad (174)$$

a sufficient condition for which, given that the RHS increases with $u$, quickly vanishes as a constraint since $u \geq 1/2$. In particular, if $\alpha \geq 5$, $u_\alpha$ is strictly convex on the domain of interest. Otherwise, scrutinising the maximal values of the derivative for $\alpha \geq 1$ reveals that if we suppose $v \geq \delta$ for some $\delta$, then $|u - v|$ is maximal for $v = \delta$. So, suppose $v_\alpha = \varepsilon$ and solve for $u_\alpha = K/\alpha + \varepsilon$, which yields

$$u = \frac{(\alpha\varepsilon + K)^{1/\alpha}}{(\alpha\varepsilon + K)^{1/\alpha} + ((1-\varepsilon)\alpha - K)^{1/\alpha}}, \quad (177)$$

while the $v$ producing the largest $|u - v|$ is:

$$v = \frac{\varepsilon^{1/\alpha}}{\varepsilon^{1/\alpha} + (1-\varepsilon)^{1/\alpha}}, \quad (178)$$
$$|v - u| = (v - u)(\varepsilon) = \frac{\varepsilon^{1/\alpha}}{\varepsilon^{1/\alpha} + (1-\varepsilon)^{1/\alpha}} - \frac{(\alpha\varepsilon + K)^{1/\alpha}}{(\alpha\varepsilon + K)^{1/\alpha} + ((1-\varepsilon)\alpha - K)^{1/\alpha}}. \quad (179)$$

Fixing

$$\varepsilon = \frac{1}{\alpha^4}, \quad (180)$$

we then get, after separate series expansions as $\alpha \to +\infty$,

$$|v - u| = (v - u)(\varepsilon) = \frac{\log(1 + \log K)}{4\alpha} + o\!\left(\frac{1}{\alpha}\right). \quad (181, 182)$$

The "forbidden interval" for $v$ is then

$$\left[\frac{(\alpha^4 - 1)^{1/\alpha}}{1 + (\alpha^4 - 1)^{1/\alpha}},\ 1\right]; \quad (183)$$

what is more interesting for us is the corresponding forbidden image under $g(v_\alpha)$, which is thus

$$\alpha_c\left[1 - \frac{1}{\alpha^4},\ 1\right] := I_\alpha, \quad (184)$$

where we use the shorthand $z\,[a, b] := [az, bz]$. This, we note, translates directly into observable edges, since $g$ is the function we invert. Summarizing, we have shown that if (i) $\alpha \geq 1.2$, then for any $u, v$ such that $F(u) = G(v) \notin I_\alpha$, we have $|u - v| \leq 0.133/\alpha$. It suffices to remark that $I_\alpha$ represents the set of forbidden weights to get the statement of the Lemma.

A.10. Proof of Theorem 11

Remark. Consider the following setup: $h \in [-1, 1]$; use the logistic loss surrogate for $F$ and the derivative of Matusita's loss (just for illustration) as the pseudo-inverse-link approximation of the logistic loss for the weights (see e.g., (Nock & Nielsen, 2008), Table 1). We get in the worst case that

$$\Delta_j(-F', \tilde{f}) \leq 2\sup_x\left|\frac{1}{2}\left(1 - \frac{x}{\sqrt{1+x^2}}\right) - \frac{1}{1+\exp(x)}\right| < 0.24707.$$

So, all we need is $|e_j| > 0.24707$ to get $\zeta < 1$ constant. Since $|e_j| \in [0, 1]$, this constraint is more than reasonable and turns into a very reasonable penalty in $Q(F)$ (Theorem 11).

Remark. PILBOOST and its convergence analysis bring a side contribution of ours: it is impossible to exactly encode in standard machine types the inverse link of losses like the log-loss, so the implementations of classical boosting algorithms (Friedman, 2001; Friedman et al., 2000) can only rely on approximations of the inverse link function. Our results yield convergence guarantees for the implementations of such algorithms, and (E) can be interpreted and checked in the context of machine encoding. Two additional remarks hold regarding the convergence rate: first, the $1/\gamma^2$ dependence meets the general optimum for boosting (Alon et al., 2021); second, the $1/\varepsilon^2$ dependence parallels classical training convergence of convex optimization (Thekumparampil et al., 2020) (and references therein). There is, however, a major difference with such work: PILBOOST requires no function oracles for $F$ (function values, (sub)gradients, etc.). This sideways fork to minimizing $F$ pays (only) a $1/(1-\zeta)^2$ factor in convergence.

We proceed in two steps, assuming (WLA) holds for WL and (R) holds for the weak classifiers. In Step 1, we show that for any loss defined by $F$ twice differentiable, convex and non-increasing, and for any $z^* \in \mathbb{R}$, as long as $F$ satisfies assumptions (E) and (C) for $T$ iterations such that

$$\sum_{t=0}^{T-1}\overline{w}_t^2 \geq \frac{2\,\overline{F}\,(F(0) - F(z^*))}{\gamma^2(1-\zeta)^2(1-\pi^2)}, \quad (185)$$

we have the guarantee on the risk defined by $F$:

$$\mathbb{E}_{i\sim S}\left[F(y_i H_T(x_i))\right] \leq F(z^*). \quad (186)$$

Let $F$ be any twice differentiable, convex and non-increasing function. We wish to find a lower bound on the decrease of the expected loss computed using $F$:

$$\mathbb{E}_{i\sim S}\left[F(y_i H_t(x_i))\right] - \mathbb{E}_{i\sim S}\left[F(y_i H_{t+1}(x_i))\right], \quad (187)$$

where, with a slight abuse of notation, we let $H_t$ denote the learned real-valued classifier at iteration $t$. We make use of a similar proof technique as in Nock & Williamson (2019, Theorem 7). Suppose

$$H_{t+1} = H_t + \beta_j h_j, \quad (188)$$

index $j$ being returned by WL at iteration $t$. For any such index $j$, any $g : \mathbb{R} \to \mathbb{R}_+$ and any $H \in \mathbb{R}^{\mathcal{X}}$, let

$$e(j, g, H) := \mathbb{E}_{i\sim S}\left[y_i h_j(x_i)\,g(y_i H(x_i))\right] \quad (189)$$

denote the expected edge of $h_j$ on weights defined by the couple $(g, H)$. Furthermore, let

$$\Delta(g_1, g_2) := \left|e(j, g_1, H_t) - e(j, g_2, H_t)\right| \quad (190)$$

denote the discrepancy between two expected edges defined by $g_1, g_2$, respectively. There are two quantities we consider.
First, let

$$X := \mathbb{E}_{i\sim S}\left[(y_i H_{t+1}(x_i) - y_i H_t(x_i))\left(-F'(y_i H_t(x_i))\right)\right] \quad (191)$$
$$= -\beta_j\,\mathbb{E}_{i\sim S}\left[y_i h_j(x_i)\,F'(y_i H_t(x_i))\right] \quad (192)$$
$$= \beta_j\,e(j, -F', H_t) \quad (193)$$
$$\geq \beta_j\,\mathbb{E}_{i\sim S}\left[y_i h_j(x_i)\,\tilde{f}(-y_i H_t(x_i))\right] - \beta_j\,\Delta(-F', \tilde{f}_s) \quad (194)$$
$$= a_f\,e^2(j, \tilde{f}_s, H_t) - a_f\,e(j, \tilde{f}_s, H_t)\,\Delta(-F', \tilde{f}_s) \quad (195)$$
$$\geq a_f(1-\zeta)\,e^2(j, \tilde{f}_s, H_t), \quad (196)$$

where $\tilde{f}_s(z) := \tilde{f}(-z)$ and, finally, (196) makes use of assumption (E). The second quantity we define is:

$$Y(Z) := \mathbb{E}_{i\sim S}\left[(y_i H_t(x_i) - y_i H_{t+1}(x_i))^2\,F''(z_i)\right], \quad (197)$$

where $Z := \{z_1, z_2, \ldots, z_m\} \in \mathbb{R}^m$. Using assumption (R) and letting $\overline{F}$ be any real such that $\overline{F} \geq \sup F''(z)$, we obtain:

$$Y(Z) \leq \overline{F}\,\mathbb{E}_{i\sim S}\left[(y_i H_t(x_i) - y_i H_{t+1}(x_i))^2\right] = \overline{F}\,\beta_j^2\,\mathbb{E}_{i\sim S}\left[(y_i h_j(x_i))^2\right] \leq \overline{F}\,a_f^2\,M^2\,e^2(j, \tilde{f}_s, H_t). \quad (198)$$

A second-order Taylor expansion of $F$ yields that there exists $Z := \{z_1, z_2, \ldots, z_m\} \in \mathbb{R}^m$ such that:

$$\mathbb{E}_{i\sim S}\left[F(y_i H_{t+1}(x_i))\right] = \mathbb{E}_{i\sim S}\left[F(y_i H_t(x_i))\right] - X + \frac{1}{2}\,\mathbb{E}_{i\sim S}\left[(y_i H_t(x_i) - y_i H_{t+1}(x_i))^2\,F''(z_i)\right], \quad (199)$$

so that

$$\mathbb{E}_{i\sim S}\left[F(y_i H_t(x_i))\right] - \mathbb{E}_{i\sim S}\left[F(y_i H_{t+1}(x_i))\right] = X - \frac{1}{2}Y(Z) \geq \underbrace{a_f\left((1-\zeta) - \frac{\overline{F}\,a_f\,M^2}{2}\right)}_{:=\,Z(a_f)}\,e^2(j, \tilde{f}_s, H_t). \quad (200)$$

Suppose we fix $\pi \in [0, 1]$ and choose any

$$a_f \in \frac{1-\zeta}{\overline{F}\,M^2}\,[1-\pi, 1+\pi]. \quad (201)$$

We can check that

$$Z(a_f) \geq \frac{(1-\zeta)^2(1-\pi^2)}{2\,\overline{F}\,M^2}, \quad (202)$$

so that

$$\mathbb{E}_{i\sim S}\left[F(y_i H_t(x_i))\right] - \mathbb{E}_{i\sim S}\left[F(y_i H_{t+1}(x_i))\right] \geq \frac{(1-\zeta)^2(1-\pi^2)}{2\,\overline{F}\,M^2}\,e^2(j, \tilde{f}_s, H_t). \quad (203)$$

So, taking into account that for the first classifier we have $\mathbb{E}_{i\sim S}[F(y_i H_0(x_i))] = F(0)$, if we take any $z^* \in \mathbb{R}$ and we boost for a number of iterations $T$ satisfying (we use the notation $e_t$ as a summary for $e(j, \tilde{f}_s, H_t)$ with respect to PILBOOST):

$$\sum_{t=1}^{T} e_t^2 \geq \frac{2\,\overline{F}\,M^2\,(F(0) - F(z^*))}{(1-\zeta)^2(1-\pi^2)}, \quad (204)$$

then $\mathbb{E}_{i\sim S}[F(y_i H_T(x_i))] \leq F(z^*)$. If we now assume (WLA) holds, the LHS of (204) is at least $T\gamma^2$. Given that we choose $a = a_f$ in PILBOOST, we need to make sure (201) is satisfied for the loss defined by $F$, which translates to

$$\overline{F} \in \frac{1-\zeta}{a_f\,M^2}\,[1-\pi, 1+\pi], \quad (205)$$

and defines assumption (C). To complete Step 1, we normalize the edges. Letting

$$\tilde{w}_i := \frac{w_i}{\sum_k w_k}, \quad \overline{w}_t := \frac{1}{m}\sum_i w_{ti}, \quad \tilde{e}_t := \sum_i \tilde{w}_{ti}\,y_i h(x_i) \in [-M, M], \quad (206)$$

which is then properly normalized and such that (204) becomes, equivalently:

$$\sum_{t=0}^{T-1}\overline{w}_t^2\,\tilde{e}_t^2 \geq \frac{2\,\overline{F}\,M^2\,(F(0) - F(z^*))}{(1-\zeta)^2(1-\pi^2)}, \quad (207)$$

and so, under the (weak learning) assumption on $\tilde{e}_t$ that $|\tilde{e}_t| \geq \gamma M$, a sufficient condition for (207) is

$$\sum_{t=0}^{T-1}\overline{w}_t^2 \geq \frac{2\,\overline{F}\,(F(0) - F(z^*))}{\gamma^2(1-\zeta)^2(1-\pi^2)}, \quad (208)$$

completing Step 1 of the proof.

In Step 2, we show a result on the distribution of edges, i.e., margins. (208) contains all the intuition about how the rest of the proof unfolds, as we have two major steps: in Step 2.1, we translate the guarantee of (208) onto margins, and in Step 2.2, we translate the margin-based (208) into a readable guarantee in the boosting framework (we somehow get rid of the $\overline{w}_t^2$ in the LHS of (208)).

Step 2.1. Let $Z := \{z_1, z_2, \ldots, z_m\} \subset \mathbb{R}$ be a set of reals. Since $F$ is non-increasing, we have, $\forall u \in [0,1], \theta \geq 0$,

$$\mathbb{P}_i[z_i \leq \theta] > u \implies \mathbb{E}_i[F(z_i)] > (1-u)\inf_z F(z) + u\,F(\theta) := (1-u)F^* + u\,F(\theta), \quad (209)$$

so if we pick $z^*$ in (208) such that

$$F(z^*) := (1-u)F^* + u\,F(\theta), \quad (210)$$

then (208) implies $\mathbb{E}_{i\sim S}[F(y_i H_T(x_i))] \leq (1-u)F^* + u\,F(\theta)$, and so the contraposition of (209) yields:

$$\mathbb{P}_{i\sim S}\left[y_i H_T(x_i) \leq \theta\right] \leq u, \quad (211)$$

which yields our margin-based guarantee.

Step 2.2. At this point, the key (in)equalities are (208) (for boosting) and (211) (for margins). Fix $\kappa > 0$. We have two cases:

Case 1: $\overline{w}_t$ never gets too small, say $\overline{w}_t \geq \kappa$, $\forall t \geq 0$. In this case, granted that the weak learning assumption holds on $\tilde{e}_t$, (208) yields a direct lower bound on the iteration number $T$ to get $\mathbb{P}_{i\sim S}[y_i H_T(x_i) \leq \theta] \leq u$;

Case 2: $\overline{w}_t < \kappa$ at some iteration $t$. Since the smaller the weights are, the better classified the examples are, if we pick $\kappa$ small enough, then we can get $\mathbb{P}_{i\sim S}[y_i H_T(x_i) \leq \theta] \leq u$ "straight".
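Stepping back, the leveraging-coefficient algebra of Step 1 is easy to stress-test numerically: for any $a_f$ in the interval (201), $Z(a_f)$ in (200) stays above the bound (202). A randomized sketch (all sampled ranges are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    zeta, pi = rng.uniform(0, 1), rng.uniform(0, 1)
    Fbar, M = rng.uniform(0.1, 5), rng.uniform(0.1, 5)
    # a_f drawn from the interval (201):
    a = (1 - zeta) / (Fbar * M**2) * rng.uniform(1 - pi, 1 + pi)
    Z = a * ((1 - zeta) - Fbar * a * M**2 / 2)          # Z(a_f) in (200)
    assert Z >= (1 - zeta)**2 * (1 - pi**2) / (2 * Fbar * M**2) - 1e-12  # (202)
print("Z(a_f) >= (1-zeta)^2 (1-pi^2) / (2 Fbar M^2) on all sampled configurations")
```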
This suggests our use of the notion of denseness for weights (Bun et al., 2020).

Definition 20 The weights at iteration $t$ are called $\kappa$-dense iff $\overline{w}_t \geq \kappa$.

We now have the following Lemma.

Lemma 21 For any $t \geq 0$, $\theta \in \mathbb{R}$, $\kappa > 0$, if the weights produced in Step 2.1 of PILBOOST fail to be $\kappa$-dense, then

$$\mathbb{P}_{i\sim S}\left[y_i H_T(x_i) \leq \theta\right] \leq \frac{\kappa}{\tilde{f}(-\theta)}. \quad (212)$$

Proof Let $Z := \{z_1, z_2, \ldots, z_m\} \subset \mathbb{R}$ be a set of reals. Since $\tilde{f}$ is non-decreasing, we have, $\forall\theta \in \mathbb{R}$,

$$\mathbb{E}_i[\tilde{f}(z_i)] \geq \mathbb{P}_i[z_i < -\theta]\,\inf_z\tilde{f}(z) + \mathbb{P}_i[z_i \geq -\theta]\,\tilde{f}(-\theta) = \mathbb{P}_i[z_i \geq -\theta]\,\tilde{f}(-\theta), \quad (213)$$

since by assumption $\inf\tilde{f} = 0$. Pick $z_i := -y_i H_T(x_i)$. We get that if $\mathbb{P}_{i\sim S}[-y_i H_T(x_i) \geq -\theta] = \mathbb{P}_{i\sim S}[y_i H_T(x_i) \leq \theta] \geq \xi$, then $\overline{w}_t := \mathbb{E}_{i\sim S}[\tilde{f}(-y_i H_T(x_i))] \geq \xi\,\tilde{f}(-\theta)$. If we fix

$$\xi = \frac{\kappa}{\tilde{f}(-\theta)}, \quad (214)$$

then $\overline{w}_t < \kappa$ implies (212), which ends the proof of Lemma 21.

From Lemma 21, we let $\kappa := \xi\,\tilde{f}(-\theta)$ and $u := \xi$ in (211). If at any iteration $H_T$ fails to be $\kappa$-dense, then $\mathbb{P}_{i\sim S}[y_i H(x_i) \leq \theta] \leq \xi$ and the classifier satisfies the conditions of Theorem 11 (this is our Case 2 above). Otherwise, suppose it is always $\kappa$-dense (this is our Case 1 above); we then have, at any iteration $T$, the lower bound of (208).

In our experiments, we generally found that values $\alpha > 1$ performed best, where $\alpha$ increased as the amount of twist increased (both observations are consistent with our theory; see for example Lemma 3.4). Regarding the relationship between $a_f$ and $\alpha$, this appeared to depend on the dataset and the depth of the decision trees.

B.3. Random Class Noise Twister

Table 2: cancer, random class noise. Accuracies reported for each algorithm and level of twister. Depth-one trees. For $\alpha = 1.1$, $a_f = 7$; for $\alpha = 2$, $a_f = 2$; and for $\alpha = 4$, $a_f = 1$.

Algorithm      | p = 0         | p = 0.15      | p = 0.3
AdaBoost       | 0.966 ± 0.015 | 0.905 ± 0.027 | 0.856 ± 0.033
us (α = 1.1)   | 0.944 ± 0.029 | 0.912 ± 0.013 | 0.861 ± 0.042
us (α = 2.0)   | 0.956 ± 0.018 | 0.938 ± 0.017 | 0.905 ± 0.039
us (α = 4.0)   | 0.957 ± 0.014 | 0.917 ± 0.012 | 0.922 ± 0.032
XGBoost        | 0.971 ± 0.012 | 0.861 ± 0.033 | 0.733 ± 0.031

Figure 12: Random class noise twister on the diabetes dataset. Depth-3 trees. $a_f = 0.1$ for all $\alpha$.

Table 3: xd6, random class noise. Accuracies reported for each algorithm and level of twister. Depth-three trees. $a_f = 8$ for all $\alpha$ (note that for 0% noise, $\alpha = 4$ used $a_f = 0.1$).

Algorithm      | p = 0         | p = 0.15      | p = 0.3
AdaBoost       | 1.000 ± 0.000 | 0.949 ± 0.016 | 0.830 ± 0.043
us (α = 1.1)   | 1.000 ± 0.000 | 0.981 ± 0.013 | 0.886 ± 0.033
us (α = 2.0)   | 1.000 ± 0.000 | 0.992 ± 0.009 | 0.900 ± 0.027
us (α = 4.0)   | 1.000 ± 0.000 | 0.999 ± 0.003 | 0.927 ± 0.023
XGBoost        | 1.000 ± 0.000 | 0.912 ± 0.016 | 0.776 ± 0.041

Table 4: Online Shopping. Accuracies reported for each algorithm and level of twister. A random training sample is selected with probability p; then, for the selected training sample, each boolean feature is flipped with probability p, independently. Depth-three trees. For $\alpha = 1.1$, $a_f = 7$; for $\alpha = 2$, $a_f = 8$; and for $\alpha = 4$, $a_f = 15$.

Algorithm      | p = 0         | p = 0.10      | p = 0.20      | p = 0.30
AdaBoost       | 0.902 ± 0.002 | 0.900 ± 0.004 | 0.898 ± 0.005 | 0.894 ± 0.004
us (α = 1.1)   | 0.901 ± 0.005 | 0.899 ± 0.003 | 0.897 ± 0.004 | 0.890 ± 0.004
us (α = 2.0)   | 0.901 ± 0.004 | 0.895 ± 0.004 | 0.895 ± 0.003 | 0.894 ± 0.004
us (α = 4.0)   | 0.898 ± 0.003 | 0.873 ± 0.009 | 0.892 ± 0.005 | 0.889 ± 0.005
XGBoost        | 0.893 ± 0.005 | 0.874 ± 0.002 | 0.842 ± 0.006 | 0.782 ± 0.008
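For reference, the random class noise twister used in Tables 2 and 3 amounts to independent label flips; a minimal sketch (function name and seeding ours):

```python
import numpy as np

def random_class_noise(y, p, seed=0):
    """Random class noise twister: each training label in {-1, +1} is flipped
    independently with probability p (a sketch of the twist in Tables 2-3)."""
    rng = np.random.default_rng(seed)
    flips = rng.random(y.shape[0]) < p
    return np.where(flips, -y, y)
```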
B.4. Insider Twister

Figure 13: Box-and-whisker visualization of the scores associated with Figure 3. For all insider twister results, we fixed $a_f = 7$. Under no twister, $\alpha = 1.1$ has accuracy 0.901 ± 0.003, and XGBoost has accuracy 0.892 ± 0.003. Under the insider twister, $\alpha = 1.1$ has accuracy 0.850 ± 0.002, and XGBoost has accuracy 0.829 ± 0.016; under the Welch t-test, the results have a p-value of 0.004.

B.5. Discussion of XGBoost

Table 5: Average compute times per run (10 runs), in seconds, across the datasets. Note that the values of $a_f$ are chosen identically to the choices in Section B.3.

Algorithm      | cancer | xd6  | diabetes | shoppers
AdaBoost       | 1.41   | 0.75 | 1.11     | 13.68
us (α = 1.1)   | 2.19   | 2.01 | 2.19     | 30.88
us (α = 2.0)   | 1.11   | 0.79 | 2.09     | 21.85
us (α = 4.0)   | 0.96   | 1.35 | 1.82     | 13.01
XGBoost        | 0.29   | 0.28 | 0.46     | 3.16

XGBoost is a very fast, very well engineered boosting algorithm. It employs many different hyperparameters and customizations. In order to report the fairest comparison between AdaBoost, PILBOOST, and XGBoost, we opted to keep as many hyperparameters fixed (and similar, e.g., the depth of the decision trees) as possible. That being said, it appears that XGBoost inherently uses pruning, so that algorithm pruned while the other two did not. Further details regarding three other important points related to XGBoost:

1. Please refer to Table 5 for averaged compute times for the three different algorithms. In general, XGBoost had by far the fastest computation time among the three. However, note that PILBOOST was not particularly engineered for speed. Indeed, we estimate that the computation of $\tilde{f}$ accounts for 40-50% of the total computation time, which we believe can be improved. Thus, we leave the further computational optimization of PILBOOST for future work.

2. For details regarding regularization, refer to Figure 14, where we report a comparison of regularized XGBoost and PILBOOST such that the training data suffers from the insider twister. We find that regularization improves the ability of XGBoost to combat the twister, but it is not as effective as PILBOOST.

3. For details regarding early stopping, refer to Figure 16, where we report a comparison of early-stopped XGBoost (on un-twisted validation data, i.e., cheating) and PILBOOST such that the training data suffers from the insider twister. We find that even early stopping does not improve XGBoost's ability to combat the insider twister as effectively as PILBOOST. Early stopping on an untwisted hold-out set contradicts our experiment; with early stopping enabled on a twisted hold-out set, XGBoost generally did not early stop.

Figure 14: With regularization (where $\lambda = 20$), we still observe that the feature importance of XGBoost is perturbed. Note that PILBOOST is not regularized.

B.6. Adaptive α Experiment

Figure 15: Scores associated with Figure 14.

Figure 16: With early stopping (where XGBoost has access to clean validation data -- a cheating scenario), we still observe that the feature importance of XGBoost is perturbed. Note that PILBOOST is not early-stopped.

Figure 17: Scores associated with Figure 16.

Figure 18: Extended version of Figure 2 with two additional adaptive α methods.

"Original Learned PILBOOST" estimates its choice of $\alpha$ by using XGBoost as an oracle. That is, the method trains XGBoost on the noisy data, then computes its confusion matrix on a clean validation set. From the confusion matrix, the label flip probability $p$ is estimated using $p = \mathrm{avg}\left(\frac{FP}{TP+FP}, \frac{FN}{FN+TN}\right)$. Next, we estimate $\eta_c$ and $\eta_t$ with $\eta_c = \frac{FN+TP}{FP+TP+FN+TN}$ and $\eta_t = \frac{FP+TP}{FP+TP+FN+TN}$, respectively.
Lastly, similar to "Menon PILBOOST", using the estimates of $p$, $\eta_c$, and $\eta_t$, we apply the formula in Lemma 8(c) and the SLN formula given just before Lemma 4 to obtain an estimate for $\alpha^*$. "Taylor Series PILBOOST" is identical to "Original Learned PILBOOST" except at the last step, where a Taylor series approximation of the formula in Lemma 8(c) is used instead. We find that "Menon's Method" also outperforms both of these methods on the xd6 dataset, except for when "Original Learned PILBOOST" slightly outperforms "Menon's Method" in the very high noise regime. Even stronger, note that both "Original Learned PILBOOST" and "Taylor Series PILBOOST" assume more information than "Menon's Method", which only uses the noisy training data, not a clean validation set.

Figure 19: Companion figure to Figures 2 and 18 on the diabetes dataset.

Figure 20: Companion figure to Figures 2 and 18 on the cancer dataset.
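A compact rendition of the estimation pipeline just described (a sketch: the helper name and the confusion-matrix counts are ours, and the SLN correction applied just before Lemma 4 is deliberately omitted since it is not restated here):

```python
import numpy as np

def estimate_alpha(tp, fp, fn, tn):
    """'Original Learned PILBOOST' (B.6): estimate the flip rate p, eta_c and
    eta_t from an oracle's confusion matrix, then alpha* via the Lemma 8(c)
    ratio of logits. The SLN formula step is omitted in this sketch."""
    iota = lambda u: np.log(u / (1.0 - u))
    p = 0.5 * (fp / (tp + fp) + fn / (fn + tn))   # estimated label flip probability
    n = tp + fp + fn + tn
    eta_c = (fn + tp) / n                          # clean positive rate estimate
    eta_t = (fp + tp) / n                          # twisted positive rate estimate
    return p, iota(eta_c) / iota(eta_t)

p_hat, alpha_hat = estimate_alpha(tp=620, fp=40, fn=60, tn=280)
print(f"p_hat = {p_hat:.3f}, alpha_hat = {alpha_hat:.3f}")
```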