# Narrow Margins: Classification, Margins and Fat Tails

Francois Buet-Golfouse

Department of Mathematics, University College London, London, United Kingdom. Correspondence to: Francois Buet-Golfouse.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

It is well known that, for separable data, the regularised two-class logistic regression or support vector machine estimate, once re-normalised, converges to the maximal margin classifier as the regularisation hyper-parameter $\lambda$ goes to 0. The fact that different loss functions may lead to the same solution is of theoretical and practical relevance, as margin maximisation allows more straightforward considerations in terms of generalisation and geometric interpretation. We investigate the case where this convergence property is not guaranteed to hold and show that it can be fully characterised by the distribution of the error terms in the latent variable interpretation of linear classifiers. In particular, if the errors follow a regularly varying distribution, then the regularised and re-normalised estimate does not converge to the maximal margin classifier. This shows that classification with fat tails has a qualitatively different behaviour, which should be taken into account when considering real-life data.

Introduction

Margin maximisation, see for instance (Hastie et al., 2009; Vapnik, 1998), is an important concept that defines a property of finite-sample optimality linked to separability between points from two different classes. As the sample size grows, this property becomes less relevant in the sense that the dataset is unlikely to remain separable. Margin maximisation and separating hyperplanes nonetheless remain appealing for (at least) two reasons: first, they are an intuitive concept and a benchmark in any classification task, and, second, using boosting or kernel SVMs in a higher-dimensional space can enable separability. (Rosset et al., 2003; 2004), building on previous work by (Bartlett et al., 2006; Freund and Schapire, 1997; Friedman et al., 2000; Schapire et al., 1997) and (Mangasarian, 1999), consider the case of a linear classifier (e.g., logistic regression or support vector machine) and investigate the convergence of a regularised estimator to a margin-maximising hyperplane when the data is separable. Intriguingly, they established that, under an apparently mild criterion on the loss function (see Eq. (3)), this convergence is guaranteed. This was an important result from at least two standpoints: first, it established a relationship between regularised classifiers and margin maximisation, and, second, it showed that the usual loss functions share that property, making the exact choice of link function a second-order consideration. The key results of our work are, on the one hand, a (partial) answer to the open question and conjecture in (Rosset et al., 2003) and, on the other hand, the link between non-convergence to a margin-maximising classifier and regular variation (cf. (Bingham et al., 1987)) of the loss function. While margin maximisation is quite specific to the linear setting, deriving analytical properties of loss functions that are also used in other settings, such as deep learning, is particularly interesting to understand choices of loss functions and their implications.
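The convergence discussed above can be observed directly on synthetic data. The following minimal sketch (not taken from the paper; all function and parameter names are illustrative assumptions) fits an $L_2$-regularised logistic regression on linearly separable data for decreasing $\lambda$ and compares the normalised coefficient vector with the direction of a (near) hard-margin linear SVM; scikit-learn's `C` parameter is used as a stand-in for $1/\lambda$, up to sample-size scaling.

```python
# Minimal illustrative sketch (not from the paper): on separable data, the
# re-normalised L2-regularised logistic estimate approaches the maximal-margin
# (hard-margin SVM) direction as the regularisation strength lambda -> 0.
# scikit-learn's C is used as a stand-in for 1/lambda (up to scaling).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2.0, 0.5, size=(50, 2)),
               rng.normal(-2.0, 0.5, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)          # linearly separable by construction

# Reference direction: a linear SVM with a very large C approximates the
# hard-margin (maximal-margin) separating hyperplane.
w_svm = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()
w_svm /= np.linalg.norm(w_svm)

for lam in (1e0, 1e-2, 1e-4, 1e-6):
    w = LogisticRegression(C=1.0 / lam, fit_intercept=False,
                           max_iter=10_000).fit(X, y).coef_.ravel()
    w /= np.linalg.norm(w)
    print(f"lambda={lam:.0e}  cosine with max-margin direction: {w @ w_svm:+.6f}")
```

The cosine similarity should move towards 1 as $\lambda$ decreases, in line with Theorem 1 below; with a heavy-tailed loss (Section 5), this is no longer the case.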
We establish a connection between this problem and heavy tails; (Taleb, 2020) offers a wide-ranging review of heavy tails in multiple applications, (Ibragimov et al., 2015) consider more specifically the role of heavy tails in finance and inference, and applications in supervised learning (mainly regression) are described in (Brownlees et al., 2015; Hsu and Sabato, 2016; Lugosi and Mendelson, 2019). Earlier approaches such as (Chatterjee and Hadi, 1986; Huber and Ronchetti, 2009; Wang et al., 2007) considered heavy tails through the lens of robust estimation. Additional research on classification tasks under a heavy-tail regime is warranted to refine the current state of understanding.

Contributions

Our contributions in this paper can be articulated around three questions.

Is there a converse statement to (Rosset et al., 2003)'s sufficient condition? In other words, if a normalised and regularised estimator converges to a margin-maximising hyperplane, must the loss function satisfy the same criterion? We show that, under some additional assumptions, namely the convexity and differentiability of the loss function $\ell$, this indeed holds.

If the ratio criterion is not verified, what can be said about the loss function? Interestingly, we establish that such losses can be shown to be in the class of regularly varying functions under mild assumptions (see (Bingham et al., 1987) for an introduction to the theory of regularly varying functions).

Is there a probabilistic interpretation of these analytical results? Using the latent interpretation of binary classification models, we show that the distribution of the latent variable must also be regularly varying, loosely characterised by heavy tails.

While the starting point of this work has to do with linear models and margin-maximising solutions, the characterisation of loss functions (and of their behaviour) is of broad interest.

Setup and definitions

We consider the case of binary classification; we thus suppose that we have $n$ observations of a feature vector $x_i \in \mathbb{R}^d$ and label $y_i \in \{-1, 1\}$, for $i = 1, \ldots, n$. The loss function $\ell : \mathbb{R} \to \mathbb{R}_+$ is supposed to depend only on the margin and to be monotonic, non-increasing, non-negative and continuous, while the underlying model $g(x) = \beta^T h(x)$ is taken to be linear. We thus minimise the empirical risk:

$$ \min_{\beta \in \mathbb{R}^{|\mathcal{H}|}} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i \beta^T h(x_i)\big), \qquad (1) $$

where $\mathcal{H} = \{h_1(x), \ldots\}$ is a finite dictionary of functions. The prediction at point $x$ is simply $\operatorname{sign}\big(\beta^T h(x)\big)$. But, as pointed out by (Rosset et al., 2003), when $|\mathcal{H}|$ is large, some regularisation is required in order to control the complexity of the classifier:

$$ \min_{\beta \in \mathbb{R}^{|\mathcal{H}|}} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i \beta^T h(x_i)\big) + \lambda \|\beta\|_p^p, \qquad (2) $$

for $p \geq 1$. In the following, we denote by $\beta_\lambda$ (possibly one of) the solution(s) to Problem (2).

1. The sufficient condition

Let us start by recalling the main result from (Rosset et al., 2003).

Theorem 1. (Theorem 2.1 in (Rosset et al., 2003)) Assume that the data $\{x_i, y_i\}_{i=1}^n$ is separable (i.e., there exists $\beta \in \mathbb{R}^{|\mathcal{H}|}$ such that $\min_i y_i \beta^T h(x_i) > 0$). Let $\ell$ be a monotone non-increasing, non-negative loss function depending on the margin only. If there exists $T > 0$ (possibly $T = +\infty$) such that

$$ \lim_{t \to T} \frac{\ell(t(1-\varepsilon))}{\ell(t)} = +\infty, \qquad (3) $$

for all $\varepsilon \in (0, 1)$, then $\ell$ is a margin-maximising loss function, in the sense that any convergence point of the normalised solutions $\beta_\lambda / \|\beta_\lambda\|_p$ to the regularised problem (Eq. (2)) as $\lambda \to 0$ is an $L_p$ margin-maximising separating hyperplane. Consequently, if the margin-maximising hyperplane is unique, then the solutions converge to it:

$$ \lim_{\lambda \to 0} \frac{\beta_\lambda}{\|\beta_\lambda\|_p} = \arg\max_{\beta,\ \|\beta\|_p = 1} \ \min_i y_i \beta^T h(x_i). \qquad (4) $$
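The criterion in Eq. (3) can also be probed numerically. The short sketch below (not from the paper; the loss definitions are the standard ones recalled in Sections 1.2, 1.3 and 5.3) evaluates $\ell(t(1-\varepsilon))/\ell(t)$ at a few increasing values of $t$: the ratio blows up for the exponential, logistic and Probit losses, but settles at a finite value for a loss built from a heavy-tailed (Cauchy) latent distribution, anticipating Section 5.

```python
# Minimal sketch (not from the paper): evaluate the ratio l(t(1-eps)) / l(t)
# appearing in Eq. (3) at a few increasing values of t.  Light-tailed losses
# (exponential, logistic, Probit) send the ratio to infinity; a loss derived
# from a heavy-tailed latent c.d.f. (-log of the Cauchy c.d.f.) does not.
import numpy as np
from scipy.stats import norm, cauchy

losses = {
    "exponential": lambda t: np.exp(-t),
    "logistic":    lambda t: np.log1p(np.exp(-t)),
    "probit":      lambda t: -norm.logcdf(t),
    "cauchy":      lambda t: -cauchy.logcdf(t),
}

eps = 0.5
for name, loss in losses.items():
    ratios = [loss(t * (1 - eps)) / loss(t) for t in (5.0, 10.0, 20.0)]
    print(f"{name:12s}", ["%.3g" % r for r in ratios])
```

For the Cauchy-based loss the ratio stabilises rather than diverging, which is precisely the regime characterised in Section 3.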
1.1. Interpretation

The condition $\lim_{t \to T} \ell(t(1-\varepsilon))/\ell(t) = +\infty$ has a very natural explanation. For the ratio condition in Eq. (3) to hold, it must be that $\lim_{t \to T} \ell(t) = 0$ (otherwise the ratio would be finite; the case $\lim_{t \to T} \ell(t) = +\infty$ would imply that $\ell(t) = +\infty$ for all $t$, given the non-increasingness of $\ell$). Now, if we suppose that $\ell$ is differentiable and that $\ell'$ is non-zero in a neighbourhood of $T$, we obtain by L'Hospital's rule that

$$ \lim_{t \to T} \frac{\ell(t(1-\varepsilon))}{\ell(t)} = (1-\varepsilon) \lim_{t \to T} \frac{\ell'(t(1-\varepsilon))}{\ell'(t)}, \qquad (5) $$

so that $\lim_{t \to T} \ell'(t)/\ell'(t(1-\varepsilon)) = 0$. In other words, the marginal utility of having a margin of size $t$ versus a margin of size $t(1-\varepsilon)$ goes to 0. Roughly speaking, this means that datapoints with a smaller margin will contribute a lot more to the empirical loss.

1.2. Usual loss functions

It is straightforward to verify that the usual loss functions satisfy the criterion in Eq. (3), such as the exponential loss $\ell_{\text{Exponential}} : t \mapsto e^{-t}$ (used implicitly in AdaBoost, cf. (Freund and Schapire, 1997; Friedman et al., 2000)), the log-likelihood loss $\ell_{\text{Logistic}} : t \mapsto \log(1 + e^{-t})$ used in logistic regression, or the hinge loss $\ell_{\text{SVM}} : t \mapsto \max(0, 1 - t)$, which is central to support vector machines (see (Hastie et al., 2009; Vapnik, 1998)). The case $T < +\infty$ is only of interest for hinge-type losses where a cut-off is applied.

1.3. The case of Probit regression

We can show that another well-known loss function, namely the one used in Probit regression and not considered in (Rosset et al., 2003), satisfies the criterion in Eq. (3). In the case of Probit regression, the associated margin loss function is defined as $\ell_{\text{Probit}} : t \mapsto -\log(\Phi(t))$, where $\Phi$ is the standard Gaussian cumulative distribution function. Since $\lim_{t \to +\infty} \Phi(t) = 1$ and $\Phi'(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$, it follows, again by L'Hospital's rule, that

$$ \lim_{t \to +\infty} \frac{\ell_{\text{Probit}}(t(1-\varepsilon))}{\ell_{\text{Probit}}(t)} = (1-\varepsilon) \lim_{t \to +\infty} e^{\frac{t^2}{2}\left(1 - (1-\varepsilon)^2\right)} = +\infty. $$

1.4. A seemingly universal result

Theorem 1, combined with the fact that the most frequently used loss functions satisfy the criterion in Eq. (3), means that if the data is separable and the margin-maximising hyperplane is unique, then the exact choice of loss function does not matter, as all usual loss functions lead to the same end result. Our overall results and the considerations in Section 5.3 somewhat qualify that statement.

2. The necessary condition

In this section, we aim at answering an open question in (Rosset et al., 2003) around the existence of a converse to Theorem 1. In other words, if $\lim_{\lambda \to 0} \beta_\lambda / \|\beta_\lambda\|_p = \beta^*$, where $\beta^*$ is a margin-maximising hyperplane with unit norm, is it true that $\lim_{t \to T} \ell(t(1-\varepsilon))/\ell(t) = +\infty$ for all $\varepsilon > 0$? We bring a partial positive answer to this question, focus on the case where $T = +\infty$ and $p = 2$, and make the additional assumptions that $\ell$ is decreasing with $\lim_{t \to +\infty} \ell(t) = 0$, convex and differentiable with continuous derivative $\ell'$. We thus consider the objective function to be minimised:

$$ L(\beta; \lambda) = \frac{1}{n} \sum_{i=1}^n \ell\big(y_i \beta^T h(x_i)\big) + \lambda \beta^T \beta. \qquad (6) $$

This section goes through the steps that lead to the result, starting from the assumption that the normalised ridge solution $\beta_\lambda / \|\beta_\lambda\|_2$ converges to a margin-maximising hyperplane $\beta^*$ with unit norm. In what follows, we write $m_{i,\lambda} := y_i \beta_\lambda^T h(x_i)$ for the margins at the regularised solution. First, we notice that the normalised regularised solution vector can be expressed as a linear combination of the feature vectors; the numerical sketch below illustrates the behaviour of the resulting weights, which Proposition 1 then makes precise.
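The following minimal sketch (not from the paper; names and the scikit-learn parametrisation are illustrative assumptions, with `C` standing in for $1/\lambda$ up to scaling) illustrates this decomposition for the logistic loss, for which $-\ell'(m) = 1/(1 + e^{m})$: as $\lambda$ decreases, the normalised weights concentrate on the points with the smallest margins, i.e. on the support vectors of the limiting hyperplane.

```python
# Minimal sketch (not from the paper): for the logistic loss, compute weights
# alpha_i proportional to -l'(m_i) = 1/(1 + exp(m_i)) at the ridge solution
# and watch them concentrate on the smallest-margin points as the
# regularisation strength lambda decreases (C ~ 1/lambda up to scaling).
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+2.0, 0.6, size=(60, 2)),
               rng.normal(-2.0, 0.6, size=(60, 2))])
y = np.array([1] * 60 + [-1] * 60)

for lam in (1e0, 1e-3, 1e-6):
    beta = LogisticRegression(C=1.0 / lam, fit_intercept=False,
                              max_iter=50_000).fit(X, y).coef_.ravel()
    m = y * (X @ beta)                 # margins m_{i,lambda}
    alpha = expit(-m)                  # proportional to -l'(m) for the logistic loss
    alpha /= alpha.sum()
    print(f"lambda={lam:.0e}  weight on the 5 largest alpha_i: {np.sort(alpha)[-5:].sum():.3f}")
```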
Proposition 1. For a given $\lambda > 0$, the normalised solution vector $\beta_{\lambda,1} = \beta_\lambda / \|\beta_\lambda\|_2$ can be expressed as

$$ \beta_{\lambda,1} = K_\lambda \sum_{i=1}^n \alpha_{i,\lambda}\, y_i h(x_i), $$

where $\alpha_{i,\lambda} = \ell'(m_{i,\lambda}) / \sum_{j=1}^n \ell'(m_{j,\lambda}) \geq 0$ for all $i$, and $K_\lambda > 0$ is a normalising constant such that $\|\beta_{\lambda,1}\|_2 = 1$. In addition, it holds that

$$ 0 < \frac{1}{n} \min_{i=1,\ldots,n} \|h(x_i)\|_2 \ \leq\ K_\lambda^{-1} \ \leq\ \max_{i=1,\ldots,n} \|h(x_i)\|_2, $$

i.e., $K_\lambda$ is always bounded by upper and lower bounds that are independent of $\lambda$.

Proof. The first-order condition of the problem reads $\frac{\partial L}{\partial \beta} = \frac{1}{n} \sum_{i=1}^n \ell'(m_{i,\lambda})\, y_i h(x_i) + 2\lambda\beta = 0$, leading to $\beta_\lambda = -\frac{1}{2\lambda n} \sum_{i=1}^n \ell'(m_{i,\lambda})\, y_i h(x_i)$. Since the loss function $\ell$ is decreasing, $\ell'(t) < 0$ for all $t \in \mathbb{R}$, so that $\alpha_{i,\lambda} = \ell'(m_{i,\lambda}) / \sum_{j=1}^n \ell'(m_{j,\lambda})$ is positive for all $i$. Now, $K_\lambda^2 = 1 / \|\sum_{i=1}^n \alpha_{i,\lambda} y_i h(x_i)\|_2^2$; since $\sum_{i=1}^n \alpha_{i,\lambda} = 1$, it is well known that $1/n \leq \sum_{i=1}^n \alpha_{i,\lambda}^2 \leq 1$, which yields the stated bounds. □

From now on, we can thus focus on the behaviour of the weights $\alpha_{i,\lambda}$ specifically.

Proposition 2. Suppose that $h(x_j)$ is not a support vector of the limiting margin-maximising hyperplane $\beta^*$; then $\alpha_{j,\lambda} \to 0$ as $\lambda \to 0$. On the other hand, if $h(x_i)$ is a support vector, then $\alpha_{i,\lambda}$ is bounded from below.

Proof. Since $\beta_{\lambda,1}$ converges to a margin-maximising hyperplane and by continuity of the minimal margin in $\beta$, there exists $\lambda_0 > 0$ such that, for any $\lambda < \lambda_0$ and for all $i = 1, \ldots, n$, $m_{i,\lambda} \geq 0$. Similarly, given that $\beta^*$ corresponds to a margin-maximising hyperplane, it holds that $\beta^* = K_\infty \sum_{i=1}^n \alpha_{i,\infty}\, y_i h(x_i)$, where $\alpha_{i,\infty} > 0$ if $h(x_i)$ is a support vector (in other words, on the boundary of the slab) and $\alpha_{i,\infty} = 0$ otherwise (this can be obtained via the dual approach to the margin maximisation problem, see (Vapnik, 1998) or Section 4.5.2 in (Hastie et al., 2009)). By assumption and given the loss-minimising property, we have the convergence of the normalised solution vector to the margin-maximising weight vector $\beta^*$ as $\lambda \to 0$: $\beta_{\lambda,1} \to \beta^*$. But this is equivalent to $K_\lambda \alpha_{i,\lambda} \to K_\infty \alpha_{i,\infty}$. In particular, for non-support vectors, this means $\alpha_{j,\lambda} \to 0$. Let us now show that, for any support vector $h(x_i)$, its associated coefficient $\alpha_{i,\lambda}$ is bounded from below for $\lambda$ small enough. Indeed, since $K_\lambda \alpha_{i,\lambda} \to K_\infty \alpha_{i,\infty} > 0$, taking $\delta = K_\infty \alpha_{i,\infty}/2 > 0$, there exists $\lambda_1$ such that, for any $\lambda \leq \lambda_1$, $K_\lambda \alpha_{i,\lambda} \geq K_\infty \alpha_{i,\infty} - \delta = K_\infty \alpha_{i,\infty}/2$; together with the $\lambda$-independent bounds on $K_\lambda$ from Proposition 1, this bounds $\alpha_{i,\lambda}$ from below. □

This distinction between support and non-support vectors will now help us characterise the behaviour of the loss function.

Proposition 3. For any non-support vector $h(x_j)$ and any support vector $h(x_i)$, it holds that

$$ \frac{\ell'(m_{j,\lambda})}{\ell'(m_{i,\lambda})} \longrightarrow 0 \quad \text{as } \lambda \to 0. \qquad (7) $$

Proof. Let us pick $i$ such that $h(x_i)$ is a support vector and $j$ such that $h(x_j)$ is not. Then

$$ \alpha_{j,\lambda} = \frac{\ell'(m_{j,\lambda})}{\sum_{k=1}^n \ell'(m_{k,\lambda})} = \frac{\ell'(m_{j,\lambda})}{\ell'(m_{i,\lambda})} \cdot \frac{\ell'(m_{i,\lambda})}{\sum_{k=1}^n \ell'(m_{k,\lambda})} = \frac{\ell'(m_{j,\lambda})}{\ell'(m_{i,\lambda})}\, \alpha_{i,\lambda}. $$

Since $\alpha_{i,\lambda}$ is bounded from below, $\alpha_{j,\lambda} \to 0$ implies that $\ell'(m_{j,\lambda})/\ell'(m_{i,\lambda}) \to 0$. □

We are now in a position to characterise the limiting property and tail behaviour of the ratio of the derivative of the loss function.

Proposition 4. Consider a non-support vector $h(x_j)$ and a support vector $h(x_i)$, with respective limiting margins $m_{j,\infty}, m_{i,\infty}$, and let $\epsilon \in \big(0, \frac{m_{j,\infty} - m_{i,\infty}}{2}\big)$. Then

$$ \lim_{t \to +\infty} \frac{\ell'(t(1-\mu))}{\ell'(t)} = +\infty, \qquad (8) $$

where $\mu = \frac{m_{j,\infty} - m_{i,\infty} - 2\epsilon}{m_{j,\infty} - \epsilon} \in (0, 1)$.

Proof. Observe that we can rewrite the margin as $m_{k,\lambda} = \|\beta_\lambda\|_2\, y_k \beta_{\lambda,1}^T h(x_k) := \|\beta_\lambda\|_2\, m_{k,\lambda,1}$ for $k = 1, \ldots, n$. In other words, $m_{k,\lambda,1}$ is the normalised margin. By convergence of the normalised weight vector and by continuity of the margin, we have that $m_{k,\lambda,1} \to m_{k,\infty} := y_k {\beta^*}^T h(x_k)$. Since $i$ is a support vector but $j$ is not, it follows that $m_{i,\infty} < m_{j,\infty}$. Hence, for any $0 < \epsilon < \frac{m_{j,\infty} - m_{i,\infty}}{2}$, there exists $\lambda_0 > 0$ such that, for all $\lambda \leq \lambda_0$,

$$ 0 < m_{i,\infty} - \epsilon \leq m_{i,\lambda,1} \leq m_{i,\infty} + \epsilon < m_{j,\infty} - \epsilon \leq m_{j,\lambda,1} \leq m_{j,\infty} + \epsilon. $$
Since $\ell$ is convex, $\ell'$ is non-decreasing, hence the key inequality:

$$ 0 \leq \frac{\ell'\big(\|\beta_\lambda\|_2 (m_{j,\infty} - \epsilon)\big)}{\ell'\big(\|\beta_\lambda\|_2 (m_{i,\infty} + \epsilon)\big)} \leq \frac{\ell'(m_{j,\lambda})}{\ell'(m_{i,\lambda})}. \qquad (9) $$

But, as $\lambda \to 0$, $\|\beta_\lambda\|_2 \to +\infty$ (since for $\lambda$ small enough all margins are positive while $\ell(t) > 0$ for every finite $t$, $\|\beta_\lambda\|_2$ must diverge as $\lambda \to 0$) and $\ell'(m_{j,\lambda})/\ell'(m_{i,\lambda}) \to 0$ thanks to Proposition 3. This now implies that

$$ \lim_{t \to +\infty} \frac{\ell'\big(t(m_{j,\infty} - \epsilon)\big)}{\ell'\big(t(m_{i,\infty} + \epsilon)\big)} = 0. $$

By continuity of $\ell'$, this is equivalent to

$$ \lim_{t \to +\infty} \frac{\ell'(t(1-\mu))}{\ell'(t)} = +\infty, \qquad (10) $$

where $\mu = \frac{m_{j,\infty} - m_{i,\infty} - 2\epsilon}{m_{j,\infty} - \epsilon} \in (0, 1)$. □

It now remains to derive a similar result for all $\mu \in (0, 1)$ and for $\ell$ rather than $\ell'$.

Proposition 5. Under the assumptions of this section, it holds that

$$ \lim_{t \to +\infty} \frac{\ell(t(1-\mu))}{\ell(t)} = +\infty, \qquad (11) $$

for any $\mu \in (0, 1)$.

Proof. The result of Proposition 4 holds for any positive margins $m_{i,\infty}, m_{j,\infty}$ such that $0 < m_{i,\infty} < m_{j,\infty}$ and any $0 < \epsilon < \frac{m_{j,\infty} - m_{i,\infty}}{2}$; hence, for any $\mu \in (0, 1)$, it must hold that $\lim_{t \to +\infty} \ell'(t(1-\mu))/\ell'(t) = +\infty$. This is not quite the desired result, but, since $\lim_{t \to +\infty} \ell(t) = 0$ and $\ell$ is differentiable (with $\ell'(t) \neq 0$ for $t > 0$, given that $\ell$ is decreasing), we can apply L'Hospital's rule to get $\lim_{t \to +\infty} \ell(t(1-\mu))/\ell(t) = (1-\mu) \lim_{t \to +\infty} \ell'(t(1-\mu))/\ell'(t) = +\infty$, for any $\mu \in (0, 1)$. □

Remark 1. Let us point out that the same analysis can be conducted, albeit coordinate by coordinate, for any $p > 1$ (i.e., as long as the $p$-norm is differentiable).

3. Functional characterisation of the loss $\ell$

We have shown that the condition on the divergence of the ratio $\ell((1-\varepsilon)t)/\ell(t)$ is a necessary (under strict assumptions) and sufficient (under mild assumptions) condition for the convergence of the normalised regularised estimator $\beta_{\lambda,1}$ to a margin-maximising solution. In this section, we are interested in understanding the case where the ratio criterion in Eq. (3) does not hold and its consequences in terms of the loss function. From now on, we thus suppose that there exists $\varepsilon \in (0, 1)$ such that

$$ \lim_{t \to +\infty} \frac{\ell((1-\varepsilon)t)}{\ell(t)} = \gamma \neq +\infty. \qquad (12) $$

We make the assumption, as in the previous section, that for any $a > 0$, $\lim_{t \to +\infty} \ell(at)/\ell(t)$ exists in $\overline{\mathbb{R}}_+$, i.e., the limit exists but can be $+\infty$ or a non-negative real. This assumption is made for the sake of simplicity but can be weakened, see Section 3.3. For the sake of clarity, let us now denote $\eta := 1 - \varepsilon$.

3.1. The ratio $\ell(at)/\ell(t)$ has a limit for all $a > 0$

Proposition 6. For any $n \in \mathbb{Z}$, $\lim_{t \to +\infty} \ell(\eta^n t)/\ell(t) = \gamma^n$.

Proof. Given that $\lim_{t \to +\infty} \ell(\eta t)/\ell(t) = \gamma \geq 1$, we also have $\lim_{t \to +\infty} \ell(t)/\ell(\eta t) = 1/\gamma$, and, by continuity, $\lim_{t \to +\infty} \ell(t/\eta)/\ell(t) = 1/\gamma$. If we consider the case $a = \eta^n$, where $n \in \mathbb{N}^*$, we can write

$$ \frac{\ell(at)}{\ell(t)} = \frac{\ell(\eta^n t)}{\ell(t)} = \prod_{i=0}^{n-1} \frac{\ell(\eta^{i+1} t)}{\ell(\eta^i t)}. $$

But we observe that, by continuity, $\lim_{t \to +\infty} \ell(\eta^{i+1} t)/\ell(\eta^i t) = \lim_{t \to +\infty} \ell(\eta t)/\ell(t) = \gamma$, thus leading to $\lim_{t \to +\infty} \ell(\eta^n t)/\ell(t) = \gamma^n$. Bringing these two facts together (negative and positive powers of $\eta$), the result holds. □

Proposition 7. There exists a function $\rho : \mathbb{R}_+^* \to \mathbb{R}_+^*$ such that

$$ \lim_{t \to +\infty} \frac{\ell(at)}{\ell(t)} = \rho(a). \qquad (13) $$

In particular, $\rho(a) > 0$ for any positive $a$.

Proof. For any $0 < a < \eta$, there exists $n_a \in \mathbb{N}$ such that $\eta^{n_a + 1} \leq a < \eta^{n_a}$, so that

$$ \frac{\ell(\eta^{n_a} t)}{\ell(t)} \leq \frac{\ell(at)}{\ell(t)} \leq \frac{\ell(\eta^{n_a + 1} t)}{\ell(t)}. $$

Based on our previous results (and the earlier assumption that the limit exists in $\overline{\mathbb{R}}_+$), this implies that $\gamma^{n_a} \leq \lim_{t \to +\infty} \ell(at)/\ell(t) \leq \gamma^{n_a + 1}$, which is finite and positive. The case $a \geq \eta$ is handled similarly, since there exists $n_a \in \mathbb{N}$ such that $\eta^{-n_a + 1} \leq a < \eta^{-n_a}$. □
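These limits can be checked numerically on a concrete loss that violates the ratio criterion. The sketch below (not from the paper) uses $\ell(t) = -\log F_{\text{Cauchy}}(t)$, anticipating Section 5.3, and verifies the multiplicative behaviour $\ell(\eta^n t)/\ell(t) \to \gamma^n$ of Proposition 6.

```python
# Minimal sketch (not from the paper): for a loss violating the ratio criterion
# (here l(t) = -log F_Cauchy(t), anticipating Section 5.3), the limits in
# Proposition 6 behave multiplicatively: l(eta^n t)/l(t) -> gamma^n.
import numpy as np
from scipy.stats import cauchy

loss = lambda t: -cauchy.logcdf(t)
eta, t_big = 0.5, 1e6
gamma = loss(eta * t_big) / loss(t_big)          # empirical gamma for this eta
for n in (1, 2, 3):
    ratio = loss(eta ** n * t_big) / loss(t_big)
    print(f"n={n}:  l(eta^n t)/l(t) = {ratio:.4f}   gamma^n = {gamma ** n:.4f}")
```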
3.2. $\ell$ as a regularly varying function

To make use of this result, let us briefly recall some fundamentals of the theory of regularly varying functions (see (Bingham et al., 1987) for all results mentioned here).

Definition 1. A (measurable) function $L : \mathbb{R}_+^* \to \mathbb{R}_+^*$ is said to be slowly varying (at infinity) if, for all $a > 0$, $\lim_{t \to +\infty} L(at)/L(t) = 1$.

Similarly, we can introduce regularly varying functions:

Definition 2. A (measurable) function $h : \mathbb{R}_+^* \to \mathbb{R}_+^*$ is said to be regularly varying (at infinity) if, for all $a > 0$, $\lim_{t \to +\infty} h(at)/h(t) = \rho(a)$, where $\rho(a)$ is finite and non-zero for every $a > 0$.

This is exactly the setup that we established in the previous subsection, in particular in Proposition 7. One of the cornerstones of the theory of regularly varying functions is Karamata's characterisation theorem.

Theorem 2. Every regularly varying function $h : \mathbb{R}_+^* \to \mathbb{R}_+^*$ is of the form $h(t) = t^\zeta L(t)$, where $\zeta \in \mathbb{R}$ and $L$ is a slowly varying function.

In particular, it follows directly that $\lim_{t \to +\infty} h(at)/h(t) = a^\zeta$. This implies that the limit function $\rho$ is uniquely defined as $\rho(a) = a^\zeta$ and can only be a power function. Note that a closely related result is Karamata's representation theorem, which gives a precise representation of slowly varying functions. In our case, we can thus conclude that if $\ell$ does not satisfy the ratio criterion, then $\ell$ is a regularly varying function, and it is straightforward to infer that

$$ \zeta = \frac{\log \gamma}{\log \eta}. \qquad (14) $$

Since $\eta \in (0, 1)$ and $\gamma \geq 1$, we have $\zeta \leq 0$. Note that $\zeta = 0$ if and only if $\gamma = 1$, in which case the loss function $\ell$ is slowly varying. Since $\zeta$ is non-positive, we generally consider $\xi = -\zeta$ rather than $\zeta$ directly. While the characterisation and representation of the loss function are interesting results in their own right, it is possible to make them more intuitive by adopting a probabilistic viewpoint.

3.3. Discussion

Before moving forward, let us pause briefly to discuss the assumption made to obtain this section's results. Our assumption is that, for any $a > 0$, $\lim_{t \to +\infty} \ell(at)/\ell(t)$ exists in $\overline{\mathbb{R}}_+$. This is a global assumption which, coupled with Eq. (12), implies that the limit must then be finite everywhere. However, this assumption does not require any additional finiteness condition. The global aspect of the assumption can be significantly weakened if one posits that there exists at least one other point at which the limit exists and is finite. Our result, in this case, is as follows:

Theorem 3. Let $a_1, a_2 \in \mathbb{R}_+^* \setminus \{1\}$ be such that $\log a_1 / \log a_2 \notin \mathbb{Q}$ and $\lim_{t \to +\infty} \ell(a_i t)/\ell(t) = \rho(a_i) < +\infty$ for $i = 1, 2$. Then, for any $a > 0$, $\lim_{t \to +\infty} \ell(at)/\ell(t) = \rho(a)$, where, for any $a > 0$, $\rho(a) = a^\zeta$ for some $\zeta \in \mathbb{R}$.

Proof. Given that $\ell$ is non-negative and non-increasing, we can apply Theorem K in (Seneta, 2002) to the function $g$ defined as $g(u) := \log \ell(e^u)$ for $u \in \mathbb{R}$. □

The condition $\log a_1 / \log a_2 \notin \mathbb{Q}$ may, however, not be obvious to check. To summarise, the main takeaway is that Eq. (12), on its own, is in general not enough to guarantee that $\ell$ is regularly varying, and an additional assumption is required.

4. Probabilistic interpretation of the characterisation result

In this section, we recall the latent interpretation of binary classification and, in particular, discuss the assumption of symmetry of the latent variable and its inherent limitation. This approach will then be applied to regularly varying losses and distributions in the next section.

4.1. Latent interpretation of classification

It is sometimes useful to posit a threshold model whereby a variable $\varepsilon_i$ is unobservable but such that the observed class label $y_i \in \{-1, +1\}$ is given by

$$ \mathbf{1}_{\{y_i = -1\}} = \mathbf{1}_{\{\beta^T h(x_i) + \varepsilon_i < 0\}}. \qquad (15) $$

The component $\beta^T h(x_i)$ is observed, but the $\varepsilon_i$'s are random perturbations (usually considered to be independent and identically distributed).
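To make the threshold model of Eq. (15) concrete, the following minimal simulation (not from the paper; all names are illustrative) draws the perturbations $\varepsilon_i$ from a light-tailed (Gaussian) and a heavy-tailed (Cauchy) distribution and counts how often the noise flips the label implied by the sign of $\beta^T h(x_i)$.

```python
# Minimal sketch (not from the paper): simulate the latent threshold model of
# Eq. (15), y_i = -1 iff beta^T h(x_i) + eps_i < 0, with light-tailed
# (Gaussian) versus heavy-tailed (Cauchy) noise eps_i.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 3
beta = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, d))              # here h(x) = x for simplicity
score = X @ beta

for name, eps in {"gaussian": rng.normal(size=n),
                  "cauchy": rng.standard_cauchy(size=n)}.items():
    y = np.where(score + eps < 0, -1, +1)          # Eq. (15)
    flipped = np.mean(np.sign(score) != y)         # labels disagreeing with the noiseless rule
    print(f"{name:8s}  fraction of noise-flipped labels: {flipped:.3f}")
```

Heavy-tailed noise flips labels even far from the decision boundary, which foreshadows the different tail behaviour of the associated losses.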
The threshold model in Eq. (15) leads directly to

$$ \mathbb{P}(y_i = -1 \mid x_i, \beta) = F\big(-\beta^T h(x_i)\big), \qquad \mathbb{P}(y_i = +1 \mid x_i, \beta) = 1 - F\big(-\beta^T h(x_i)\big), $$

with $F$ the cumulative distribution function ("c.d.f.") of $\varepsilon$. In particular, under the assumption that $F$ is symmetric (whereby $1 - F(t) = F(-t)$ for all $t \in \mathbb{R}$), one can succinctly rewrite the probability of observing class $y$ as

$$ \mathbb{P}(y \mid x_i, \beta) = F\big(y \beta^T h(x_i)\big), \qquad (16) $$

for $y \in \{-1, +1\}$, and the likelihood of the sample is then

$$ \mathcal{L}(\{x_i, y_i\}_{i=1}^n; \beta) = \prod_{i=1}^n F\big(y_i \beta^T h(x_i)\big). $$

Hence, maximising the likelihood is equivalent to minimising

$$ L(\{x_i, y_i\}_{i=1}^n; \beta) = -\frac{1}{n} \sum_{i=1}^n \log F\big(y_i \beta^T h(x_i)\big). $$

One can thus define in a straightforward way $\ell(t) = -\log(F(t))$. Now, given a loss function $\ell$, can we find a corresponding c.d.f. $F$?

4.2. Characterisation of losses with a latent interpretation

For $F$ to be a valid c.d.f., $F$ must be non-negative, right-continuous with left limits, non-decreasing, and satisfy $\lim_{t \to -\infty} F(t) = 0$ and $\lim_{t \to +\infty} F(t) = 1$. These conditions are guaranteed if $\ell$ is continuous, non-increasing, and has limit $+\infty$ at $-\infty$ and $0$ at $+\infty$. However, the key assumption is that of symmetry, which is difficult to obtain.

Proposition 8. Under the assumptions of non-negativity, non-increasingness and continuity, the loss function $\ell$ can be expressed as a rescaled cumulative distribution function (of a symmetric latent variable) if and only if it satisfies the following functional equation:

$$ 2^{-\ell(t)/\ell(0)} + 2^{-\ell(-t)/\ell(0)} = 1, \qquad (17) $$

for all $t \in \mathbb{R}$. In this case, $F(t) = e^{-\beta \ell(t)}$ with $\beta = \log 2 / \ell(0)$.

The proof is very simple, but this result is a negative one, in the sense that not all loss functions can be written as $\ell(t) = -\log(F(t))$ for a symmetric $F$. A counterexample is the exponential loss function $\ell_{\text{Exponential}}$, leading to $F(t) = e^{-e^{-t}}$, which is a valid c.d.f. (namely that of a Gumbel distribution) but is not symmetric.

5. Regularly varying latent distributions

5.1. Brief overview

A concept closely related to that of regularly varying functions is that of regularly varying distributions (see (Cooke et al., 2014) for an introduction to the topic), which is its probabilistic equivalent insofar as it characterises the tails of distributions.

Definition 3. A cumulative distribution function $F$ is called regularly varying at infinity with tail index $\xi \in (0, +\infty)$ if

$$ \lim_{t \to +\infty} \frac{\bar F(at)}{\bar F(t)} = a^{-\xi}, \qquad (18) $$

for any $a > 0$, where $\bar F = 1 - F$ is the survival function.

It is interesting to notice that, for $a > 1$, $\bar F(at)/\bar F(t) = \mathbb{P}(X > at \mid X > t)$, where $X \sim F$.

5.2. Loss function and latent distribution tail behaviours

Under the assumption that $\ell$ is differentiable (or, equivalently, that $F$ is differentiable, hence admits a probability density function $f$), and recalling that $\ell = -\log F$ so that $\ell'(t) = -f(t)/F(t)$, we have, if $\ell$ is regularly varying with tail index $\xi$,

$$ a^{-\xi} = \lim_{t \to +\infty} \frac{\ell(at)}{\ell(t)} = a \lim_{t \to +\infty} \frac{\ell'(at)}{\ell'(t)} = a \lim_{t \to +\infty} \frac{F(t)}{F(at)} \cdot \frac{f(at)}{f(t)} = a \lim_{t \to +\infty} \frac{f(at)}{f(t)}, $$

whence $\lim_{t \to +\infty} f(at)/f(t) = a^{-(\xi + 1)}$. Similarly, since $\bar F'(t) = -f(t)$, it comes that

$$ \lim_{t \to +\infty} \frac{\bar F(at)}{\bar F(t)} = a \lim_{t \to +\infty} \frac{f(at)}{f(t)} = a^{-\xi}. $$

In other words, we have shown that the loss function $\ell$ and its associated latent distribution $F$ have the same tail index.

Proposition 9. If $F$ is regularly varying with tail index $\xi$ and differentiable (i.e., admits a probability density function $f$), then $f$ is regularly varying with tail index $\xi + 1$ and the associated loss function $\ell := -\log F$ is regularly varying with tail index $\xi$.

This is an important result linking the tail behaviour of the loss function to that of the underlying latent variable: one can understand the convergence (or not) towards a margin maximiser in terms of the distributional properties of the unobservable individual noise. We have thus connected the problem of convergence to a separating, margin-maximising hyperplane with heavy tails.
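The correspondence of Proposition 9 can be illustrated numerically. The sketch below (not from the paper; distribution choices are illustrative) builds $\ell(t) = -\log F(t)$ from a few heavy-tailed latent c.d.f.s and reads off an estimated tail index from $\ell(at)/\ell(t) \approx a^{-\xi}$ at large $t$; the survival function is used for numerical accuracy deep in the tail.

```python
# Minimal sketch (not from the paper): for a heavy-tailed latent distribution,
# the loss l(t) = -log F(t) is regularly varying and its tail index xi can be
# read off from l(at)/l(t) ~ a^(-xi) at large t (Definition 3 / Proposition 9).
import numpy as np
from scipy.stats import cauchy, t as student_t

a = 2.0
for name, dist in {"cauchy": cauchy,
                   "student (nu=2)": student_t(df=2),
                   "student (nu=4)": student_t(df=4)}.items():
    # -log F(u) computed via the survival function for accuracy when F(u) ~ 1.
    loss = lambda u: -np.log1p(-dist.sf(u))
    for t_big in (1e2, 1e3):
        xi_hat = -np.log(loss(a * t_big) / loss(t_big)) / np.log(a)
        print(f"{name:15s}  t = {t_big:.0e}   estimated tail index: {xi_hat:.3f}")
```

The estimates approach the tail index of the latent distribution (1 for the Cauchy and $\nu$ for the Student-t), in line with Proposition 9.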
Given Proposition 8, we can now produce loss functions with different behaviours based on different underlying tail indices.

5.3. Some examples

Let us now provide some concrete examples of latent distributions that are regularly varying. We restrict ourselves to the class of elliptical distributions for the sake of clarity (see (Anderson, 2004) for a textbook treatment), which still covers the majority of known use cases. We illustrate in Figure 1 the evolution of the ratio $\ell(at)/\ell(t)$ for different types of distribution (with different tail behaviours); as per Figure 2, this is connected to the tail behaviour of the underlying loss function and distribution. The case of the Gaussian and logistic distributions has already been tackled in Sections 1.2 and 1.3.

5.3.1. Cauchy distribution

The probability density function of a (standard) Cauchy distribution is given by

$$ f_{\text{Cauchy}}(t) = \frac{1}{\pi(1 + t^2)}. $$

Its c.d.f. is $F_{\text{Cauchy}}(t) = \frac{1}{2} + \frac{1}{\pi} \arctan(t)$, hence

$$ \ell_{\text{Cauchy}}(t) = -\log\left(\frac{1}{2} + \frac{1}{\pi} \arctan(t)\right). \qquad (19) $$

From the fact that $\lim_{t \to +\infty} f_{\text{Cauchy}}(at)/f_{\text{Cauchy}}(t) = a^{-2}$, we infer that $\xi_{\text{Cauchy}} = 1$, i.e., the Cauchy distribution is regularly varying with tail index 1, and so is $\ell_{\text{Cauchy}}$.

5.3.2. Student-t distribution

The Cauchy distribution is actually a particular case of the Student-t distribution, whose p.d.f. reads

$$ f_{\text{Student}}(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}, $$

where $\nu \geq 1$, the number of degrees of freedom, is a parameter governing the tail behaviour ($\nu = 1$ corresponds to the Cauchy case and $\nu = +\infty$ to the Gaussian one). We infer that the Student-t distribution has tail index $\xi_{\text{Student}} = \nu$, whence it has regularly varying tails for any finite $\nu$. We also deduce that, for $t \geq 0$,

$$ F_{\text{Student}}(t) = 1 - \frac{1}{2} I_{x(t)}\!\left(\frac{\nu}{2}, \frac{1}{2}\right), \qquad \ell_{\text{Student}}(t) = -\log\left(1 - \frac{1}{2} I_{x(t)}\!\left(\frac{\nu}{2}, \frac{1}{2}\right)\right), $$

where $x(t) := \frac{\nu}{t^2 + \nu}$ and $I$ is the regularised incomplete beta function. Let us point out that the heaviness of the Student-t distribution's tails has been found to be an interesting feature, for example by considering, as in (Shah et al., 2014), Student-t processes instead of Gaussian ones.

Figure 1. Evolution of the ratio $\ell(at)/\ell(t)$ as a function of $t$ for loss functions associated respectively with the normal, logistic, Student (with $\nu = 2$ degrees of freedom) and Cauchy distributions, and $a = 0.0001$.

We can see in Figure 1 that the ratio statistic $t \mapsto \ell(at)/\ell(t)$ explodes quickly in the Gaussian case, less quickly in the logistic case, and converges in the Student and Cauchy examples. Heavy tails play a crucial role in robustness in statistics and machine learning (cf. (Hsu and Sabato, 2016; Huber and Ronchetti, 2009)), and these examples show that loss functions may exhibit different tail behaviours (cf. Figure 2) that can have an impact on an algorithm's performance.

6. Conclusion

The primary focus of the work (Freund and Schapire, 1997; Friedman et al., 2000; Rosset et al., 2003; 2004; Schapire et al., 1997) that spurred the present paper was the relationship between support vector machines and regularisation, and their respective benefits and drawbacks. To some extent, the limiting criterion in Eq. (3) has been shown here to be a necessary condition too, but further research is warranted to weaken the assumptions. More importantly, we have considered the margin maximisation property of classifiers (such as support vector machines (Vapnik, 1998)) as a benchmark for classification tasks and endeavoured to determine the properties of loss functions that do not lead to the convergence of the normalised regularised estimator to a margin-maximising classifier.
Surprisingly, this is the case (under mild assumptions) if and only if the loss function is regularly varying, which is equivalent to the underlying latent distribution having heavy tails. Our results, while giving a precise quantitative characterisation, are more qualitative in nature, in the sense that they highlight two possible regimes for the behaviour of the normalised and regularised classifier. While usual loss functions such as the exponential, hinge, Probit or logistic losses have a similar behaviour (in terms of convergence in the separable case), heavy-tailed loss functions have fundamentally different properties. From a more practical perspective, this also shows that relying on the usual loss functions may be less innocuous than anticipated, in the sense that it assumes that there are no heavy tails. This finding is not limited to purely linear or dictionary learning models, but extends to all methods using a margin-dependent loss function (i.e., the dictionary $\mathcal{H}$ need not be fixed). Heavy tails are a growing and exciting part of the recent machine learning literature (Hsu and Sabato, 2016; Lugosi and Mendelson, 2019) and of distribution estimation (Ben-Hamou et al., 2017), and open interesting perspectives for real-life data, as the presence of heavy tails is well documented (Taleb, 2020). Some questions remain around the application of these insights to multi-class classification and the impact of regularly varying loss functions in other settings such as deep neural networks or Gaussian processes for classification.

Figure 2. Tail behaviours of the respective loss functions $\ell(t) = -\log F(t)$, in the case of the normal, logistic, Student (with $\nu = 2$ degrees of freedom) and Cauchy distributions.

Acknowledgements

The author wishes to acknowledge discussions with Dr Andrea Macrina and suggestions made by the anonymous referees that have greatly contributed to improving this paper.

References

Theodore W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, third edition, 2004.

Peter Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 2006.

Anna Ben-Hamou, Stéphane Boucheron, and Mesrob I. Ohannessian. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli, 23(1):249–287, 2017.

N. H. Bingham, C. M. Goldie, and J. L. Teugels. Regular Variation. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1987. doi: 10.1017/CBO9780511721434.

Christian Brownlees, Emilien Joly, and Gábor Lugosi. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536, 2015. doi: 10.1214/15-AOS1350. URL https://doi.org/10.1214/15-AOS1350.

Samprit Chatterjee and Ali S. Hadi. Influential observations, high leverage points, and outliers in linear regression. Statistical Science, 1(3):379–393, 1986. doi: 10.1214/ss/1177013622. URL https://doi.org/10.1214/ss/1177013622.

Roger M. Cooke, Daan Nieboer, and Jolanta Misiewicz. Regularly Varying and Subexponential Distributions, chapter 4, pages 49–63. John Wiley & Sons, Ltd, 2014. ISBN 9781119054207. doi: 10.1002/9781119054207.ch4. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9781119054207.ch4.
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997. ISSN 0022-0000. doi: 10.1006/jcss.1997.1504. URL https://doi.org/10.1006/jcss.1997.1504.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2), 2000.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA, second edition, 2009.

Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1–40, 2016. URL http://jmlr.org/papers/v17/14-273.html.

Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. Wiley, second edition, 2009.

Marat Ibragimov, Rustam Ibragimov, and Johan Walden. Heavy-Tailed Distributions and Robustness in Economics and Finance, volume 214 of Lecture Notes in Statistics. Springer, 2015. doi: 10.1007/978-3-319-16877-7.

Gábor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19:1145–1190, 2019.

O. L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24(1–2):15–23, February 1999. ISSN 0167-6377. doi: 10.1016/S0167-6377(98)00049-2. URL https://doi.org/10.1016/S0167-6377(98)00049-2.

Saharon Rosset, Ji Zhu, and Trevor Hastie. Margin maximizing loss functions. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS'03, pages 1237–1244, Cambridge, MA, USA, 2003. MIT Press.

Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941–973, December 2004. ISSN 1532-4435.

Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML'97, pages 322–330, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc. ISBN 1558604863.

Eugene Seneta. Karamata's characterization theorem, Feller, and regular variation in probability theory. Publications de l'Institut Mathématique, 71(85):79–89, 2002.

Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 877–885, Reykjavik, Iceland, 22–25 April 2014. PMLR. URL http://proceedings.mlr.press/v33/shah14.html.

Nassim Nicholas Taleb. Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications, 2020. URL https://arxiv.org/pdf/2001.10488.pdf.

Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

Hansheng Wang, Guodong Li, and Guohua Jiang. Robust regression shrinkage and consistent variable selection through the LAD-lasso. Journal of Business & Economic Statistics, 25(3):347–355, 2007. doi: 10.1198/073500106000000251.