Learning with Average Top-k Loss

Yanbo Fan^{3,4,1}, Siwei Lyu^1, Yiming Ying^2, Bao-Gang Hu^{3,4}
1 Department of Computer Science, University at Albany, SUNY
2 Department of Mathematics and Statistics, University at Albany, SUNY
3 National Laboratory of Pattern Recognition, CASIA
4 University of Chinese Academy of Sciences (UCAS)
{yanbo.fan,hubg}@nlpr.ia.ac.cn, slyu@albany.edu, yying@albany.edu

Abstract

In this work, we introduce the average top-k (ATk) loss as a new aggregate loss for supervised learning, which is the average over the k largest individual losses over a training dataset. We show that the ATk loss is a natural generalization of the two widely used aggregate losses, namely the average loss and the maximum loss, but can combine their advantages and mitigate their drawbacks to better adapt to different data distributions. Furthermore, it remains a convex function over all individual losses, which can lead to convex optimization problems that can be solved effectively with conventional gradient-based methods. We provide an intuitive interpretation of the ATk loss based on its equivalent effect on the continuous individual loss functions, suggesting that it can reduce the penalty on correctly classified data. We further give a learning theory analysis of MATk learning on the classification calibration of the ATk loss and the error bounds of ATk-SVM. We demonstrate the applicability of minimum average top-k learning for binary classification and regression using synthetic and real datasets.

1 Introduction

Supervised learning concerns the inference of a function $f: \mathcal{X} \to \mathcal{Y}$ that predicts a target $y \in \mathcal{Y}$ from data/features $x \in \mathcal{X}$ using a set of labeled training examples $\{(x_i, y_i)\}_{i=1}^{n}$. This is typically achieved by seeking a function $f$ that minimizes an aggregate loss formed from individual losses evaluated over all training samples. To be more specific, the individual loss for a sample $(x, y)$ is given by $\ell(f(x), y)$, in which $\ell$ is a nonnegative bivariate function that evaluates the quality of the prediction made by function $f$. For example, for binary classification (i.e., $y_i \in \{\pm 1\}$), commonly used forms for the individual loss include the 0-1 loss, $\mathbb{I}_{yf(x) \le 0}$, which is 1 when $y$ and $f(x)$ have different signs and 0 otherwise, the hinge loss, $\max(0, 1 - yf(x))$, and the logistic loss, $\log_2(1 + \exp(-yf(x)))$, all of which can be further simplified as the so-called margin loss, i.e., $\ell(y, f(x)) = \ell(yf(x))$. For regression, the squared difference $(y - f(x))^2$ and the absolute difference $|y - f(x)|$ are the two most popular forms of individual loss, which can be simplified as $\ell(y, f(x)) = \ell(|y - f(x)|)$. Usually the individual loss is chosen to be a convex function of its input, but recent works also propose various types of non-convex individual losses (e.g., [10, 15, 27, 28]). The supervised learning problem is then formulated as $\min_f \{\mathcal{L}(L_z(f)) + \Omega(f)\}$, where $\mathcal{L}(L_z(f))$ is the aggregate loss that accumulates all individual losses over the training samples, i.e., $L_z(f) = \{\ell_i(f)\}_{i=1}^{n}$, with $\ell_i(f)$ being the shorthand notation for $\ell(f(x_i), y_i)$, and $\Omega(f)$ is the regularizer on $f$. However, in contrast to the plethora of types of individual losses, there are only a few choices when we consider the aggregate loss:
Figure 1: Comparison of different aggregate losses on 2D synthetic datasets with n = 200 samples for binary classification: a balanced but multi-modal dataset with outliers (top) and an imbalanced dataset with outliers (bottom), with logistic loss (left) and hinge loss (right). Outliers in the data are shown as enlarged markers, and the optimal Bayes classifications are shown as shaded areas. The figures in the second and fourth columns show the misclassification rate of ATk vs. k for each case.

- the average loss: $L_{\text{avg}}(L_z(f)) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(f)$, i.e., the mean of all individual losses;
- the maximum loss: $L_{\max}(L_z(f)) = \max_{1 \le i \le n} \ell_i(f)$, i.e., the largest individual loss;
- the top-k loss [20]: $L_{\text{top-}k}(L_z(f)) = \ell_{[k]}(f)$ for $1 \le k \le n$, i.e., the k-th largest (top-k) individual loss, where we define the top-k element of a set $S = \{s_1, \dots, s_n\}$ as $s_{[k]}$, such that $s_{[1]} \ge s_{[2]} \ge \dots \ge s_{[n]}$.

The average loss is unarguably the most widely used aggregate loss, as it is an unbiased approximation to the expected risk and leads to empirical risk minimization in learning theory [1, 7, 22, 25, 26]. Further, minimizing the average loss affords simple and efficient stochastic gradient descent algorithms [3, 21]. On the other hand, the work in [20] shows that constructing the learning objective based on the maximum loss may lead to improved performance for data with separate typical and rare subpopulations. The top-k loss [20] generalizes the maximum loss, as $L_{\max}(L_z(f)) = L_{\text{top-}1}(L_z(f))$, and can alleviate the sensitivity to outliers of the latter. However, unlike the average loss or the maximum loss, the top-k loss in general does not lead to a convex learning objective, as it is not convex with respect to the individual losses $L_z(f)$.

In this work, we propose a new type of aggregate loss that we term the average top-k (ATk) loss, which is the average of the largest k individual losses, defined as:

$$L_{\text{avt-}k}(L_z(f)) = \frac{1}{k}\sum_{i=1}^{k} \ell_{[i]}(f). \qquad (1)$$

We refer to learning objectives based on minimizing the ATk loss as MATk learning. The ATk loss generalizes the average loss (k = n) and the maximum loss (k = 1), yet it is less susceptible to their corresponding drawbacks, i.e., it is less sensitive to outliers than the maximum loss and can adapt to imbalanced and/or multi-modal data distributions better than the average loss. This is illustrated with two toy examples of synthesized 2D data for binary classification in Fig. 1 (see supplementary materials for a complete illustration). As these plots show, the linear classifier obtained with the maximum loss is not optimal due to the existence of outliers, while the linear classifier corresponding to the average loss has to accommodate the requirement to minimize individual losses across all training data, and sacrifices smaller sub-clusters of data (e.g., the rare sub-population of the + class in the top row and the smaller class in the bottom row). In contrast, using the ATk loss with k = 10 can better protect such smaller sub-clusters and leads to linear classifiers closer to the optimal Bayesian linear classifier. This is also corroborated by the plots of the corresponding misclassification rate of ATk vs. the k value in Fig. 1, which show that minimum misclassification rates occur at k values other than 1 (maximum loss) or n (average loss).
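To make the aggregate losses above concrete, the short NumPy sketch below (our illustration, not code from the paper) computes the average, maximum, top-k, and average top-k (ATk) losses of Eq. (1) from a vector of individual losses.

```python
import numpy as np

def aggregate_losses(individual_losses, k):
    """Compute the four aggregate losses discussed above from a vector of
    individual losses ell_i(f), i = 1..n."""
    ell = np.asarray(individual_losses, dtype=float)
    ell_sorted = np.sort(ell)[::-1]           # ell_[1] >= ell_[2] >= ... >= ell_[n]
    avg_loss = ell.mean()                     # average loss (k = n)
    max_loss = ell_sorted[0]                  # maximum loss (k = 1)
    topk_loss = ell_sorted[k - 1]             # top-k loss: k-th largest individual loss
    atk_loss = ell_sorted[:k].mean()          # average top-k (ATk) loss, Eq. (1)
    return avg_loss, max_loss, topk_loss, atk_loss

# Example: hinge losses of n = 6 samples; with k = 2 the ATk loss averages
# only the two largest individual losses: (2.2 + 1.3) / 2 = 1.75.
losses = [0.0, 0.1, 0.0, 1.3, 0.4, 2.2]
print(aggregate_losses(losses, k=2))
```

Note that the ATk loss with k = 1 recovers the maximum loss and k = n recovers the average loss.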
The ATk loss is a tight upper bound of the top-k loss, as $L_{\text{avt-}k}(L_z(f)) \ge L_{\text{top-}k}(L_z(f))$, with equality when k = 1 or $\ell_i(f) \equiv$ constant, and it is a convex function of the individual losses (see Section 2). Indeed, we can express $\ell_{[k]}(f)$ as the difference of two convex functions, $k L_{\text{avt-}k}(L_z(f)) - (k-1) L_{\text{avt-}(k-1)}(L_z(f))$, which shows that in general $L_{\text{top-}k}(L_z(f))$ is not convex with respect to the individual losses.

In the sequel, we will provide a detailed analysis of the ATk loss and MATk learning. First, we establish a reformulation of the ATk loss as the minimum of the average of the individual losses over all training examples transformed by a hinge function. This reformulation leads to a simple and effective stochastic gradient-based algorithm for MATk learning, and interprets the effect of the ATk loss as shifting down the individual loss and truncating it at zero, to reduce the undesirable penalty on correctly classified data. When combined with the hinge function as the individual loss, the ATk aggregate loss leads to a new variant of the SVM algorithm that we term the ATk-SVM, which generalizes the C-SVM and ν-SVM algorithms [19]. We further study the learning theory of MATk learning, focusing on the classification calibration of the ATk loss function and error bounds of the ATk-SVM algorithm. This provides a theoretical lower bound on k for reliable classification performance. We demonstrate the applicability of minimum average top-k learning for binary classification and regression using synthetic and real datasets. The main contributions of this work can be summarized as follows.

- We introduce the ATk loss for supervised learning, which can balance the pros and cons of the average and maximum losses, and allows the learning algorithm to better adapt to imbalanced and multi-modal data distributions.
- We provide an algorithm and an interpretation of the ATk loss, suggesting that most existing learning algorithms can take advantage of it without a significant increase in computation.
- We further study the theoretical aspects of the ATk loss on classification calibration and error bounds of minimum average top-k learning for ATk-SVM.
- We perform extensive experiments to validate the effectiveness of MATk learning.

2 Formulation and Interpretation

The original ATk loss, though intuitive, is not convenient to work with because of the sorting procedure involved. This also obscures its connection with the statistical view of supervised learning as minimizing the expectation of the individual loss with respect to the underlying data distribution. Yet, it affords an equivalent form, which is based on the following result.

Lemma 1 (Lemma 1, [16]). $\sum_{i=1}^{k} x_{[i]}$ is a convex function of $(x_1, \dots, x_n)$. Furthermore, for $x_i \ge 0$, $i = 1, \dots, n$, we have $\sum_{i=1}^{k} x_{[i]} = \min_{\lambda \ge 0} \big\{ k\lambda + \sum_{i=1}^{n} [x_i - \lambda]_+ \big\}$, where $[a]_+ = \max\{0, a\}$ is the hinge function.

For completeness, we include a proof of Lemma 1 in the supplementary materials. Using Lemma 1, we can reformulate the ATk loss (1) as

$$L_{\text{avt-}k}(L_z(f)) = \frac{1}{k}\sum_{i=1}^{k} \ell_{[i]}(f) = \frac{n}{k} \min_{\lambda \ge 0} \left\{ \frac{1}{n}\sum_{i=1}^{n} \big[\ell_i(f) - \lambda\big]_+ + \frac{k}{n}\lambda \right\}. \qquad (2)$$

In other words, the ATk loss is equivalent (up to a constant factor) to the minimum over λ of the average of the individual losses after they are shifted and truncated by the hinge function controlled by λ. This sheds more light on the ATk loss, which is particularly easy to illustrate in the context of binary classification using margin losses, $\ell(f(x), y) = \ell(yf(x))$.
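Before turning to that interpretation, the following sketch (an illustrative check on randomly drawn losses, not from the paper) numerically verifies the reformulation in Eq. (2) by comparing the sorted-sum definition of the ATk loss with the minimization over λ; the two agree up to the tolerance of the scalar optimizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def atk_direct(ell, k):
    """ATk loss by definition: average of the k largest individual losses."""
    return np.mean(np.sort(ell)[::-1][:k])

def atk_via_lambda(ell, k):
    """ATk loss via the reformulation of Eq. (2):
       (n/k) * min_{lambda >= 0} { mean([ell_i - lambda]_+) + (k/n) * lambda }."""
    n = len(ell)
    obj = lambda lam: np.mean(np.maximum(ell - lam, 0.0)) + (k / n) * lam
    res = minimize_scalar(obj, bounds=(0.0, float(np.max(ell))), method="bounded")
    return (n / k) * res.fun

rng = np.random.default_rng(0)
ell = rng.exponential(size=50)                # nonnegative individual losses
for k in (1, 5, 25, 50):
    print(k, atk_direct(ell, k), atk_via_lambda(ell, k))   # the two columns agree
```

The objective over λ is convex and piecewise linear, with its minimum attained at (or near) the k-th largest individual loss, which is why a simple bounded scalar search suffices here.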
In binary classification, the gold standard individual loss is the 0-1 loss $\mathbb{I}_{yf(x) \le 0}$, which exerts a constant penalty 1 on examples that are misclassified by $f$ and no penalty on correctly classified examples. However, the 0-1 loss is difficult to work with, as it is neither continuous nor convex. In practice, it is usually replaced by a surrogate convex loss. Such convex surrogates afford efficient algorithms, but as continuous and convex upper bounds of the 0-1 loss, they typically also penalize correctly classified examples, i.e., for $y$ and $x$ that satisfy $yf(x) > 0$, $\ell(yf(x)) > 0$, whereas $\mathbb{I}_{yf(x) \le 0} = 0$ (Fig. 2). This implies that when the average of individual losses across all training examples is minimized, examples correctly classified by $f$ that are too close to the classification boundary may be sacrificed to reduce the average loss, as is shown in Fig. 1. In contrast, after the individual loss is combined with the hinge function, i.e., $[\ell(yf(x)) - \lambda]_+$ with $\lambda > 0$, it has the effect of shifting down the original individual loss function and truncating it at zero, see Fig. 2. This transformation of the individual loss reduces the penalties of all examples, and in particular benefits correctly classified data. In particular, if such examples are far enough from the decision boundary, then, as with the 0-1 loss, their penalty becomes zero. This reduces the likelihood of misclassifying those rare sub-populations of data that are close to the decision boundary.

Figure 2: The ATk loss interpreted at the individual loss level. The shaded area corresponds to data/targets with correct classification.

Algorithm: The reformulation of the ATk loss in Eq. (2) also facilitates the development of optimization algorithms for minimum ATk learning. As practical supervised learning problems usually use a parametric form of $f$, written $f(x; w)$, where $w$ is the parameter, the corresponding minimum ATk objective becomes

$$\min_{w, \lambda \ge 0} \; \frac{1}{n}\sum_{i=1}^{n} \big[\ell(f(x_i; w), y_i) - \lambda\big]_+ + \frac{k}{n}\lambda + \Omega(w). \qquad (3)$$

It is not hard to see that if $\ell(f(x; w), y)$ is convex with respect to $w$, the objective function in Eq. (3) is jointly convex in $w$ and $\lambda$. This leads to an immediate stochastic (projected) gradient descent method [3, 21] for solving (3). For instance, with $\Omega(w) = \frac{1}{2C}\|w\|^2$, where $C > 0$ is a regularization factor, at the $t$-th iteration the MATk objective can be minimized by first randomly sampling $(x_{i_t}, y_{i_t})$ from the training set and then updating the parameters as

$$w^{(t+1)} \leftarrow w^{(t)} - \eta_t \Big( \partial_w \ell(f(x_{i_t}; w^{(t)}), y_{i_t})\, \mathbb{I}_{[\ell(f(x_{i_t}; w^{(t)}), y_{i_t}) > \lambda^{(t)}]} + \tfrac{1}{C} w^{(t)} \Big),$$
$$\lambda^{(t+1)} \leftarrow \Big[ \lambda^{(t)} - \eta_t \Big( \tfrac{k}{n} - \mathbb{I}_{[\ell(f(x_{i_t}; w^{(t)}), y_{i_t}) > \lambda^{(t)}]} \Big) \Big]_+, \qquad (4)$$

where $\partial_w \ell(f(x; w), y)$ denotes the sub-gradient with respect to $w$, and $\eta_t \propto 1/\sqrt{t}$ is the step size.

ATk-SVM: As a general aggregate loss, the ATk loss can be combined with any functional form for the individual losses. In the case of binary classification, the ATk loss combined with the individual hinge loss for a prediction function $f$ from a reproducing kernel Hilbert space (RKHS) [18] leads to the ATk-SVM model. Specifically, we consider a function $f$ as a member of an RKHS $\mathcal{H}_K$ with norm $\|\cdot\|_K$, which is induced from a reproducing kernel $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Using the individual hinge loss, $[1 - y_i f(x_i)]_+$, the corresponding MATk learning objective in the RKHS becomes

$$\min_{f \in \mathcal{H}_K, \lambda \ge 0} \; \frac{1}{n}\sum_{i=1}^{n} \big[[1 - y_i f(x_i)]_+ - \lambda\big]_+ + \frac{k}{n}\lambda + \frac{1}{2C}\|f\|_K^2, \qquad (5)$$

where $C > 0$ is the regularization factor. Furthermore, the outer hinge function in (5) can be removed due to the following result.

Lemma 2. For $a \ge 0$ and $b \ge 0$, there holds $\big[[a - \ell]_+ - b\big]_+ = [a - b - \ell]_+$.

Proof of Lemma 2 can be found in the supplementary materials.
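As a concrete illustration of the update rule (4), here is a minimal sketch (our own, not the authors' implementation) for a linear model $f(x; w) = w^\top x$ with the individual logistic loss; the initialization, epoch count, and clipping threshold are arbitrary choices for illustration.

```python
import numpy as np

def matk_sgd(X, y, k, C=1.0, epochs=50, seed=0):
    """Minimal sketch of the MATk stochastic update (4) for a linear model
    f(x; w) = w^T x with the individual logistic loss. Labels y are in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, lam = np.zeros(d), 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / np.sqrt(t)                          # decaying step size
            margin = y[i] * X[i].dot(w)
            loss = np.logaddexp(0.0, -margin) / np.log(2.0)  # log2(1 + exp(-margin))
            active = float(loss > lam)                       # indicator I[loss > lambda]
            # sub-gradient of the logistic loss w.r.t. w (clipped for numerical stability)
            sig = 1.0 / (1.0 + np.exp(np.clip(margin, -500.0, 500.0)))
            grad_w = -y[i] * X[i] * sig / np.log(2.0)
            w -= eta * (active * grad_w + w / C)             # w-update in (4)
            lam = max(0.0, lam - eta * (k / n - active))     # projected lambda-update in (4)
    return w, lam
```

With k = n the λ-update keeps λ at 0 and the recursion reduces to ordinary regularized SGD for logistic regression, while smaller k concentrates the w-updates on the examples whose current loss exceeds λ.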
In addition, note that for any minimizer $(f_z, \lambda_z)$ of (5), comparing with the value obtained by setting $f(x) = 0$ and $\lambda = 1$ in the objective function of (5), we have
$$\frac{k}{n}\lambda_z \le \frac{1}{n}\sum_{i=1}^{n}\big[[1 - y_i f_z(x_i)]_+ - \lambda_z\big]_+ + \frac{k}{n}\lambda_z + \frac{1}{2C}\|f_z\|_K^2 \le \frac{k}{n},$$
so $0 \le \lambda_z \le 1$, which means that the minimization can be restricted to $0 \le \lambda \le 1$. Using these results and introducing $\rho = 1 - \lambda$, Eq. (5) can be rewritten as

$$\min_{f \in \mathcal{H}_K,\, 0 \le \rho \le 1} \; \frac{1}{n}\sum_{i=1}^{n} [\rho - y_i f(x_i)]_+ - \frac{k}{n}\rho + \frac{1}{2C}\|f\|_K^2. \qquad (6)$$

The ATk-SVM objective generalizes several existing SVM models. For example, when k = n, it is equivalent to the standard C-SVM [5]. When C = 1 and $K(x_i, x_i) \le 1$ for all $i$, ATk-SVM reduces to ν-SVM [19] with $\nu = k/n$. Furthermore, similar to the conventional SVM model, the dual form of (6) is a convex quadratic programming problem that can be solved efficiently. See the supplementary materials for more detailed explanations.

Choosing k. The number of top individual losses in the ATk loss is a critical parameter that affects the learning performance. In principle, using the ATk loss will not be worse than using the average or maximum loss, as both correspond to specific choices of k. In practice, k can be chosen during training using a validation dataset, as in the experiments in Section 4. Since k is an integer, a simple grid search usually suffices to find a satisfactory value. Besides, Theorem 1 in Section 3 establishes a theoretical lower bound on k, based on the Bayes error, that guarantees reliable classification. If we have information about the proportion of outliers, we can also narrow the search space of k using the fact that the ATk loss is a convex upper bound of the top-k loss, similar to [20].

3 Statistical Analysis

In this section, we address the statistical properties of the ATk objective in the context of binary classification. Specifically, we investigate the classification calibration [1] property of the general ATk objective, and derive bounds for the misclassification error of the ATk-SVM model in the framework of statistical learning theory (e.g., [1, 7, 23, 26]).

3.1 Classification Calibration under the ATk Loss

We assume the training data $z = \{(x_i, y_i)\}_{i=1}^{n}$ are i.i.d. samples from an unknown distribution $p$ on $\mathcal{X} \times \{\pm 1\}$. Let $p_{\mathcal{X}}$ be the marginal distribution of $p$ on the input space $\mathcal{X}$. Then, the misclassification error of a classifier $f: \mathcal{X} \to \{\pm 1\}$ is denoted by $R(f) = \Pr(y \ne f(x)) = E[\mathbb{I}_{yf(x) \le 0}]$. The Bayes error is given by $R^* = \inf_f R(f)$, where the infimum is over all measurable functions. No function can achieve less risk than the Bayes rule $f_c(x) = \mathrm{sign}(\eta(x) - \frac{1}{2})$, where $\eta(x) = \Pr(y = 1 \mid x)$ [8].

In practice, one uses a surrogate loss $\ell: \mathbb{R} \to [0, \infty)$ which is convex and upper-bounds the 0-1 loss. The population ℓ-risk (generalization error) is given by $\mathcal{E}_\ell(f) = E[\ell(yf(x))]$. Denote the optimal ℓ-risk by $\mathcal{E}_\ell^* = \inf_f \mathcal{E}_\ell(f)$. A very basic requirement for using such a surrogate loss $\ell$ is the so-called classification calibration (a point-wise form of Fisher consistency) [1, 14]. Specifically, a loss $\ell$ is classification calibrated with respect to distribution $p$ if, for any $x$, the minimizer $f_\ell^* = \arg\inf_f \mathcal{E}_\ell(f)$ has the same sign as the Bayes rule $f_c(x)$, i.e., $\mathrm{sign}(f_\ell^*(x)) = \mathrm{sign}(f_c(x))$ whenever $f_c(x) \ne 0$. An appealing result concerning the classification calibration of a loss function $\ell$ was obtained in [1], which states that $\ell$ is classification calibrated if $\ell$ is convex, differentiable at 0 and $\ell'(0) < 0$. In the same spirit, we investigate the classification calibration property of the ATk loss.
Specifically, we first obtain the population form of the ATk objective as the infinite-sample limit of (2): as $n \to \infty$ with $k/n \to \nu$,
$$\frac{1}{n}\sum_{i=1}^{n} \big[\ell(y_i f(x_i)) - \lambda\big]_+ + \frac{k}{n}\lambda \;\longrightarrow\; E\big[[\ell(yf(x)) - \lambda]_+\big] + \nu\lambda.$$
We then consider the optimization problem
$$(f^*, \lambda^*) = \arg\inf_{f, \lambda \ge 0} \; E\big[[\ell(yf(x)) - \lambda]_+\big] + \nu\lambda, \qquad (7)$$
where the infimum is taken over all measurable functions $f: \mathcal{X} \to \mathbb{R}$. We say the ATk (aggregate) loss is classification calibrated with respect to $p$ if $f^*$ has the same sign as the Bayes rule $f_c$. The following theorem establishes such conditions.

Theorem 1. Suppose the individual loss $\ell: \mathbb{R} \to \mathbb{R}_+$ is convex, differentiable at 0 and $\ell'(0) < 0$. Without loss of generality, assume that $\ell(0) = 1$. Let $(f^*, \lambda^*)$ be defined in (7).
(i) If $\nu > \mathcal{E}_\ell^*$ then the ATk loss is classification calibrated.
(ii) If, moreover, $\ell$ is monotonically decreasing and the ATk aggregate loss is classification calibrated, then $\nu \ge \int_{\mathcal{X}} 2\min(\eta(x), 1 - \eta(x))\, dp_{\mathcal{X}}(x)$.

The proof of Theorem 1 can be found in the supplementary materials. Parts (i) and (ii) of the above theorem address, respectively, sufficient and necessary conditions on $\nu$ for the ATk loss to be classification calibrated. Since $\ell$ is an upper-bound surrogate of the 0-1 loss, the optimal ℓ-risk $\mathcal{E}_\ell^*$ is no smaller than the Bayes error $R^*$, i.e., $\mathcal{E}_\ell^* \ge R^*$. In particular, if the individual loss $\ell$ is the hinge loss then $\mathcal{E}_\ell^* = 2R^*$. Part (i) of the above theorem indicates that the ATk aggregate loss is classification calibrated if $\nu = \lim_{n\to\infty} k/n$ is larger than the optimal generalization error $\mathcal{E}_\ell^*$ associated with the individual loss. The choice of $k > n\mathcal{E}_\ell^*$ thus guarantees classification calibration, which gives a lower bound on $k$. This result also provides a theoretical underpinning for the sensitivity to outliers of the maximum loss (the ATk loss with k = 1). If the probability of the set $\{x : \eta(x) = 1/2\}$ is zero, then $R^* = \int_{\mathcal{X}} \min(\eta(x), 1 - \eta(x))\, dp_{\mathcal{X}}(x) = \int_{\eta(x) \ne 1/2} \min(\eta(x), 1 - \eta(x))\, dp_{\mathcal{X}}(x)$. Theorem 1 indicates that in this case, if the maximum loss is calibrated, one must have $\frac{1}{n} = \nu \ge 2R^*$. In other words, as the number of training data increases, the Bayes error has to be arbitrarily small, which is consistent with the empirical observation that the maximum loss works well in the well-separable data setting but is sensitive to outliers and non-separable data.

3.2 Error Bounds of ATk-SVM

We next study the excess misclassification error of the ATk-SVM model, i.e., $R(\mathrm{sign}(f_z)) - R^*$. Let $(f_z, \rho_z)$ be the minimizer of the ATk-SVM objective (6) in the RKHS setting. Let $f_{\mathcal{H}}$ be the minimizer of the generalization error over the RKHS $\mathcal{H}_K$, i.e., $f_{\mathcal{H}} = \arg\min_{f \in \mathcal{H}_K} \mathcal{E}_h(f)$, where we use the notation $\mathcal{E}_h(f) = E\big[[1 - yf(x)]_+\big]$ to denote the ℓ-risk of the hinge loss. In the finite-dimensional case, the existence of $f_{\mathcal{H}}$ follows from the direct method of the calculus of variations, as $\mathcal{E}_h(\cdot)$ is bounded below by zero, coercive, and weakly sequentially lower semi-continuous by its convexity. For an infinite-dimensional $\mathcal{H}_K$, we assume the existence of $f_{\mathcal{H}}$. We also assume that $\mathcal{E}_h(f_{\mathcal{H}}) < 1$, since even a naïve zero classifier achieves $\mathcal{E}_h(0) = 1$. Denote the approximation error by $\mathcal{A}(\mathcal{H}_K) = \inf_{f \in \mathcal{H}_K} \mathcal{E}_h(f) - \mathcal{E}_h(f_c) = \mathcal{E}_h(f_{\mathcal{H}}) - \mathcal{E}_h(f_c)$, and let $\kappa = \sup_{x \in \mathcal{X}} \sqrt{K(x, x)}$. The main theorem can be stated as follows.

Theorem 2. Consider the ATk-SVM in RKHS (6). For any $\varepsilon \in (0, 1]$ and $\mu \in (0, 1 - \mathcal{E}_h(f_{\mathcal{H}}))$, choose $k = \lceil n(\mathcal{E}_h(f_{\mathcal{H}}) + \mu) \rceil$. Then it holds that
$$\Pr\left\{ R(\mathrm{sign}(f_z)) - R^* \ge \mu + \mathcal{A}(\mathcal{H}_K) + \varepsilon + \frac{1 + C_{\kappa,\mathcal{H}}}{\sqrt{n}\,\mu} \right\} \le 2\exp\left( -\frac{n\mu^2\varepsilon^2}{(1 + C_{\kappa,\mathcal{H}})^2} \right),$$
where $C_{\kappa,\mathcal{H}} = \kappa(2\sqrt{2C} + 4\|f_{\mathcal{H}}\|_K)$. The complete proof of Theorem 2 is given in the supplementary materials.
The main idea is to show that $\rho_z$ is bounded from below by a positive constant with high probability, and then to bound the excess misclassification error $R(\mathrm{sign}(f_z)) - R^*$ by $\mathcal{E}_h(f_z/\rho_z) - \mathcal{E}_h(f_c)$. If $K$ is a universal kernel then $\mathcal{A}(\mathcal{H}_K) = 0$ [23]. In this case, let $\mu = \varepsilon \in (0, 1 - \mathcal{E}_h(f_{\mathcal{H}}))$; then from Theorem 2 we have
$$\Pr\left\{ R(\mathrm{sign}(f_z)) - R^* \ge 2\varepsilon + \frac{1 + C_{\kappa,\mathcal{H}}}{\sqrt{n}\,\varepsilon} \right\} \le 2\exp\left( -\frac{n\varepsilon^4}{(1 + C_{\kappa,\mathcal{H}})^2} \right).$$
Consequently, choosing $C$ such that $\lim_{n\to\infty} C/n = 0$, which is equivalent to $\lim_{n\to\infty} (1 + C_{\kappa,\mathcal{H}})^2/n = 0$, $R(\mathrm{sign}(f_z))$ can be arbitrarily close to the Bayes error $R^*$, with high probability, as long as $n$ is sufficiently large.

4 Experiments

We have demonstrated that the ATk loss provides a continuum between the average loss and the maximum loss, which can potentially alleviate their drawbacks. A natural question is whether this advantage actually benefits practical learning problems. In this section, we demonstrate the behavior of MATk learning coupled with different individual losses for binary classification and regression on synthetic and real datasets, with minimization of the average loss and the maximum loss treated as the special cases k = n and k = 1, respectively. For simplicity, in all experiments we use homogeneous linear prediction functions $f(x) = w^\top x$ with parameter $w$ and the Tikhonov regularizer $\Omega(w) = \frac{1}{2C}\|w\|^2$, and optimize the MATk learning objective with the stochastic gradient descent method given in (4).

Binary Classification: We conduct experiments on binary classification using eight benchmark datasets from the UCI (https://archive.ics.uci.edu/ml/datasets.html) and KEEL (http://sci2s.ugr.es/keel/datasets.php) data repositories to illustrate the potential effects of using the ATk loss in practical learning to adapt to different underlying data distributions. A detailed description of the datasets is given in the supplementary materials. The standard individual logistic loss and hinge loss are combined with the different aggregate losses. Note that the average loss combined with the individual logistic loss corresponds to the logistic regression model, and the average loss combined with the individual hinge loss leads to the C-SVM algorithm [5]. For each dataset, we randomly sample 50%, 25%, 25% of the examples as training, validation and testing sets, respectively. During training, we select the parameters C (regularization factor) and k (number of top losses) on the validation set. Parameter C is searched on grids of log10 scale in the range $[10^{-5}, 10^5]$ (extended when the optimal value is on the boundary), and k is searched on grids of log10 scale in the range $[1, n]$. We use $k^*$ to denote the optimal k selected on the validation set.

| Dataset | Maximum (logistic) | Average (logistic) | ATk (logistic) | Maximum (hinge) | Average (hinge) | ATk (hinge) |
|---|---|---|---|---|---|---|
| Monk | 22.41(2.95) | 20.46(2.02) | 16.76(2.29) | 22.04(3.08) | 18.61(3.16) | 17.04(2.77) |
| Australian | 19.88(6.64) | 14.27(3.22) | 11.70(2.82) | 19.82(6.56) | 14.74(3.10) | 12.51(4.03) |
| Madelon | 47.85(2.51) | 40.68(1.43) | 39.65(1.72) | 48.55(1.97) | 40.58(1.86) | 40.18(1.64) |
| Splice | 23.57(1.93) | 17.25(0.93) | 16.12(0.97) | 23.40(2.10) | 16.25(1.12) | 16.23(0.97) |
| Spambase | 21.30(3.05) | 8.36(0.97) | 8.36(0.97) | 21.03(3.26) | 7.40(0.72) | 7.40(0.72) |
| German | 28.24(1.69) | 25.36(1.27) | 23.28(1.16) | 27.88(1.61) | 24.16(0.89) | 23.80(1.05) |
| Titanic | 26.50(3.35) | 22.77(0.82) | 22.44(0.84) | 25.45(2.52) | 22.82(0.74) | 22.02(0.77) |
| Phoneme | 28.67(0.58) | 25.50(0.88) | 24.17(0.89) | 28.81(0.62) | 22.88(1.01) | 22.88(1.01) |

Table 1: Average misclassification rate (%) of different learning objectives over 8 datasets. The best results are shown in bold, with results that are not significantly different from the best results underlined.
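The validation protocol described above can be sketched as follows (an illustrative outline reusing the matk_sgd sketch from Section 2; the grid resolutions and helper itself are our assumptions, not the authors' code).

```python
import numpy as np

def select_C_and_k(X_tr, y_tr, X_val, y_val):
    """Sketch of the model selection described above: grid-search C on a log10
    scale and k on a log10 scale in [1, n], scoring on the validation split."""
    n = len(y_tr)
    C_grid = [10.0 ** p for p in range(-5, 6)]
    k_grid = np.unique(np.logspace(0, np.log10(n), num=20, dtype=int))
    best = (None, None, np.inf)
    for C in C_grid:
        for k in k_grid:
            w, _ = matk_sgd(X_tr, y_tr, k=int(k), C=C)        # sketch from Section 2
            err = np.mean(np.sign(X_val.dot(w)) != y_val)     # validation error
            if err < best[2]:
                best = (C, int(k), err)
    return best  # (C*, k*, validation misclassification rate)
```

The selected pair (C*, k*) would then be used for the final model whose test-set misclassification rate is reported, as in Table 1.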
Figure 3: Plots of misclassification rate on the testing set vs. k on four datasets.

We report the average performance over 10 random splits into training/validation/testing for each dataset, with MATk learning objectives formed from the individual logistic loss and hinge loss. Table 1 gives the experimental results in terms of misclassification rate (results in terms of other classification quality metrics are given in the supplementary materials). Note that on these datasets the average loss consistently outperforms the maximum loss, but the performance can be further improved with the ATk loss, which is more adaptable to different data distributions. This advantage of the ATk loss is particularly conspicuous on the Monk and Australian datasets.

To further understand the behavior of MATk learning on individual datasets, we show plots of the misclassification rate on the testing set vs. k for four representative datasets in Fig. 3 (in which C is fixed to $10^2$ and $k \in [1, n]$). As these plots show, on all four datasets there is a clear range of k values with better classification performance than the two extreme cases k = 1 and k = n, corresponding to the maximum and average loss, respectively. To be more specific, when k = 1, potential noise and outliers have the strongest negative effect on the learned classifier, and the resulting classification performance is poor. As k increases, the negative effect of noise and outliers diminishes and the classification performance improves; this is most noticeable on the Monk, Australian and Splice datasets. However, if k keeps increasing, the classification performance may degrade again (e.g., when k = n). This may be because, as k increases, more and more well-classified samples are included and the non-zero losses on these samples negatively affect the learned classifier (see our analysis in Section 2); this is especially visible on the Monk, Australian and Phoneme datasets.

Regression. Next, we report experimental results of linear regression on one synthetic dataset (Sinc) and three real datasets from [4], with a detailed description of these datasets given in the supplementary materials. The standard square loss and absolute loss are adopted as individual losses. Note that the average loss coupled with the individual square loss is the standard ridge regression model, and the average loss coupled with the individual absolute loss reduces to ν-SVR [19]. We normalize the target output to [0, 1] and report the root mean square error (RMSE) in Table 2, with optimal C and k obtained by a grid search as in the classification case (performance in terms of mean absolute error (MAE) is given in the supplementary materials). Similar to the classification case, using the ATk loss usually improves performance in comparison to the average loss or maximum loss.

| Dataset | Maximum (square) | Average (square) | ATk (square) | Maximum (absolute) | Average (absolute) | ATk (absolute) |
|---|---|---|---|---|---|---|
| Sinc | 0.2790(0.0449) | 0.1147(0.0060) | 0.1139(0.0057) | 0.1916(0.0771) | 0.1188(0.0067) | 0.1161(0.0060) |
| Housing | 0.1531(0.0226) | 0.1065(0.0132) | 0.1050(0.0132) | 0.1498(0.0125) | 0.1097(0.0180) | 0.1082(0.0189) |
| Abalone | 0.1544(0.1012) | 0.0800(0.0026) | 0.0797(0.0026) | 0.1243(0.0283) | 0.0814(0.0029) | 0.0811(0.0027) |
| Cpusmall | 0.2895(0.0722) | 0.1001(0.0035) | 0.0998(0.0037) | 0.2041(0.0933) | 0.1170(0.0061) | 0.1164(0.0062) |

Table 2: Average RMSE on four datasets. The best results are shown in bold, with results that are not significantly different from the best results underlined.

5 Related Works

Most work on learning objectives focuses on designing individual losses, and only a few are dedicated to new forms of aggregate losses. Recently, aggregate losses that consider the order of the training data have been proposed in curriculum learning [2] and self-paced learning [11, 9], which suggest organizing the training process in several passes, with samples included from easy to hard gradually.
It is interesting to note that each pass of self-paced learning [11] is equivalent to minimizing the average of the k smallest individual losses, i.e., $\frac{1}{k}\sum_{i=n-k+1}^{n} \ell_{[i]}(f)$, which we term the average bottom-k loss, in contrast to the average top-k loss in our case. In [20], the pros and cons of the maximum loss and the average loss are compared, and the top-k loss, i.e., $\ell_{[k]}(f)$, is advocated as a remedy to the problems of both. However, unlike the ATk loss, in general neither the average bottom-k loss nor the top-k loss is a convex function of the individual losses.

Minimizing top-k errors has also been used in individual losses. For ranking problems, the work of [17, 24] describes a form of individual loss that gives more weight to the top examples in a ranked list. In multi-class classification, the top-1 loss is commonly used, which incurs a penalty when the top-1 predicted class is not the same as the target class label [6]. This has been further extended in [12, 13] to the top-k multi-class loss, in which, for a class label that can take m different values, the classifier is only penalized when the correct value does not show up among the top k most confident predicted values. As individual losses, these works are complementary to the ATk loss, and they can be combined to improve learning performance.

6 Discussion

In this work, we introduce the average top-k (ATk) loss as a new aggregate loss for supervised learning, which is the average over the k largest individual losses over a training dataset. We show that the ATk loss is a natural generalization of the two widely used aggregate losses, namely the average loss and the maximum loss, but can combine their advantages and mitigate their drawbacks to better adapt to different data distributions. We demonstrate that the ATk loss can better protect small subsets of hard samples from being swamped by a large number of easy ones, especially for imbalanced problems. Furthermore, it remains a convex function over all individual losses, which can lead to convex optimization problems that can be solved effectively with conventional gradient-based methods. We provide an intuitive interpretation of the ATk loss based on its equivalent effect on the continuous individual loss functions, suggesting that it can reduce the penalty on correctly classified data. We further study the theoretical aspects of the ATk loss on classification calibration and error bounds of minimum average top-k learning for ATk-SVM. We demonstrate the applicability of minimum average top-k learning for binary classification and regression using synthetic and real datasets. There are many interesting questions left unanswered regarding using the ATk loss as a learning objective.
Currently, we use conventional gradient-based algorithms for its optimization, but we are investigating special instantiations of MATk learning for which more efficient optimization methods can be developed. Furthermore, the ATk loss can also be used for unsupervised learning problems (e.g., clustering), which is a focus of our subsequent study. It is also of practical importance to combine the ATk loss with other successful learning paradigms such as deep learning, and to apply it to large-scale real-life datasets. Lastly, it would be very interesting to derive error bounds of MATk learning with general individual loss functions.

7 Acknowledgments

We thank the anonymous reviewers for their constructive comments. This work was completed when the first author was a visiting student at SUNY Albany, supported by a scholarship from the University of Chinese Academy of Sciences (UCAS). Siwei Lyu is supported by the National Science Foundation (NSF, Grant IIS-1537257), and Yiming Ying is supported by the Simons Foundation (#422504) and the 2016-2017 Presidential Innovation Fund for Research and Scholarship (PIFRS) program from SUNY Albany. This work is also partially supported by the National Science Foundation of China (NSFC, Grant 61620106003) for Bao-Gang Hu and Yanbo Fan.

References

[1] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, pages 41–48, 2009.
[3] O. Bousquet and L. Bottou. The tradeoffs of large scale learning. In NIPS, pages 161–168, 2008.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. TIST, 2(3):27, 2011.
[5] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[6] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265–292, 2001.
[7] E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for regularized least-squares algorithm in learning theory. Foundations of Computational Mathematics, 5(1):59–85, 2005.
[8] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer Science & Business Media, 2013.
[9] Y. Fan, R. He, J. Liang, and B.-G. Hu. Self-paced learning: An implicit regularization perspective. In AAAI, pages 1877–1883, 2017.
[10] R. He, W.-S. Zheng, and B.-G. Hu. Maximum correntropy criterion for robust face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1561–1576, 2011.
[11] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, pages 1189–1197, 2010.
[12] M. Lapin, M. Hein, and B. Schiele. Top-k multiclass SVM. In NIPS, pages 325–333, 2015.
[13] M. Lapin, M. Hein, and B. Schiele. Loss functions for top-k error: Analysis and insights. In CVPR, pages 1468–1477, 2016.
[14] Y. Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.
[15] H. Masnadi-Shirazi and N. Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In NIPS, pages 1049–1056, 2009.
[16] W. Ogryczak and A. Tamir. Minimizing the sum of the k largest functions in linear time. Information Processing Letters, 85(3):117–122, 2003.
[17] C. Rudin. The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list.
Journal of Machine Learning Research, 10(Oct):2233–2271, 2009.
[18] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[19] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.
[20] S. Shalev-Shwartz and Y. Wexler. Minimizing the maximal loss: How and why. In ICML, 2016.
[21] N. Srebro and A. Tewari. Stochastic optimization for machine learning. ICML Tutorial, 2010.
[22] I. Steinwart. On the optimal parameter choice for ν-support vector machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1274–1284, 2003.
[23] I. Steinwart and A. Christmann. Support Vector Machines. Springer Science & Business Media, 2008.
[24] N. Usunier, D. Buffoni, and P. Gallinari. Ranking with ordered weighted pairwise classification. In ICML, pages 1057–1064, 2009.
[25] V. Vapnik. Statistical Learning Theory, volume 1. Wiley, New York, 1998.
[26] Q. Wu, Y. Ying, and D.-X. Zhou. Learning rates of least-square regularized regression. Foundations of Computational Mathematics, 6(2):171–192, 2006.
[27] Y. Wu and Y. Liu. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974–983, 2007.
[28] Y. Yu, M. Yang, L. Xu, M. White, and D. Schuurmans. Relaxed clipping: A global training method for robust regression and classification. In NIPS, pages 2532–2540, 2010.