Published as a conference paper at ICLR 2017

TIGHTER BOUNDS LEAD TO IMPROVED CLASSIFIERS

Nicolas Le Roux
Criteo Research
nicolas@le-roux.name

ABSTRACT

The standard approach to supervised classification involves the minimization of a log-loss as an upper bound to the classification error. While this is a tight bound early on in the optimization, it overemphasizes the influence of incorrectly classified examples far from the decision boundary. Updating the upper bound during the optimization leads to improved classification rates while transforming the learning into a sequence of minimization problems. In addition, in the context where the classifier is part of a larger system, this modification makes it possible to link the performance of the classifier to that of the whole system, allowing the seamless introduction of external constraints.

1 INTRODUCTION

Classification aims at mapping inputs $X \in \mathcal{X}$ to one or several classes $y \in \mathcal{Y}$. For instance, in object categorization, $\mathcal{X}$ will be the set of images depicting an object, usually represented by the RGB values of each of their pixels, and $\mathcal{Y}$ will be a set of object classes, such as "car" or "dog". We shall assume we are given a training set comprised of N independent and identically distributed labeled pairs $(X_i, y_i)$.

The standard approach to solve the problem is to define a parameterized class of functions $p(y|X, \theta)$ indexed by $\theta$ and to find the parameter $\theta^*$ which minimizes the log-loss, i.e.

$$\theta^* = \arg\min_\theta \; -\frac{1}{N} \sum_i \log p(y_i|X_i, \theta) \quad (1.1)$$
$$= \arg\min_\theta \; L_{\log}(\theta) , \qquad L_{\log}(\theta) = -\frac{1}{N} \sum_i \log p(y_i|X_i, \theta) . \quad (1.2)$$

One justification for minimizing $L_{\log}(\theta)$ is that $\theta^*$ is the maximum likelihood estimator, i.e. the parameter which maximizes the likelihood of the data:

$$\theta^* = \arg\max_\theta \; p(D|\theta) = \arg\max_\theta \; \prod_i p(y_i|X_i, \theta) .$$

There is another reason to use Eq. 1.1. The goal we are ultimately interested in is minimizing the classification error. If we assume that our classifier is stochastic and outputs a class according to $p(y|X_i, \theta)$, then the expected classification error is the probability of choosing the incorrect class[a]. This translates to

$$L(\theta) = \frac{1}{N} \sum_i \bigl(1 - p(y_i|X_i, \theta)\bigr) = 1 - \frac{1}{N} \sum_i p(y_i|X_i, \theta) . \quad (1.3)$$

[a] In practice, we choose the class deterministically and output $\arg\max_y p(y|X_i, \theta)$.

This is a highly nonconvex function of $\theta$, which makes its minimization difficult. However, using the fact that $t \geq 1 + \log t$ for every positive t, we have

$$L(\theta) = 1 - \frac{1}{N} \sum_i p(y_i|X_i, \theta) \;\leq\; 1 - \frac{1}{N} \sum_i \frac{1}{K}\bigl(1 + \log p(y_i|X_i, \theta) + \log K\bigr) \;=\; \frac{K - 1 - \log K}{K} + \frac{L_{\log}(\theta)}{K} ,$$

where $K = |\mathcal{Y}|$ is the number of classes (assumed finite). Thus, minimizing $L_{\log}(\theta)$ is equivalent to minimizing an upper bound of $L(\theta)$. Further, this bound is tight when $p(y_i|X_i, \theta) = 1/K$ for all $y_i$. Since a model with randomly initialized parameters will assign probabilities close to $1/K$ to each class, it makes sense to minimize $L_{\log}(\theta)$ rather than $L(\theta)$ early on in the optimization.

However, this bound becomes looser as $\theta$ moves away from its initial value. In particular, poorly classified examples, for which $p(y_i|X_i, \theta)$ is close to 0, have a strong influence on the gradient of $L_{\log}(\theta)$ despite having very little influence on the gradient of $L(\theta)$. The model will thus waste capacity trying to bring these examples closer to the decision boundary rather than correctly classifying those already close to the boundary. This will be especially noticeable when the model has limited capacity, i.e. in the underfitting setting.
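To make this concrete, consider a binary logistic model with score z and correct-class probability p = σ(z). The per-example gradient of the log-loss with respect to z has magnitude 1 − p, which approaches 1 for badly misclassified points, while the gradient of the true error 1 − p has magnitude p(1 − p), which vanishes for those same points. The short numerical sketch below is purely illustrative (it is not part of the paper) and simply prints both quantities:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sketch (not from the paper): per-example gradient magnitudes with
# respect to the score z of a binary logistic model, p = sigmoid(z).
#   log-loss   -log p : |d/dz| = 1 - p    -> stays close to 1 for badly classified points
#   true error  1 - p : |d/dz| = p(1 - p) -> vanishes for badly classified points
for z in [2.0, 0.0, -2.0, -6.0]:   # from well classified to badly misclassified
    p = sigmoid(z)
    print(f"p={p:7.4f}  |grad log-loss|={1 - p:6.4f}  |grad error|={p * (1 - p):8.6f}")
```

Badly misclassified points (p close to 0) thus dominate the gradient of the log-loss while contributing almost nothing to the gradient of the classification error.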
Section 2 proposes a tighter bound of the classification error as well as an iterative scheme to optimize it easily. Section 3 evaluates this iterative scheme using generalized linear models over a variety of datasets to estimate its impact. Section 4 then proposes a link between supervised learning and reinforcement learning, revisiting common techniques in a new light. Finally, Section 5 concludes and proposes future directions.

2 TIGHTER BOUNDS ON THE CLASSIFICATION ERROR

We now present a general class of upper bounds of the classification error which will prove useful when the model is far from its initialization.

Lemma 1. Let
$$p_\nu(y|X, \theta) = p(y|X, \nu)\left(1 + \log\frac{p(y|X, \theta)}{p(y|X, \nu)}\right) \quad (2.1)$$
with ν any value of the parameters. Then we have
$$p_\nu(y|X, \theta) \leq p(y|X, \theta) . \quad (2.2)$$
Further, if ν = θ, we have
$$p_\theta(y|X, \theta) = p(y|X, \theta) , \quad (2.3)$$
$$\left.\frac{\partial p_\nu(y|X, \theta)}{\partial \theta}\right|_{\nu=\theta} = \frac{\partial p(y|X, \theta)}{\partial \theta} . \quad (2.4)$$

Proof. We have
$$p(y|X, \theta) = p(y|X, \nu)\,\frac{p(y|X, \theta)}{p(y|X, \nu)} \;\geq\; p(y|X, \nu)\left(1 + \log\frac{p(y|X, \theta)}{p(y|X, \nu)}\right) = p_\nu(y|X, \theta) .$$
The second step stems from the inequality $t \geq 1 + \log t$. The equality $p_\theta(y|X, \theta) = p(y|X, \theta)$ is immediate when setting θ = ν in Eq. 2.1. Differentiating $p_\nu(y|X, \theta)$ with respect to θ yields
$$\frac{\partial p_\nu(y|X, \theta)}{\partial \theta} = p(y|X, \nu)\,\frac{\partial \log p(y|X, \theta)}{\partial \theta} = \frac{p(y|X, \nu)}{p(y|X, \theta)}\,\frac{\partial p(y|X, \theta)}{\partial \theta} .$$
Taking θ = ν on both sides yields Eq. 2.4.

Lemma 1 suggests that, if the current set of parameters is $\theta_t$, an appropriate upper bound on the classification error is
$$L(\theta) = 1 - \frac{1}{N}\sum_i p(y_i|X_i, \theta) \;\leq\; 1 - \frac{1}{N}\sum_i p(y_i|X_i, \theta_t)\left(1 + \log\frac{p(y_i|X_i, \theta)}{p(y_i|X_i, \theta_t)}\right) = C - \frac{1}{N}\sum_i p(y_i|X_i, \theta_t) \log p(y_i|X_i, \theta) ,$$
where C is a constant independent of θ. We shall denote
$$L_{\theta_t}(\theta) = -\frac{1}{N}\sum_i p(y_i|X_i, \theta_t) \log p(y_i|X_i, \theta) . \quad (2.5)$$

One possibility is to recompute the bound after every gradient step. This is exactly equivalent to directly minimizing L(θ). Such a procedure is brittle: Eq. 2.5 indicates that, if an example is poorly classified early on, its gradient will be close to 0 and it will be difficult to recover from this situation. Thus, we propose using Algorithm 1 for supervised learning:

Data: a dataset D comprising (X_i, y_i) pairs, initial parameters θ_0
Result: final parameters θ_T
for t = 0 to T − 1 do
    θ_{t+1} = arg min_θ L_{θ_t}(θ) = arg min_θ −Σ_i p(y_i|X_i, θ_t) log p(y_i|X_i, θ)
end
Algorithm 1: Iterative supervised learning

By regularly recomputing the bound, we ensure that it remains close to the quantity we are interested in and that we do not waste time optimizing a loose bound.
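As a concrete illustration of Algorithm 1, here is a minimal sketch for a multinomial logistic regression. It assumes plain full-batch gradient descent for the inner minimization (the experiments in Section 3 use the SAG solver instead), and all function names, step sizes and iteration counts are illustrative choices rather than the paper's:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def iterative_supervised_learning(X, y, K, T=10, inner_steps=200, lr=0.1):
    """Sketch of Algorithm 1: repeatedly minimize the reweighted log-loss
    L_{theta_t}(theta) = -(1/N) sum_i p(y_i|X_i, theta_t) log p(y_i|X_i, theta)."""
    N, d = X.shape
    theta = np.zeros((d, K))
    Y = np.eye(K)[y]                                # one-hot labels, shape (N, K)
    for t in range(T):
        # Importance weights p(y_i|X_i, theta_t), frozen for this outer iteration.
        w = (softmax(X @ theta) * Y).sum(axis=1)    # shape (N,)
        for _ in range(inner_steps):
            P = softmax(X @ theta)
            # Gradient of -(1/N) sum_i w_i log p(y_i|X_i, theta) for the softmax model.
            grad = X.T @ (w[:, None] * (P - Y)) / N
            theta -= lr * grad
    return theta

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
theta = iterative_supervised_learning(X, y, K=2, T=5)
print("training error:", np.mean(softmax(X @ theta).argmax(axis=1) != y))
```

With θ_0 = 0, the initial importance weights all equal 1/K, so the first outer iteration is exactly a (rescaled) ordinary logistic regression; subsequent iterations progressively down-weight examples that remain poorly classified.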
The idea of computing tighter bounds during the optimization is not new. In particular, several authors used a CCCP-based procedure (Yuille & Rangarajan, 2003) to achieve tighter bounds for SVMs (Xu et al., 2006; Collobert et al., 2006; Ertekin et al., 2011). Though Collobert et al. (2006) show a small improvement of the test error, their primary goal was to reduce the number of support vectors to keep the testing time manageable. The algorithm proposed by Ertekin et al. (2011) requires setting a hyperparameter, s, which has a strong influence on the final solution (see Fig. 5 in their paper). Finally, we are not aware of similar ideas in the context of the logistic loss. Additionally, our idea extends naturally to the case where p is a complicated function of θ and not easily written as a sum of a convex and a concave function. This might lead to nonconvex inner optimizations, but we believe that this can still yield lower classification error. A longer study in the case of deep networks is planned.

REGULARIZATION

As this procedure further optimizes the training classification accuracy, regularization is often needed. The standard optimization procedure minimizes the following regularized objective:
$$\theta^* = \arg\min_\theta \; -\sum_i \log p(y_i|X_i, \theta) + \lambda\,\Omega(\theta) \;=\; \arg\min_\theta \; -\sum_i \frac{1}{K}\log p(y_i|X_i, \theta) + \frac{\lambda}{K}\,\Omega(\theta) .$$
Thus, we can view this as the minimization of an upper bound (up to an additive constant) of the following "true" objective:
$$\theta^* = \arg\min_\theta \; -\sum_i p(y_i|X_i, \theta) + \frac{\lambda}{K}\,\Omega(\theta) ,$$
which can then be optimized using Algorithm 1.

ONLINE LEARNING

Because of its iterative nature, Algorithm 1 is adapted to the batch setting. However, in many cases we have access to a stream of data and cannot recompute the importance weights on all the points. A natural way around this problem is to select a parameter vector and to use it as ν for the subsequent examples. One can see this as crystallizing the current solution, as the value of ν chosen will affect all subsequent gradients.

3 EXPERIMENTS

We evaluated the impact of using tighter bounds on the expected misclassification rate on several datasets, each described in its own section. The experimental setup was the same for all datasets. We first set aside part of the dataset to compose the test set. We then performed k-fold cross-validation, using a generalized linear model, on the remaining datapoints for different values of T, the number of times the importance weights were recomputed, and of the ℓ2-regularizer λ. For each value of T, we selected the set of hyperparameters (λ and the number of iterations) which achieved the lowest validation classification error, then computed the test error for each of the k models (one per fold) with these hyperparameters. This allowed us to obtain confidence intervals on the test error, where the random variable is the training set but not the test set. For a fair comparison, each internal optimization was run for Z updates so that the total number of updates ZT was constant. Each update was computed on a randomly chosen minibatch of 50 datapoints using the SAG algorithm (Le Roux et al., 2012). Since we used a generalized linear model, each internal optimization was convex and thus had no optimization hyperparameter.
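Tying the regularized objective and this protocol together, the following hedged sketch runs T outer reweightings with a fixed total budget of Z·T updates, using plain minibatch SGD with an ℓ2 penalty in place of the SAG solver used in the paper; the helper names, step size and default budget are ours:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def outer_iterations(X, y, K, T, Z_total=10**4, lam=1e-4, lr=0.05, batch=50, seed=0):
    """Sketch: T outer reweightings, Z = Z_total // T minibatch-SGD updates each,
    so the total number of updates Z*T is held constant across values of T.
    Inner objective: -(1/N) sum_i w_i log p(y_i|X_i, theta) + (lam/K) ||theta||^2."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Y = np.eye(K)[y]
    theta = np.zeros((d, K))
    Z = Z_total // T
    for t in range(T):
        w = (softmax(X @ theta) * Y).sum(axis=1)   # importance weights, frozen
        for _ in range(Z):
            idx = rng.choice(N, size=batch, replace=False)
            P = softmax(X[idx] @ theta)
            grad = X[idx].T @ (w[idx][:, None] * (P - Y[idx])) / batch
            theta -= lr * (grad + 2 * lam / K * theta)
        # lam and the stopping iteration would be chosen by cross-validation
        # on the classification error, as in the experimental protocol above.
    return theta
```

The paper's budgets (Z·T = 1e8 updates) are of course much larger; the defaults here merely keep the sketch runnable.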
Fig. 1 presents the training classification errors for all the datasets.

3.1 COVERTYPE BINARY DATASET

The Covertype binary dataset (Collobert et al., 2002) has 581012 datapoints in dimension 54 and 2 classes. We used the first 90% for the cross-validation and the last 10% for testing. Due to the small dimension of the input, linear models strongly underfit, a regime in which tighter bounds are most beneficial. We see in Fig. 2 that using T > 1 leads to much lower training and validation classification errors. Training and validation curves are presented in Fig. 2 and the test classification error is listed in Table 1.

3.2 ALPHA DATASET

The Alpha dataset is a binary classification dataset used in the Pascal Large-Scale challenge and contains 500000 samples in dimension 500. We used the first 400000 examples for the cross-validation and the last 100000 for testing. A logistic regression trained on this dataset overfits quickly and, as a result, the results for all values of T are equivalent. Training and validation curves are presented in Fig. 3 and the test classification error is listed in Table 2.

Figure 1: Training classification errors for covertype (top left), alpha (top right), MNIST (bottom left) and IJCNN (bottom right). We can immediately see that all values of T > 1 yield significantly lower errors than the standard log-loss (the confidence intervals represent 3 standard deviations).

Figure 2: Training (top) and validation (bottom) negative log-likelihood (left) and classification error (right) for the covertype dataset. We only display the result for the value of λ yielding the lowest validation error. As soon as the importance weights are recomputed, the NLL increases and the classification error decreases (the confidence intervals represent 3 standard deviations).

T      Z      Test error ± 3σ (%)
1000   1e5    32.88 ± 0.07
100    1e6    32.96 ± 0.06
10     1e7    32.85 ± 0.06
1      1e8    36.32 ± 0.06

Table 1: Test error of the models reaching the best validation error for various values of T on the covertype dataset. Any value of T greater than 1 leads to a significant improvement over the standard log-loss (the confidence intervals represent 3 standard deviations).

Figure 3: Training (top) and validation (bottom) negative log-likelihood (left) and classification error (right) for the alpha dataset. We only display the result for the value of λ yielding the lowest validation error. As soon as the importance weights are recomputed, the NLL increases. Overfitting occurs very quickly and the best validation error is the same for all values of T (the confidence intervals represent 3 standard deviations).

T      Z      Test error ± 3σ (%)
1000   1e5    21.83 ± 0.03
100    1e6    21.83 ± 0.03
10     1e7    21.82 ± 0.03
1      1e8    21.82 ± 0.03

Table 2: Test error of the models reaching the best validation error for various values of T on the alpha dataset. Overfitting occurs very quickly and, as a result, all values of T lead to the same result as the standard log-loss.

3.3 MNIST DATASET

The MNIST dataset is a digit recognition dataset with 70000 samples. The first 60000 were used for the cross-validation and the last 10000 for testing. Inputs have dimension 784, but 67 of these dimensions are always equal to 0. Despite overfitting occurring quickly, values of T greater than 1 yield significant improvements over the log-loss. Training and validation curves are presented in Fig. 4 and the test classification error is listed in Table 3.

T      Z      Test error ± 3σ (%)
1000   1e5    7.00 ± 0.08
100    1e6    7.01 ± 0.05
10     1e7    6.97 ± 0.08
1      1e8    7.46 ± 0.11

Table 3: Test error of the models reaching the best validation error for various values of T on the MNIST dataset. The results for all values of T strictly greater than 1 are comparable and significantly better than for T = 1.

3.4 IJCNN DATASET

The IJCNN dataset has 191681 samples. The first 80% of the dataset were used for training and validation (70% for training, 10% for validation, using random splits), and the last 20% were used for testing. Inputs have dimension 23, which means we are likely to be in the underfitting regime. Indeed, larger values of T lead to significant improvements over the log-loss. Training and validation curves are presented in Fig. 5 and the test classification error is listed in Table 4.

Figure 4: Training (top) and validation (bottom) negative log-likelihood (left) and classification error (right) for the MNIST dataset. We only display the result for the value of λ yielding the lowest validation error. As soon as the importance weights are recomputed, the NLL increases. Overfitting occurs quickly but higher values of T still lead to lower validation error. The best training error was 2.52% with T = 10.

Figure 5: Training (top) and validation (bottom) negative log-likelihood (left) and classification error (right) for the IJCNN dataset. We only display the result for the value of λ yielding the lowest validation error. As soon as the importance weights are recomputed, the NLL increases. Since the number of training samples is large compared to the dimension of the input, the standard logistic regression underfits and higher values of T lead to better validation errors.
T      Z      Test error ± 3σ (%)
1000   1e5    4.62 ± 0.12
100    1e6    5.26 ± 0.33
10     1e7    5.87 ± 0.13
1      1e8    6.19 ± 0.12

Table 4: Test error of the models reaching the best validation error for various values of T on the IJCNN dataset. Larger values of T lead to significantly lower test errors.

4 SUPERVISED LEARNING AS POLICY OPTIMIZATION

We now propose an interpretation of supervised learning which closely matches that of direct policy optimization in reinforcement learning. This allows us to naturally address common issues in the literature, such as optimizing ROC curves or allowing a classifier to withhold taking a decision.

A machine learning algorithm is often only one component of a larger system whose role is to make decisions, whether it is choosing which ad to display or deciding if a patient needs a specific treatment. Some of these systems also involve humans. Such systems are complex to optimize and it is often appealing to split them into smaller components which are optimized independently. However, such splits might lead to poor decisions, even when each component is carefully optimized (Bottou). This issue can be alleviated by making each component optimize the full system with respect to its own parameters. Doing so requires taking into account the reaction of the other components in the system to the changes made, which cannot in general be modeled. However, one may cast the problem as a reinforcement learning problem where the environment is represented by everything outside of our component, including the other components of the system (Bottou et al., 2013).

Pushing the analogy further, we see that, in one-step policy learning, we try to find a policy p(y|X, θ) over actions y given the state X[b] which minimizes the expected loss
$$L(\theta) = -\frac{1}{N}\sum_i \sum_y R(y, X_i)\, p(y|X_i, \theta) . \quad (4.1)$$
This L(θ) is equivalent (up to an additive constant) to L(θ) from Eq. 1.3 when all actions have a reward of 0 except for the action choosing the correct class y_i, which yields R(y_i, X_i) = 1. One major difference between policy learning and supervised learning is that, in policy learning, we only observe the reward for the actions we have taken, while in supervised learning the reward for all the actions is known.

[b] In standard policy learning, we actually consider full rollouts, which include not only actions but also state changes due to these actions.
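As a small illustration of Eq. 4.1, the sketch below treats the loss as a generic expectation over a reward matrix, with classification recovered as the special case where only the correct class has nonzero reward. The softmax parameterization and all names are illustrative assumptions, not the paper's:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def expected_loss(theta, X, R):
    """Eq. 4.1 as a function: L(theta) = -(1/N) sum_i sum_y R[i, y] p(y|X_i, theta).
    R is an (N, K) reward matrix; R[i, y] is the reward for predicting class y on X_i."""
    P = softmax(X @ theta)                  # p(y|X_i, theta), shape (N, K)
    return -np.mean(np.sum(R * P, axis=1))

# Classification as the special case: reward 1 for the correct class, 0 elsewhere.
rng = np.random.default_rng(0)
N, d, K = 200, 5, 3
X = rng.normal(size=(N, d))
y = rng.integers(0, K, size=N)
R = np.eye(K)[y]                            # R[i, y_i] = 1, all other rewards 0
theta = np.zeros((d, K))
print(expected_loss(theta, X, R))           # -1/K under uniform predictions
```

Changing R, for instance by assigning asymmetric costs to the two classes or adding an extra action, changes the objective without changing the machinery, which is the point exploited in the remainder of this section.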
Casting the classification problem as a specific policy learning problem yields a loss function commensurate with a reward. In particular, it allows us to make explicit the rewards associated with each decision, which was difficult with Eq. 1.1. We will now review several possibilities opened by this formulation.

OPTIMIZING THE ROC CURVE

In some scenarios, we might be interested in performance metrics other than the average classification error. In search advertising, for instance, we are often interested in maximizing the precision at a given recall. Mozer et al. (2001) address the problem by emphasizing the training points whose output is within a certain interval. Gasso et al. (2011) and Parambath et al. (2014), on the other hand, assign different costs to type I and type II errors, learning which values lead to the desired false positive rate. Finally, Bach et al. (2006) propose a procedure to efficiently find the optimal solution for all costs in the context of SVMs and show that the resulting models are not the optimal models in the class.

To test the impact of optimizing the probabilities rather than a surrogate loss, we reproduced the binary problem of Bach et al. (2006). We computed the average training and testing performance over 10 splits. An example of the training set and the results are presented in Fig. 6.

Figure 6: Training data (left) and test ROC curve (right) for the binary classification problem from Bach et al. (2006). The black dots are obtained when minimizing the log-loss for various values of the cost asymmetry. The red stars correspond to the ROC curve obtained when directly optimizing the probabilities. While the former is not concave, a problem already mentioned by Bach et al. (2006), the latter is.

Even though working directly with the probabilities solved the non-concavity issue, we still had to explore all possible cost asymmetries to draw this curve. In particular, if we had been asked to maximize the true positive rate at a given false positive rate, we would have needed to draw the whole curve and then find the appropriate point. However, expressing the loss directly as a function of the probabilities of choosing each class allows us to cast this requirement as a constraint and to solve the following constrained optimization problem:
$$\theta^* = \arg\min_\theta \; -\frac{1}{N_1}\sum_{i : y_i = 1} p(1|x_i, \theta) \quad \text{such that} \quad \frac{1}{N_0}\sum_{i : y_i = 0} p(1|x_i, \theta) \leq c_{FP} ,$$
with N_0 (resp. N_1) the number of examples belonging to class 0 (resp. class 1). Since $p(1|x_i, \theta) = 1 - p(0|x_i, \theta)$, we can solve the following Lagrangian problem:
$$\min_\theta \max_{\lambda \geq 0} L(\theta, \lambda) = \min_\theta \max_{\lambda \geq 0} \; -\frac{1}{N_1}\sum_{i : y_i = 1} p(1|x_i, \theta) + \lambda\left(1 - c_{FP} - \frac{1}{N_0}\sum_{i : y_i = 0} p(0|x_i, \theta)\right) .$$
This is the approach proposed by Mozer et al. (2001), who then minimize this function directly. We can however replace L(θ, λ) with the following upper bound:
$$-\frac{1}{N_1}\sum_{i : y_i = 1} p(1|x_i, \nu)\left(1 + \log\frac{p(1|x_i, \theta)}{p(1|x_i, \nu)}\right) + \lambda\left(1 - c_{FP} - \frac{1}{N_0}\sum_{i : y_i = 0} p(0|x_i, \nu)\left(1 + \log\frac{p(0|x_i, \theta)}{p(0|x_i, \nu)}\right)\right)$$
and jointly optimize over θ and λ. Even though the constraint is expressed on the upper bound, and thus will not be exactly satisfied during the optimization, the increasing tightness of the bound as the optimization converges leads to a satisfied constraint at the end of the optimization. We show in Fig. 7 the obtained false positive rate as a function of the required false positive rate and see that the constraint is close to being perfectly satisfied. One must note, however, that the ROC curve obtained using the constrained optimization problem matches that of T = 1, i.e. it is not concave. We do not have an explanation as to why the behaviour differs between solving the constrained optimization problem and optimizing an asymmetric cost for all values of the asymmetry.

Figure 7: Test false positive rate as a function of the desired false positive rate c_FP. The dotted line represents the optimal behaviour; we can see that the constraint is close to being satisfied. T = 10 was used.
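One possible implementation of the constrained problem above, assuming a binary logistic model, is to alternate gradient descent on θ (on the reweighted bound) with projected gradient ascent on λ. The sketch below is ours, not the paper's: it simplifies the multiplier update by ascending on the observed constraint violation, and all step sizes and names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def constrained_fpr_fit(X, y, c_fp, T=10, inner_steps=500, lr=0.1, lr_lam=0.1):
    """Sketch: maximize the true positive rate subject to FPR <= c_fp by
    descending theta on the reweighted upper bound and ascending lambda."""
    n, d = X.shape
    pos, neg = (y == 1), (y == 0)
    N1, N0 = pos.sum(), neg.sum()
    theta, lam = np.zeros(d), 0.0
    for t in range(T):
        p1 = sigmoid(X @ theta)            # p(1|x, theta_t)
        w1 = p1[pos]                       # weights p(1|x, theta_t) on positives
        w0 = (1.0 - p1)[neg]               # weights p(0|x, theta_t) on negatives
        for _ in range(inner_steps):
            p1 = sigmoid(X @ theta)
            # Gradient of the bound w.r.t. theta:
            #   -(1/N1) sum_{y=1} w1 * d log p(1|x)/dtheta
            #   -(lam/N0) sum_{y=0} w0 * d log p(0|x)/dtheta
            g = -(X[pos].T @ (w1 * (1 - p1[pos]))) / N1 \
                + lam * (X[neg].T @ (w0 * p1[neg])) / N0
            theta -= lr * g
            # Heuristic ascent on lambda: increase it while the FPR exceeds c_fp.
            fpr = p1[neg].mean()
            lam = max(0.0, lam + lr_lam * (fpr - c_fp))
    return theta, lam
```

The multiplier λ thus grows only while the false positive rate exceeds its target, which is one simple way to realize the joint optimization over θ and λ described above.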
ALLOWING UNCERTAINTY IN THE DECISION

Let us consider a cancer detection algorithm which would automatically classify patients into two categories: healthy or ill. In practice, this algorithm will not be completely accurate and, given the high price of a misclassification, we would like to include the possibility for the algorithm to hand the decision over to the practitioner. In other words, it needs to include the possibility of being "Undecided".

The standard way of handling this situation is to manually set a threshold on the output of the classifier and, should the maximum score across all classes be below that threshold, deem the example too hard to classify. However, it is generally not obvious how to set the value of that threshold nor how it relates to the quantity we care about, even though some authors have provided guidelines (?). The difficulty is heightened when the prior probabilities of each class are very different.

Eq. 4.1 allows us to naturally include an extra action, the "Undecided" action, which has its own reward. This reward should be equal to the reward of choosing the correct class (i.e., 1) minus the cost c_h of resorting to external intervention[c]; c_h is less than 1 since we would otherwise rather have an error than be undecided. Let us denote by r_h = 1 − c_h the reward obtained when the model chooses the "Undecided" class. Then, the reward obtained for input X_i is
$$R(y_i|X_i) = 1 , \qquad R(\text{Undecided}|X_i) = r_h ,$$
and the average reward under the policy is $p(y_i|X_i, \theta) + r_h\, p(\text{Undecided}|X_i, \theta)$. Learning this model on a training set is equivalent to minimizing the following quantity:
$$\theta^* = \arg\min_\theta \; -\frac{1}{N}\sum_i \bigl(p(y_i|X_i, \theta) + r_h\, p(\text{Undecided}|X_i, \theta)\bigr) . \quad (4.2)$$
In effect, for each training example, we have added another example with importance weight r_h and class "Undecided". If we were to solve this problem through a minimization of the log-loss, it is well known that the optimal solution would be, for each example X_i, to predict y_i with probability 1/(1 + r_h) and "Undecided" with probability r_h/(1 + r_h). However, when optimizing the weighted sum of probabilities, the optimal solution is still to predict y_i with probability 1. In other words, adding the "Undecided" class does not change the model if it has enough capacity to learn the training set accurately.

[c] This assumes that the external intervention always leads to the correct decision. Any other setting can easily be used.
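A hedged sketch of Eq. 4.2 for a softmax model with K real classes plus an extra "Undecided" output: each example carries weight 1 on its true class and weight r_h on "Undecided", and the objective is optimized with the reweighted log-loss of Algorithm 1. Names and defaults are illustrative, not from the paper:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_with_undecided(X, y, K, r_h=0.7, T=10, inner_steps=200, lr=0.1):
    """Sketch of Eq. 4.2: K real classes plus one 'Undecided' class (index K).
    Rewards: 1 on the true class, r_h on 'Undecided'; Algorithm 1 turns the
    objective into a sequence of reweighted log-losses."""
    N, d = X.shape
    theta = np.zeros((d, K + 1))
    R = np.zeros((N, K + 1))            # reward matrix: R[i, y_i] = 1, R[i, K] = r_h
    R[np.arange(N), y] = 1.0
    R[:, K] = r_h
    for t in range(T):
        W = R * softmax(X @ theta)      # importance weights R[i, y] * p(y|X_i, theta_t)
        for _ in range(inner_steps):
            P = softmax(X @ theta)
            # Gradient of -(1/N) sum_i sum_y W[i, y] log p(y|X_i, theta).
            grad = X.T @ (P * W.sum(axis=1, keepdims=True) - W) / N
            theta -= lr * grad
    return theta
```

Consistent with the remark above, since r_h < 1 the weighted objective still favors putting all the probability mass on the true class whenever the model can fit an example, so the "Undecided" output only absorbs mass on examples the model cannot classify correctly.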
5 DISCUSSION AND CONCLUSION

Using a general class of upper bounds of the expected classification error, we showed how a sequence of minimizations could lead to reduced classification error rates. However, many questions remain to be answered. As using T > 1 increases overfitting, one might wonder whether the standard regularizers are still adapted. Also, current state-of-the-art models, especially in image classification, already use strong regularizers such as dropout. The question remains whether using T > 1 with these models would lead to an improvement.

Additionally, it makes less and less sense to think of machine learning models in isolation. They are increasingly often part of larger systems, and one must think of the proper way of optimizing them in this setting. The modification proposed here leads to an explicit formulation of the true impact of a classifier, which facilitates its optimization in the context of a larger production system where additional costs and constraints may be readily incorporated. We believe this is a critical avenue of research to be explored further.

ACKNOWLEDGMENTS

We thank Francis Bach, Léon Bottou, Guillaume Obozinski, and Vianney Perchet for helpful discussions.

REFERENCES

Francis R. Bach, David Heckerman, and Eric Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7:1713–1741, 2006.

Léon Bottou. Two high stakes challenges in machine learning. http://videolectures.net/icml2015_bottou_machine_learning/.

Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Ronan Collobert, Samy Bengio, and Yoshua Bengio. A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5):1105–1114, 2002.

Ronan Collobert, Fabian Sinz, Jason Weston, and Léon Bottou. Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning, pp. 201–208. ACM, 2006.

Şeyda Ertekin, Léon Bottou, and C. Lee Giles. Nonconvex online support vector machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):368–381, 2011.

Gilles Gasso, Aristidis Pappaioannou, Marina Spivak, and Léon Bottou. Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology, 2(3):28, 2011.

Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663–2671, 2012.

Michael C. Mozer, Robert H. Dodier, Michael D. Colagrosso, César Guerra-Salcedo, and Richard H. Wolniewicz. Prodding the ROC curve: Constrained optimization of classifier performance. In NIPS, pp. 1409–1415, 2001.

Shameem Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing F-measures by cost-sensitive classification. In Advances in Neural Information Processing Systems, pp. 2123–2131, 2014.

Linli Xu, Koby Crammer, and Dale Schuurmans. Robust support vector machine training via convex outlier ablation. In AAAI, volume 6, pp. 536–542, 2006.

Alan L. Yuille and Anand Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.