# minimax_risk_classifiers_with_01_loss__8731037f.pdf

Journal of Machine Learning Research 24 (2023) 1-48 Submitted 4/22; Revised 2/23; Published 7/23

Minimax Risk Classiﬁers with 0 -1 Loss

Santiago Mazuelas smazuelas@bcamath.org Basque Center for Applied Mathematics (BCAM) IKERBASQUE-Basque Foundation for Science Bilbao 48009, Spain

Mauricio Romero msicre@ufba.br Federal University of Bahia Ondina 40170, Brazil

Peter Gr unwald Peter.Grunwald@cwi.nl National Research Institute for Mathematics and Computer Science (CWI) Amsterdam 94079, Netherlands

Editor: Sivan Sabato

Supervised classiﬁcation techniques use training samples to learn a classiﬁcation rule with small expected 0 -1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0 -1 loss and considering speciﬁc families of rules (hypothesis classes). This paper presents minimax risk classiﬁers (MRCs) that minimize the worst-case 0 -1 loss with respect to uncertainty sets of distributions that can include the underlying distribution, with a tunable conﬁdence. We show that MRCs can provide tight performance guarantees at learning and are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes eﬃcient optimization techniques for MRC learning and shows that the methods presented can provide accurate classiﬁcation together with tight performance guarantees in practice.

Keywords: Supervised Classiﬁcation, Robust Risk Minimization, Performance Guarantees, Generalized Maximum Entropy

1. Introduction

Supervised classiﬁcation techniques use training samples to learn a classiﬁcation rule that assigns labels to instances with small expected 0 -1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0 -1 loss and considering speciﬁc families of rules (hypothesis classes). The surrogate losses are usually taken to be convex upper bounds of the 0 -1 loss such as hinge loss, logistic loss, and exponential loss, see e.g., Bartlett et al. (2006). The families of rules considered are usually given by parametric functions such as those deﬁned by neural networks (NNs) and those belonging to reproducing kernel Hilbert spaces (RKHSs), see e.g., Shalev-Shwartz and Ben-David (2014). Such techniques can result in strongly universally consistent methods using surrogate losses that are classiﬁcation calibrated and Lipschitz (Bartlett et al., 2006; Tewari and Bartlett, 2007) together with rich families of rules such as

c 2023 Santiago Mazuelas, Mauricio Romero, and Peter Gr unwald.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0339.html.

Mazuelas, Romero, and Gr unwald

those given by functions in RKHSs corresponding with universal kernels (Micchelli et al., 2006; Steinwart, 2005). Most learning methods are based on the empirical risk minimization (ERM) approach that minimizes the empirical expected loss of training samples (see e.g., Vapnik (1998); Mohri et al. (2018); Shalev-Shwartz and Ben-David (2014)). Other methods are based on the robust risk minimization (RRM) approach that minimizes the worst-case expected loss with respect to an uncertainty set of distributions (see e.g., Asif et al. (2015); Shaﬁeezadeh-Abadeh et al. (2019); Duchi and Namkoong (2019)). These methods correspond with generalized maximum entropy techniques as shown in Mazuelas et al. (2022) using the theoretical framework in Gr unwald and Dawid (2004). RRM methods mainly diﬀer in the type of uncertainty set considered. These sets are determined by constraints over probability distributions given in terms of metrics such as f-divergence (Duchi and Namkoong, 2019), Wasserstein distances (Shaﬁeezadeh-Abadeh et al., 2019), moments ﬁts (Asif et al., 2015), and maximum mean discrepancies (Staib and Jegelka, 2019). In addition, the uncertainty sets considered often only include probability distributions with instances marginal that coincides with the empirical marginal of training samples (Asif et al., 2015; Farnia and Tse, 2016; Fathony et al., 2016; Cortes et al., 2015). An important advantage of RRM methods is that the function minimized at learning can be an upper bound for the expected loss if the true underlying distribution is included in the uncertainty set considered. In such cases, RRM techniques automatically ensure out-ofsample generalization and provide performance guarantees at learning. The uncertainty sets considered by existing techniques can include the true underlying distribution in methods that address a transductive setting (Balsubramani and Freund, 2015, 2016) or utilize Wasserstein distances (Shaﬁeezadeh-Abadeh et al., 2019; Lee and Raginsky, 2018; Frogner et al., 2021) and maximum mean discrepancies (Staib and Jegelka, 2019). However, uncertainty sets formed by distributions with instances marginal that coincides with the empirical do not include the true underlying distribution for ﬁnite sets of training samples. The original 0 -1 loss is utilized by certain RRM techniques (Asif et al., 2015; Farnia and Tse, 2016; Fathony et al., 2016; Balsubramani and Freund, 2015) which provided inspiration for the present work. Most of these pioneering techniques considered uncertainty sets of distributions with instances marginal that coincides with the empirical of training samples, and therefore they do not provide tight performance guarantees at learning. Such performance guarantees are provided by PAC-Bayes methods (Ambroladze et al., 2007; Germain et al., 2015; Mhammedi et al., 2019) and RRM techniques in transductive settings (Balsubramani and Freund, 2015) or based on Wasserstein distances (Shaﬁeezadeh-Abadeh et al., 2019; Lee and Raginsky, 2018). PAC-Bayes methods consider speciﬁc classiﬁcation rules such as margin-based and ensemble-based classiﬁers. In addition, the tightness of their performance bounds relies on that of multiple inequalities including Jensen s, Markov s, and Donsker-Varadhan s change of measure (B egin et al., 2016). Wasserstein-based RRM techniques utilize surrogate losses and consider speciﬁc families of rules. In addition, the tightness of their performance bounds relies heavily on the adequacy of the Wasserstein radius used (Shaﬁeezadeh-Abadeh et al., 2019; Frogner et al., 2021). This paper presents minimax risk classiﬁers (MRCs) that minimize the worst-case 0 -1 loss over general classiﬁcation rules and provide tight performance bounds at learning. Speciﬁcally, the main results presented in the paper are as follows.

Minimax Risk Classifiers with 0 -1 Loss

We develop learning methods that obtain classiﬁcation rules with the smallest worstcase expected 0 -1 loss with respect to uncertainty sets that can include the underlying distribution, with a tunable conﬁdence (Section 2.2). In addition, we detail how to determine uncertainty sets from training data by estimating the expectation of a feature mapping and obtaining conﬁdence vectors for such estimates (Section 3).

We characterize the performance guarantees of MRCs in terms of tight upper and lower bounds for the error probability and also in terms of generalization bounds with respect to the smallest minimax risk (Sections 4.1 and 4.2). In addition, we show that MRCs are strongly universally consistent using feature mappings given by characteristic kernels (Section 4.3).

We present eﬃcient optimization techniques for MRC learning that obtain the classiﬁers parameters and the tight performance bounds using reduced sets of instances and eﬃcient accelerated subgradient methods (Section 5). In addition, we quantify MRCs performance with respect to existing techniques and show the suitability of the performance bounds presented (Section 6).

Some of the results presented in this paper have appeared before in Mazuelas et al. (2020). The main new results presented in this paper include: extension to instances sets that are general Borel subsets; analysis of the generalization capabilities of the approach presented even in cases where the underlying distribution is not included in the uncertainty set considered; universal consistency of the methods proposed using rich feature mappings; and eﬃcient optimization techniques based on accelerated subgradient methods. Speciﬁcally, the proofs of the theoretical results are extended to allow for inﬁnite sets of instances instead of ﬁnite sets by using Fenchel duality instead of Lagrange duality. The MRCs generalization bounds are extended in new Theorem 7 to cases when the uncertainty set does not include the underlying distribution, and new Theorem 8 shows the universal consistency of MRCs that use feature mappings given by characteristic kernels. In addition, new Theorems 9 and 10 show that MRC learning can be eﬃciently addressed using reduced sets of instances and subgradient methods that exploit the speciﬁc structure of the optimization problems in MRC learning. Notation: calligraphic upper case letters denote sets, e.g., Z; |Z| denotes de cardinality of set Z; vectors and matrices are denoted by bold lower and upper case letters, respectively, e.g., v and M; for a vector v, v , v 1, and v denote its L2-, L1-, and L-inﬁnity norms, respectively, v T denotes its transpose, v(i) denotes its i-th component, |v| and v+ denote the vector given by the component-wise absolute value and positive part of v, respectively, and diag(v) denotes the diagonal matrix with diagonal given by v; probability distributions and classiﬁcation rules are denoted by upright fonts, e.g., p and h; Ez p or simply Ep denotes the expectation w.r.t. probability distribution p of random variable z; for a probability distribution p over X Y, we denote by px, py its corresponding marginals over X and Y, respectively, and by px|y the corresponding conditional distribution given y Y; and denote vector (component-wise) inequalities; 1 denotes a vector with all components equal to 1 and I{ } denotes the indicator function; ﬁnally, for a function of two variables f : X Y Z, f(x, ) denotes the function f(x, ) : Y Z obtained by ﬁxing x X.

Mazuelas, Romero, and Gr unwald

2. Minimax risk classiﬁers

This section ﬁrst states the problem of supervised classiﬁcation and summarizes diﬀerent learning approaches, then we describe MRCs with 0 -1 loss and their relationship with existing techniques.

2.1 Problem formulation and learning approaches

Classiﬁcation techniques assign instances in a set X to labels in a set Y, where X is a Borel subset of Rd and Y is a ﬁnite set represented by {1, 2, . . . , |Y|}. We denote by (X Y) the set of Borel probability measures p on X Y, also referred to as probability distributions. Both randomized and deterministic classiﬁcation rules are given by functions from instances to probability distributions on labels (Markov transitions). We denote the set of all classiﬁcation rules by T(X, Y), and for h T(X, Y) we denote by h(y|x) the probability with which instance x X is classiﬁed by label y Y (h(y|x) {0, 1} if labels are deterministically assigned to instances). Supervised classiﬁcation techniques use n training instance-label pairs (x1, y1), (x2, y2), . . . , (xn, yn) from the underlying distribution p to ﬁnd classiﬁcation rules that assign labels to instances with small error probability. If p (X Y) is the true underlying distribution of instance-label pairs, the error probability of a classiﬁcation rule h T(X, Y) is its expected 0 -1 loss denoted by R(h), that is

R(h) = Ep {ℓ(h, (x, y))}

with ℓ(h, (x, y)) = 1 h(y|x) the 0 -1 loss of rule h at instance-label pair (x, y). In the following, we overload the notation for the loss function and denote the expected loss of h with respect to probability distribution p as ℓ(h, p) := Ep{ℓ(h, (x, y))}. The optimal classiﬁcation rule is the solution of the optimization problem

P : inf h T(X,Y) ℓ(h, p ). (1)

This solution is known as Bayes classiﬁcation rule and the minimum value above is known as Bayes risk. Optimization problem P cannot be addressed in practice since the underlying distribution p is unknown and only training samples (x1, y1), (x2, y2), . . . , (xn, yn) from p

are available. Existing supervised classiﬁcation techniques can be interpreted as approximations of P

by an optimization problem of the form

P : inf h F

sup p U L(h, p) (2)

for F a family of classiﬁcation rules, U an uncertainty set of distributions, and L a surrogate loss function. The ERM approach considers uncertainty sets of distributions that contain only the empirical distribution of training samples. In such an approach, the approximation of P

by P is controlled by the choice of the family F T(X, Y) through a bias-complexity trade-oﬀaddressed by structural risk minimization (SRM) (Vapnik, 1998). Families of rules reduced enough to provide uniform convergence of empirical averages can ensure that

Minimax Risk Classifiers with 0 -1 Loss

the objective function in P uniformly approximates that of P , i.e., empirically averaged losses accurately approximate expected losses for all the rules considered. If the family of rules is also general enough to contain a classiﬁcation rule near the Bayes rule, then the minimum value of P is similar to the minimum value of P and the ERM approach leads to near-optimal performance. This trade-oﬀfor the generality of the family of rules considered is usually controlled by selecting a parameter that determines the size of the family of rules, e.g., the radius of the RKHS ball of functions that deﬁnes the family of rules. The RRM approach considers uncertainty sets of distributions determined by constraints obtained from training samples. In such an approach, the approximation of P by P is controlled by the choice of the uncertainty set U (X Y). Uncertainty sets general enough to contain the underlying distribution can ensure that the objective function in P upper bounds that of P , i.e., the worst-case expected loss upper bounds the actual expected loss for any classiﬁcation rule. If the uncertainty set is also reduced enough to provide a tight upper bound, then the minimum value of P is similar to the minimum value of P and the RRM approach leads to near-optimal performance. This trade-oﬀfor the generality of the uncertainty set considered is usually controlled by selecting a parameter that determines the size of the uncertainty set, e.g., the radius of the Wasserstein ball that deﬁnes the uncertainty set. An important advantage of the RRM approach with respect to ERM is that the former does not require to constrain the classiﬁcation rules considered in order to provide provable out-of-sample generalization. Such property is due to the fact that the optimum of P for RRM upper bounds the expected loss with respect to any distribution in the uncertainty set, independently of the family of rules considered. On the other hand, the optimum of P for ERM is ensured to be near the out-of-sample risk only if the family of classiﬁcation rules considered provides uniform convergence of empirical averages. Optimization problem P not only has to reliably approximate P but also has to be tractable computationally. Such tractability is commonly achieved by 1) considering families of classiﬁcation rules determined by certain parameters (e.g., weights of neurons in NNs or coeﬃcients of linear classiﬁers), and 2) substituting the 0 -1 loss ℓby a surrogate loss L (e.g., hinge-loss or logistic-loss). The usage of non-parametric classiﬁcation rules can lead to huge-scale optimization problems (a general rule h T(X, Y) is determined by |X||Y| values), and the minimization of 0 -1 loss is often NP-hard (see e.g., Ben-David et al. (2003); Feldman et al. (2012)). In the following we show how the proposed MRCs approximate the optimization problem P in (1) by an optimization problem of the form

PMRC : inf h T(X,Y)

sup p U ℓ(h, p) (3)

for an uncertainty set U that can include the underlying distribution with a tunable conﬁdence. MRCs do not rely on a choice of surrogate loss and family of rules. The only change in PMRC with respect to P consists on using an uncertainty set U instead of the underlying distribution p . Certain previous RRMs methods also address an optimization problem of the form (3) (Asif et al., 2015; Fathony et al., 2016). However, the uncertainty sets considered by such methods only include distributions with instances marginals that coincide with the

Mazuelas, Romero, and Gr unwald

empirical. With such an additional constraint for the uncertainty set, these approaches become equivalent to ERM with a surrogate loss referred to as adversarial zero-one loss. Therefore, their out-of-sample generalization properties are akin to those of other ERM methods and cannot exploit the RRM s beneﬁts of approximating P by an upper bound.

2.2 MRCs with 0 -1 loss

The classiﬁcation loss ℓ(h, (x, y)) of rule h at instance-label pair (x, y) quantiﬁes the loss of the rule evaluated at instance x X when the label is y Y. In this paper we consider 0 -1 loss that is given by ℓ(h, (x, y)) = 1 h(y|x), while MRCs for general loss functions are described in (Mazuelas et al., 2022). The usage of 0 -1 loss is specially suitable for discriminative approaches since it quantiﬁes the classiﬁcation error, while other loss functions such as logistic loss can be more suitable for conditional probability estimation since they score probability assessments. Speciﬁcally, if p is the true underlying distribution of instance-label pairs, the expected 0 -1 loss ℓ(h, p ), also referred to as the risk R(h), coincides with the error probability of classiﬁcation rule h. The proposed MRCs minimize the worst-case expected 0 -1 loss with respect to distributions in uncertainty sets that can contain the true underlying distribution with a tunable conﬁdence. These uncertainty sets are given by constraints on the expectations of a vector-valued bounded and Borel measurable function Φ : X Y Rm referred to as feature mapping, e.g., multiple polynomials on x and y or one-hot encodings of the last layers in an NN. Such mappings are commonly used in machine learning to represent instance-label pairs as real vectors (see e.g., Mohri et al. (2018); Bengio et al. (2013) and next Section 3.1). In the following we consider uncertainty sets

U = n p (X, Y) : |Ep{Φ} τ| λ o (4)

given by a feature mapping Φ together with mean and conﬁdence vectors τ Rm and λ Rm, and satisfying one of the following regularity conditions.

R1 The set X is ﬁnite and there exists a probability measure p in U.

R2 There exists a probability measure p in U such that |Ep{Φ(x, y)(i)} τ (i)| < λ(i) for any i {1, 2, . . . , m} with λ(i) > 0 and

R2.1 λ(i) > 0 for all i {1, 2, . . . , m} or R2.2 the support of the r.v. Φ(x, y) for (x, y) p is not contained in a proper aﬃne subspace of Rm.

These regularity conditions are utilized to ensure strong duality holds for the inner maximization in the minimax problems, and they are satisﬁed with wide generality. For instance, such conditions are satisﬁed if there exists a probability measure p such that Ep{Φ} = τ and {Φ(xi, yi), i = 1, 2, . . . , n} is not contained in a proper subspace of Rm. In addition, if those n points are contained in a proper aﬃne subspace Γ Rm with dim Γ = m < m the feature mapping could be easily modiﬁed so that R2.2 is satisﬁed for instance by using

Φ (x, y) = (Φ(x, y) z0)Tv1, (Φ(x, y) z0)Tv2, . . . , (Φ(x, y) z0)Tvm Rm

Minimax Risk Classifiers with 0 -1 Loss

where Γ = {z0 + α1v1 + α2v2 + . . . + αm vm ; α1, α2, . . . , αm R}. The mean vector τ Rm in (4) is an estimate of the feature mapping expectation Ep {Φ} with respect to the underlying distribution. In this paper, we consider expectation estimates obtained as the sample average

i=1 Φ(xi, yi) (5)

obtained from the n training samples (x1, y1), (x2, y2), . . . , (xn, yn) that are assumed to be independent samples from the underlying distribution p . Nevertheless, it is important to note that most of the results presented in the paper as well as the general methodology proposed can be utilized with general types of expectation estimates. Alternative estimators for the expectation can be preferred in several practical scenarios including cases with diﬀerent distributions at training and test (Mohri and Medina, 2012; Mazuelas and Perez, 2020; Alvarez et al., 2022) and cases where the distribution of features has heavy tails (Lugosi and Mendelson, 2019; Hsu and Sabato, 2016). The conﬁdence vector λ Rm in (4) is an estimate of the mean vector component-wise accuracy |Ep {Φ} τ|, and controls the size of the uncertainty set considered. In particular, if λ corresponds to the length of conﬁdence intervals at level 1 δ centered at τ, then the true underlying distribution is included in the uncertainty set (4) with probability at least 1 δ. Section 3.2 describes how to choose such conﬁdence vectors for diﬀerent feature mappings. Classiﬁcation rules that minimize the worst-case error probability over uncertainty sets U given by (4) are referred to as MRCs.

Deﬁnition 1 We say that a classiﬁcation rule h U is a 0 -1 MRC for uncertainty set U if

h U arg inf h T(X,Y)

sup p U ℓ(h, p)

and we denote by R(U) the minimax risk against U, i.e.,

R(U) = inf h T(X,Y)

sup p U ℓ(h, p).

The following result shows how 0 -1 MRCs can be determined by a linear-aﬃne combination of the feature mapping. The coeﬃcients of such combination can be obtained at learning by solving the convex optimization problem

Pτ,λ : min µ 1 τ Tµ + ϕ(µ) + λT|µ| (6)

ϕ(µ) = sup x X,C Y

P y C Φ(x, y)Tµ 1

1. Here, as in the sequel, we implicitly assume that the set C Y over which we take the maximum excludes the empty set.

Mazuelas, Romero, and Gr unwald

Theorem 2 Let U be an uncertainty set as in (4) that satisﬁes R1 or R2 and µ be a solution of optimization problem Pτ,λ. If a classiﬁcation rule h U T(X, Y) satisﬁes

h U(y|x) Φ(x, y)Tµ ϕ(µ ), x X, y Y (8)

then h U is a 0 -1 MRC for U. In addition, we have that

R(U) = 1 τ Tµ + ϕ(µ ) + λT|µ |. (9)

Proof See Appendix B.

A classiﬁcation rule satisfying (8) always exists because for every x X the sum over y Y of the positive part of the right hand side of (8) is not larger than one. Speciﬁcally, for any x X, the number

Φ(x, y)Tµ ϕ(µ )

is zero or equal to max C Y

y C Φ(x, y)Tµ |C|ϕ(µ )

that is not larger than one by deﬁnition of ϕ( ) in (7). The characterization of MRCs in terms of the inequality in (8) may seem counter-intuitive because for some pairs (x, y) such classiﬁcation rules are not uniquely determined. However, such pairs do not aﬀect the worst-case error probability. Speciﬁcally, the set of pairs (x, y) where (8) is not satisﬁed with equality has zero probability under a worst-case distribution in U because for p U U

inf h T(X,Y)

sup p U ℓ(h, p) = ℓ(h U, p U)

1 Z h U(y|x)dp U(x, y) = 1 τ Tµ + λT|µ | + ϕ(µ )

Z h U(y|x) Φ(x, y)Tµ ϕ(µ ) dp U(x, y) = τ Ep UΦ(x, y) Tµ λT|µ |

Z h U(y|x) Φ(x, y)Tµ ϕ(µ ) dp U(x, y) = 0.

since p U U τ Ep UΦ(x, y) Tµ λT|µ | 0. In order to avoid ambiguities, we will refer to 0 -1 MRC for uncertainty set U as the classiﬁcation rule h U obtained by normalizing the positive part of the right hand side of (8). Speciﬁcally, let for x X

h U(y|x) = (Φ(x, y)Tµ ϕ(µ ))+/cx if cx = 0 1/|Y| if cx = 0 (11)

with cx given by (10). Such classiﬁcation rule is univocally determined by µ and satisﬁes (8) because cx 1 for any x X. In addition, we denote by h U d the deterministic classiﬁcation rule corresponding to h U, that is

h U d(y|x) = I{y arg max h U( |x)} = I{y arg max Φ(x, )Tµ } (12)

Minimax Risk Classifiers with 0 -1 Loss

where a tie in the arg max above can be resolved arbitrarily. In the following, we will refer to such classiﬁcation rule h U d as the deterministic 0 -1 MRC for U. The above Theorem 2 provides a representer theorem for MRCs. The classical representer theorem for ERM over RKHSs (see e.g., Mohri et al. (2018); Evgeniou et al. (2000)) states that the solution of ERM over the set of classiﬁcation rules given by functions in an RKHS ball is determined by a linear combination of n functions corresponding with the training samples. Such result enables to address the minimization of empirical loss over an inﬁnitedimensional RKHS because it becomes equivalent to a minimization over n parameters. Analogously, Theorem 2 states that the solution of PMRC over the set of all classiﬁcation rules is determined by a linear-aﬃne combination of the m components of the feature mapping. Even if the optimization problem addressed by MRCs does not impose constraints on the classiﬁcation rules considered, the feature mapping utilized determines the parametric form of its solutions. Theorem 2 enables to address the minimization of the worst-case error probability over general classiﬁcation rules since it becomes equivalent to a minimization over the m parameters µ. As shown in Section 5 below, optimization (6) can be eﬃciently addressed in practice. In particular, Theorem 9 shows that the set X in (7) can be substituted by a reduced set of instances such as that formed by the instances at training. Note also that MRCs are often given by sparse combinations of the components of the feature mapping since the last term in optimization problem (6) imposes an L1-type regularization for parameters. L1-norm regularization is broadly used in machine learning (see e.g., Hastie et al. (2019); Mohri et al. (2018); Mol et al. (2009)) and the regularization parameter used to weight the L1-norm is commonly obtained by cross-validation methods. The result in Theorem 2 can directly provide appropriate regularization parameters from the length of the expectations interval estimates. In addition, the regularization term λT|µ| = Pm i=1 λ(i)|µ(i)| in (6) allows to penalize diﬀerently each component of µ. Such type of L1-norm regularization is usually referred to as adaptive or weighted, and has shown to signiﬁcantly improve performance (Zou, 2006; Cand es et al., 2008). For MRCs, the regularization term causes that feature components with poorly estimated expectations (i.e., components Φ(i) with large λ(i)) have a reduced or null inﬂuence on the ﬁnal classiﬁcation rule.

2.3 MRCs with 0 -1 loss and ﬁxed marginals

The following result shows how existing RRM methods that utilize 0-1 loss correspond to MRCs that use uncertainty sets of distributions with instances marginal given by the empirical distribution. In particular, such MRCs correspond to L1-regularized ERM with a loss referred to as adversarial zero-one in Fathony et al. (2016) or as minimax hinge in Farnia and Tse (2016). For the following result we consider uncertainty sets of the form

V = n p (X, Y) : |Ep{Φ} τ| λ and px = pn x o (13)

where pn x is the uniform distribution over instances with support {x1, x2, . . . , xn} X for n speciﬁc instances.

Mazuelas, Romero, and Gr unwald

Theorem 3 Let τ, λ Rm be such that the uncertainty set V in (13) is not empty. If µ

is a solution of the optimization problem

min µ 1 τ Tµ + 1

i=1 ϕ(µ, xi) + λT|µ| (14)

ϕ(µ, x) = max C Y

P y C Φ(x, y)Tµ 1

and h V is the classiﬁcation rule

h V(y|x) = Φ(x, y)Tµ ϕ(µ , x)

+, x X, y Y (15)

then, h V is a 0 -1 MRC for V, that is

h V arg inf h T(X,Y)

sup p V ℓ(h, p).

Proof See Appendix C.

If the instances {x1, x2, . . . , xn} are those obtained at training, the 0 -1 MRCs for uncertainty sets given by both expectations and marginals constrains as in (13) correspond to existing techniques known as maximum entropy machines (Farnia and Tse, 2016) or zero-one adversarial (Fathony et al., 2016). Speciﬁcally, if {(xi, yi)}n i=1 are training samples and τ is given by (5), then optimization problem (14) becomes

i=1 max C Y

P y C(Φ(xi, y) Φ(xi, yi))Tµ + |C| 1

|C| + λT|µ|

that corresponds to L1-regularized ERM with a loss referred to as minimax hinge in Farnia and Tse (2016) or as adversarial zero-one in Fathony et al. (2016). Uncertainty sets V given by (13) do not contain the true underlying distribution for a ﬁnite number of training samples since such sets only contain distributions with instances marginal that is uniform over the training instances. Hence, the usage of such uncertainty sets does not allow to obtain the performance guarantees shown in Section 4 below for uncertainty sets given by (4).

3. Uncertainty sets of distributions

As described above, MRC classiﬁcation rules have the smallest worst-case error probability over distributions in an uncertainty set. This set is determined by expectations estimates of a feature mapping. In this section we ﬁrst describe common feature mappings and then characterize the accuracy of expectation estimates obtained from training samples.

Minimax Risk Classifiers with 0 -1 Loss

3.1 Feature mappings

Most of the results presented in the paper are valid for general feature mappings Φ : X Y Rm. In this section we describe feature mappings that are common in supervised classiﬁcation and will be used later in the paper. Such feature mappings are given by real-valued functions over the set of instances ψ : X R, referred to as scalar features. The most common and simple way to deﬁne feature mappings over X and Y is to use multiple scalar features over X together with a one-hot encoding of the elements of Y (Tsochantaridis et al., 2005; Crammer et al., 2006; Mohri et al., 2018) as follows

Φ(x, y) = [I{y = 1}Ψ(x)T, I{y = 2}Ψ(x)T, . . . , I{y = |Y|}Ψ(x)T]T = ey Ψ(x) (16)

where Ψ(x) = [ψ1(x), ψ2(x), . . . , ψD(x)]T for D scalar features ψ1, ψ2, . . . , ψD, ey is the y-th vector in the standard basis of R|Y|, and denotes the Kronecker product. The feature mapping Φ represents each instance-label pair (x, y) by an m-dimensional real vector with m = |Y|D, so that Φ(x, y) is composed by |Y| D-dimensional blocks with values Ψ(x) in the block corresponding to y and zero otherwise. Other feature mappings can exploit relationships among classes as described for instance in Bakir et al. (2007); Caponnetto et al. (2008). Multiple types of scalar features are commonly used in supervised classiﬁcation including those given by thresholds/decision stumps (Lebanon and Laﬀerty, 2001), last layers in an NN (Bengio et al., 2013), and random features corresponding to a RKHS (Rahimi and Recht, 2008). Most of the results shown in this paper are valid for general features, while for those showing the universal consistency of MRCs we use random features that embed instances into rich feature spaces corresponding to RKHSs.

3.2 Conﬁdence vectors for expectation estimates

In this section, we describe conditions for conﬁdence vectors λ that lead to uncertainty sets given by (4) that contain the true underlying distribution with high probability. Speciﬁcally, we consider uncertainty sets U given by feature mappings deﬁned by (16) using scalar features in a family F together with mean vectors obtained as in (5) using the sample average of n independent samples from the underlying distribution p . We denote by λδ any conﬁdence vector that has coverage probability at least 1 δ, i.e., P{|Ep {Φ} τ| λδ} 1 δ, so that the underlying distribution p is included with probability at least 1 δ in the uncertainty set given by λδ. The following result describes such conﬁdence vectors in terms of the cardinality |F|, the empirical variance of the feature mapping, and the Rademacher complexity Rn(F). In most of the results in the paper, we utilize feature mappings given by scalar features chosen before the training samples are observed. In such cases, appropriate conﬁdence vectors are given in terms of the cardinality |F|, i.e., the number of scalar features used. However, the methods proposed can also be used with data-dependent feature mappings. In such cases, appropriate conﬁdence vectors are given in terms of the Rademacher complexity Rn(F) where F is the family of candidate scalar features. For instance, F can be family of all decision stumps and the feature mapping can be deﬁned using the decision stumps found by one-dimensional decision trees learned using the training samples, as in Mazuelas et al. (2020).

Mazuelas, Romero, and Gr unwald

Theorem 4 Let the mean vector be given by τ = τ n as in (5) for n training samples, and F be a family of bounded scalar features, that is |ψ(x)| C for all ψ F and x X. If |F| < , λ(i) δ for i = 1, 2, . . . , m can be taken as

2 log 2|F||Y|/δ

In addition, if υ(i) n is the sample variance of the i-th component of Φ, λ(i) δ for i = 1, 2, . . . , m can be taken as

λ(i) δ = 2C

2υ(i) n log 4|F||Y|/δ

n + 14C log 4|F||Y|/δ

3(n 1) . (18)

If the Rademacher complexity of F satisﬁes Rn(F) R/ n, λ(i) δ for i = 1, 2, . . . , m can be taken as

where j(i) denotes the label for which the i-th component of Φ is non-zero, and nj denotes the number of samples with label j Y.

Proof See Appendix D.

The previous result shows how to obtain uncertainty sets U as in (4) that contain the true underlying distribution with high probability. Data-based tight conﬁdence vectors can be obtained using sample standard deviations as shown in (18). Note that sample standard deviations can also be used to determine conﬁdence vectors for inﬁnite families of scalar features by using their growth function (Maurer and Pontil, 2009). Theorem 4 shows that conﬁdence vectors λ can decrease at a rate O(1/ n) using general bounded features. Such rate can be obtained similarly with unbounded sub-Gaussian features (Wainwright, 2019) and also with heavy-tailed features by using robust estimators for expectations (Lugosi and Mendelson, 2019). In addition, the order O( p

log(|F|)/n) for λ in the above result is consistent with that shown to provide consistent estimators using L1-regularization (B uhlmann and van de Geer, 2011).

4. Performance guarantees

This section characterizes the out-of-sample performance of 0 -1 MRCs. We ﬁrst present techniques that provide tight performance bounds at learning, then we show ﬁnite-sample generalization bounds for MRCs. In particular, we show that the proposed techniques can provide strong universal consistency using rich feature representations.

4.1 Tight performance bounds

The following result shows that the proposed approach allows to obtain upper and lower bounds for expected losses by solving convex optimization problems.

Minimax Risk Classifiers with 0 -1 Loss

Theorem 5 Let U be an uncertainty set given by (4), h be any classiﬁcation rule, and R(U, h) and R(U, h) be given by

R(U, h) = sup µ 1 τ Tµ + inf x X,y Y{Φ(x, y)Tµ h(y|x)} λT|µ| (20)

R(U, h) = inf µ 1 τ Tµ + sup x X,y Y {Φ(x, y)Tµ h(y|x)} + λT|µ|. (21)

Then, for any p U we have that R(U, h) ℓ(h, p) R(U, h). In addition, if U satisﬁes R1 or R2 the optimal values in (20) and (21) are attained.

Proof See Appendix E.

The techniques proposed in (Duchi et al., 2021; Shaﬁeezadeh-Abadeh et al., 2015, 2019) obtain upper and lower bounds corresponding with RRM methods that use uncertainty sets deﬁned in terms of f-divergences and Wasserstein distances. Such methods obtain classiﬁcation rules by minimizing the upper bound of a surrogate expected loss while MRCs minimize the upper bound of the 0 -1 expected loss (error probability). The bounds in (Duchi et al., 2021; Shaﬁeezadeh-Abadeh et al., 2015, 2019) as well as those in Theorem 5 become bounds for the true expected loss (risk) if the uncertainty set considered includes the true underlying distribution. Such situation can be attained with a tunable conﬁdence using uncertainty sets deﬁned by Wasserstein distances as in (Shaﬁeezadeh-Abadeh et al., 2015, 2019) or using the proposed uncertainty sets in (4) with expectation conﬁdence intervals. The above theorem provides lower and upper bounds for the error probability of any classiﬁcation rule. These bounds can be determined by solving the two convex optimization problems (20) and (21). As shown below, the upper bound for MRCs is obtained as a by-product of the learning process that solves optimization (6).

Corollary 6 Let h U be a 0 -1 MRC for an uncertainty set U that satisﬁes R1 or R2, then R(U, h U) = R(U) given by (9).

Proof Straightforward consequence of the above results since

R(U, h U) = sup p U ℓ(h U, p) = inf h T(X,Y)

sup p U ℓ(h, p) = R(U)

In order to simplify notation, we will denote R(U) := R(U, h U) for an MRC h U given by (11). Tight performance bounds can also be obtained at learning for deterministic 0 -1 MRCs. Speciﬁcally, such bounds are given by R(U, h U d) and R(U, h U d) that are obtained by solving the two optimization problems (20) and (21). The next section provides generalization bounds for 0 -1 MRCs. Such bounds also ensure generalization for deterministic 0 -1 MRCs because R(h U d) 2R(h U) as a direct consequence of the fact that 1 h U d(y|x) 2(1 h U(y|x)), x, y X Y.

Mazuelas, Romero, and Gr unwald

4.2 Finite-sample generalization bounds

This section provides MRCs generalization bounds in terms of optimal minimax rules, i.e., MRCs with the smallest minimax risk. Such rules correspond with uncertainty sets given by the true expectation of the feature mapping Φ, that is

U = {p (X Y) : Ep{Φ(x, y)} = τ } (22)

where the mean vector τ = Ep {Φ(x, y)} is the exact expectation with respect to the underlying distribution. The minimum wost-case error probability (minimax risk) over distributions in U is given by

RΦ = min µ 1 τ T µ + ϕ(µ) = 1 τ T µ + ϕ(µ )

corresponding with the MRC given by parameters µ . Such classiﬁcation rule is referred to as the optimal minimax rule for feature mapping Φ because for any uncertainty set U given by (4) that contains the underlying distribution p , we have that U U and hence RΦ R(U). The optimal minimax rule could only be obtained by an exact estimation of the expectation of the feature mapping Φ that in turn would require an inﬁnite amount of training samples. The following result provides generalization bounds for MRCs in terms of excess error probability with respect to the minimax risk at learning and the smallest minimax risk for the feature mapping used.

Theorem 7 Let U and U be uncertainty sets given by (4) and (22), respectively, that satisfy R1 or R2. If h U is a 0 -1 MRC for uncertainty set U, we have that

R(h U) R(U) + (|τ τ| λ)T|µ | (23)

R(h U) R(U) (|τ τ| λ)T|µ| (24)

R(h U) RΦ + |τ τ|T|µ µ | + λT(|µ | |µ |) (25)

with µ and µ solutions to (6) and (20), respectively. In particular, if λ is a conﬁdence vector with coverage probability 1 δ, i.e., P{|τ τ| λ} 1 δ, then, we have that

R(U) R(h U) R(U) (26)

R(h U) RΦ + 2λT|µ | RΦ + 2 λ µ 1. (27)

with probability at least 1 δ.

Proof See Appendix F.

Inequalities (23), (24), and (26) bound MRCs probabilities of error w.r.t. the corresponding minimax error probability R(U) and lower bound R(U). Such quantities can be obtained at learning, R(U) is given as a byproduct of the learning process that solves optimization Pτ,λ in (6) while R(U) requires to solve an additional convex optimization problem given by (20). Inequalities (25) and (27) bound MRCs probabilities of error w.r.t.

Minimax Risk Classifiers with 0 -1 Loss

the smallest minimax risk RΦ. As shown in the next section, this smallest minimax risk becomes the Bayes risk using rich feature mappings so that such generalization bounds enable to prove universal consistency results for MRCs. The generalization bounds in the above result depend on the accuracy of mean vector estimates and on the conﬁdence vector used. Several important conclusions can be drawn from such bounds:

The excess error probability with respect to minimax risks decreases at the same rate as the error of the mean vector estimates. As shown in Theorem 4, with wide generality the bounds in the above theorem show diﬀerences that decrease with n at a rate O(1/ n) using mean vector estimates given by sample averages as in (5). Methods that utilize 0 -1 loss but consider uncertainty sets with ﬁxed marginals provide signiﬁcantly coarser performance guarantees. In particular, the bounds in Theorem 3 of Farnia and Tse (2016) are O(1/ε n) where ε describes the norm of the error in sample mean estimates, which is commonly ε = O(1/ n). In addition, RRM methods based on Wasserstein distances achieve generalization bounds that decrease with n at a rate O(1/n1/d) that deteriorates with the instances dimensionality d (Shaﬁeezadeh-Abadeh et al., 2019; Frogner et al., 2021).

MRCs do not heavily rely on the choice of conﬁdence vectors. For a given mean vector τ, the upper bound in (23) takes its smallest value for the MRC corresponding with the conﬁdence vector λ given by the error |τ τ| in the mean vector estimate. Speciﬁcally, if Ue is the uncertainty set given by (4) with mean vector τ and conﬁdence vector λ = |τ τ|, the upper bound in (23) becomes R(Ue), and, for any uncertainty set U given by (4) with mean vector τ, we have that

R(Ue) = min µ 1 τ Tµ + ϕ(µ) + |τ τ|T|µ|

1 τ Tµ + ϕ(µ ) + λT|µ | + (|τ τ| λ)T|µ |

= R(U) + (|τ τ| λ)T|µ |

with µ solution of (6) for uncertainty set U.

Near-optimal generalization bounds can be obtained using conﬁdence vectors that approximate |τ τ|. Speciﬁcally, for any uncertainty set U as above, we have that

R(Ue) R(U) + (|τ τ| λ)T|µ |

= 1 τ Tµ + ϕ(µ ) + |τ τ|T|µ |

1 τ Tµe + ϕ(µe) + λT|µe| + (|τ τ| λ)T|µ |

= R(Ue) + (|τ τ| λ)T(|µ | |µe|)

with µe solution of (6) for uncertainty set Ue. Therefore, the diﬀerence between the upper bound in (23) for uncertainty set U and the smallest upper bound R(Ue) is bounded by (|τ τ| λ)T(|µ | |µe|), and both terms in such scalar product are small when λ |τ τ|.

The minimax risk optimized at learning can oﬀer an adequate assessment of the MRC s error probability even if the uncertainty set used does not include the underlying

Mazuelas, Romero, and Gr unwald

distribution. Such minimax risk R(U) obtained as a byproduct of the learning process provides valid upper bounds for the error probability of MRCs in cases where |τ τ| λ, i.e., p U. As shown in (23) and (24), R(U) and R(U) still provide approximate bounds in other cases as long as the conﬁdence vector is not signiﬁcantly smaller than the error in the mean vector estimates.

High-conﬁdence upper and lower bounds for error probabilities can be obtained using conﬁdence vectors with high coverage probability. Such conﬁdence vectors can be obtained using the expressions in Theorem 4 or numerical methods such as those proposed in Waudby-Smith and Ramdas (2023). Those vectors may not be adequate for MRC learning since they are often much larger than |τ τ|. Notice that a conﬁdence vector λδ that ensures {λδ |τ τ|} occurs with high probability for any underlying distribution is likely to be much larger than |τ τ| for a particular training set drawn from a speciﬁc underlying distribution. However, conﬁdence vectors with high coverage probability can be used to obtain error probability bounds for general MRCs that are valid with probability at least 1 δ. A simple way to get such bounds from R(U) and R(U) corresponding with a conﬁdence vector λ follows by noticing that (23) and (24) imply that

R(U) (λδ λ)T|µ| R(h U) R(U) + (λδ λ)T|µ | (28)

with probability at least 1 δ because λδ |τ τ| with that probability. Another way to obtain high-conﬁdence bounds follows by solving the convex optimization problems in Theorem 5 for the MRC corresponding to λ and the uncertainty set corresponding to λδ.

Theorem 7 and the above discussion exhibit the diﬀerent roles played by conﬁdence vectors that aim to bound the error in mean vector estimates with high probability for any probability distribution versus those that only aim to approximate such error. The former can be used to obtain provably valid performance guarantees while the later can be used to obtain small error probability and approximate performance bounds. Such roles are further studied numerically in Section 6 using multiple real datasets. The improved performance of methods that use aggressively chosen parameters has been observed in multiple ﬁelds of machine learning. For instance, in on-line learning methods, the theoretical prescriptions for learning rates that lead to valid prediction guarantees are routinely outperformed by choices that minimize prediction error on the data seen so far but have worse guarantees (see e.g., Devaine et al. (2013)). This phenomenon is related to the fact that in-expectation generalization bounds can be signiﬁcantly better than in-probability generalization bounds, and it is further explored in a PAC-Bayes setting by Gr unwald et al. (2021). The generalization bounds in Theorem 7 shed light on such phenomenon for the proposed MRCs.

4.3 Universal consistency

We show next that MRCs can be strongly universally consistent using rich feature mappings, that is, as the training size grows, the MRC s error probability tends to the Bayes risk with probability one for any underlying distribution.

Minimax Risk Classifiers with 0 -1 Loss

The theorem below shows universal consistency for MRCs given by feature mappings determined by random features corresponding with rich RKHSs (see e.g., Rahimi and Recht (2008); Bach (2017)). Speciﬁcally, let v1, v2, . . . be a sequence of i.i.d. samples from distribution p(v) generating random features ψv : X R for kernel k : X X R, that is

Ep(v){ψv(x)ψv(x )} = k(x, x ), x, x X.

In addition, for n = 1, 2, . . . the feature map Φn is deﬁned by the Dn random features ψv1, ψv2, . . . , ψv Dn as

Φn(x, y) = ey [1, ψv1(x), ψv2(x), . . . , ψv Dn(x)]T. (29)

Using such feature mappings, we have the following result.

Theorem 8 For n = 1, 2, . . ., let hn be an MRC for uncertainty set Un given by (4) with Φ = Φn as in (29), τ = τ n as in (5) and λ = λn 0. If we have that

(1) k : X X R is a characteristic kernel,

(2) scalar features are bounded, i.e., there exists C > 0 such that |ψv(x)| < C for all v and x,

(3) the number of random features Dn deﬁning the feature mapping Φn is non-decreasing and tends to inﬁnity, and

(4) any component of {λn} is non-increasing and tends to 0.

Then, the sequence of smallest minimax risks {RΦn} tends to the Bayes risk RBayes with probability one for any underlying distribution p . If, in addition to (1)-(4), we have that

(5) any component of {λn p

n/ log n} tends to , and

(6) Dn = O(nk) for some k > 0.

Then, the sequence of MRCs probabilities of error {R(hn)} tends to the Bayes risk RBayes with probability one for any underlying distribution p .

Proof See Appendix G.

The conditions under which MRCs are strongly universally consistent are analogous to those corresponding to conventional ERM methods based on SRM and RKHSs (Steinwart, 2005; Zhang, 2004), e.g., regularization parameters that tend to zero not very quickly and broad RKHSs. The universal consistency in ERM techniques is achieved using RKHSs given by universal kernels while the theorem above uses characteristic kernels, which is in general a slightly weaker condition (Muandet et al., 2017). In addition, the result above provides MRCs universal consistency for binary and multiclass cases and does not rely on surrogate losses.

Mazuelas, Romero, and Gr unwald

Notice that for ERM methods based on SRM, the decrease of the regularization parameter for increasing number of samples corresponds to consider broader families of rules (balls in the RKHS with increased radius). On the other hand, for the proposed approach, such decrease corresponds to consider reduced uncertainty sets. This fact illustrates how the bias-complexity trade-oﬀaddressed in the SRM approach by controlling the generality of the family of classiﬁcation rules is analogous to controlling the generality of the uncertainty set of probability distributions in the proposed approach (see also discussion in Section 2.1 above). In paticular, the conditions (5) and (6) in Theorem 8 result in uncertainty sets that shrink as we get more samples but contain the underlying distribution with high probability.

5. Eﬃcient learning of MRCs

The learning stage for MRCs consists on solving optimization problem (6) that obtains the parameters for h U and h U d as well as the minimax risk R(U). This stage can be complemented by solving optimization problems (20) and (21) that can provide tight performance guarantees. In this section we ﬁrst show that such optimization problems can be simpliﬁed by using a reduced instances set given by instances samples. We then propose eﬃcient subgradient methods that take advantage of the speciﬁc structure of the above mentioned optimization problems.

5.1 Reduced instances set

Solving the optimization problems that learn MRCs and obtain their performance bounds can be ineﬃcient in cases where the feature mapping range {Φ(x, y) : x X, y Y} has a large cardinality, since the evaluation of objective functions in (6), (20) and (21) may require to search over such a range. The next result shows that this diﬃculty can be avoided by using instances samples instead of the whole set X, e.g., using the instances obtained at training.

Theorem 9 Let Xs = {x1, x2, . . . , xs} be s i.i.d. instances from the underlying distribution p (x), hs, Rs(U), and Rs(U) be the MRC and bounds obtained by solving (6) and (20) using Xs instead of X. If Rs(U) is ﬁnite and p U with probability at least 1 δ, we have that

Rs(U) εs(2C µl 1 + 1) R(hs) Rs(U) + εs(2C µu 1 + 1) R(U) + εs(2C µu 1 + 1)

with probability at least 1 2δ, where µl and µu are the solutions of (20) and (21), respectively, using Xs instead of X, Φ(x, y) C, x X, y Y, and εs is given by

4 + |Y|(m + 1) log s + log |Y|/δ

Proof See Appendix H.

The above result shows that a suﬃciently large set of instances samples Xs can be safely used instead of the whole set X in optimization problems (6) and (20) since the subsequent approximation error is O( p

log(s)/s). An analogous result can be obtained for deterministic MRCs and (21) using the same arguments as in Appendix H for the above result.

Minimax Risk Classifiers with 0 -1 Loss

The condition |Rs(U)| < can be easily satisﬁed in practice. Such condition is equivalent to Us = n p (Xs Y) : |Ep{Φ} τ| λ o =

that is directly achieved if τ is obtained as the sample mean of samples with instances Xs. In other cases, it can be ensured that such uncertainty set is not empty by increasing the conﬁdence vector and shifting the mean vector. In particular, if λ 1 and λ 2 are solutions of the linear optimization problem

min p,λ1,λ2 1T(λ1 + λ2)

s.t. τ λ1 X

x Xs,y Y p(x, y)Φ(x, y) τ + λ2

λ1 λ, λ2 λ, p (Xs Y)

with 2m + s|Y| variables and 4m + s|Y| + 1 constraints. Then, taking eτ = τ + λ 2 λ 1 2 and eλ = λ 1+λ 2 2 λ we have that

Us e Us = n p (Xs Y) : |Ep{Φ} eτ| eλ o = .

The numerical results of next Section 6 show that the optimization problems for MRC learning can be accurately solved in practice using quite reduced sets of instances. In particular, such results show that the set of instances X in optimization problems (6), (20), and (21) can be taken as the set Xn = {x1, x2, . . . , xn} of instances obtained at training.

5.2 Eﬃcient optimization

This section presents eﬃcient optimization techniques for MRC learning that comprises to solve problems (6), (20) and (21). These three problems can be written as:

min µ f(µ) := a Tµ + λT|µ| + max{Fµ + b} (30)

with a Rm, b Rp and F Rp m. The size of vector a and the number of columns of matrix F, m, equals the number of components of the feature mapping Φ : X Y Rm, while the size of vector b and the number of rows of matrix F, p, is given by the number of instances Xs used to evaluate ϕ, e.g., the number of instances at training. Speciﬁcally, p equals s(2|Y| 1) in problem (6), and s|Y| in problems (20) and (21). The optimization problem (30) can be reformulated as the linear program (LP)

min µ1,µ2 a T(µ1 µ2) + λT(µ1 + µ2) + ν

s.t. Fµ1 Fµ2 + b ν1 µ1, µ2 0

by introducing new variables ν R and µ1, µ2 Rm with µ = µ1 µ2. Such an LP has p + 2m constraints and 2m + 1 variables, and can be solved with high accuracy by

Mazuelas, Romero, and Gr unwald

oﬀ-the-shelf solvers at the expenses of a high computational time in cases where p or m is large. Problem (30) belongs to the class of nondiﬀerentiable convex optimization problems with subgradients that are uniformly bounded. The subgradient methods (SMs) are often an attractive option to solve this class of problems since they can eﬃciently provide solutions with adequate accuracy (Bertsekas, 2015). A subgradient of the objective function f in (30) at point µ can be directly obtained from the maximum of vector v = Fµ + b. Speciﬁcally, if i(µ) arg max v we have that a subgradient of f at µ is given by

g(µ) = a + λ sign(µ) + coli(µ)(FT) (32)

where denotes the Hadamard product, sign(µ) denotes the vector given by the signs of the components of µ, and coli( ) denotes the i-th column of the argument. Often, the main limitation of SMs is the large number of iterations required. The accelerated subgradient methods (ASMs) (Nesterov and Shikhman, 2015; Tao et al., 2020) have been developed to reduce the number of iterations by using the Nesterov s extrapolation strategy (Nesterov, 1983). In particular, Algorithm 1 describes the application of the ASM proposed in Tao et al. (2020) to problem (30) for MRC learning.

Algorithm 1 ASM for MRC learning

Output: approximate solution µ

1: Initialize µ1 and take c1 = θ1 = 1, η1 = 0

2: y1 µ1, v1 Fµ1 + b, i1 arg max v1, µ µ1, f a Tµ1 + λT|µ1| + v(i1) 1 3: for k = 1, 2, . . . do

4: gk a + λ sign(µk) + colik(FT)

5: yk+1 µk ckgk, µk+1 (1 + ηk)yk+1 ηkyk 6: vk+1 Fµk+1 + b, ik+1 arg max vk+1

7: ck+1 (k + 1) 3/2, θk+1 2 k+1, ηk+1 θk+1 1 θk 1

8: fk+1 a Tµk+1 + λT|µk+1| + v(ik+1) k+1 9: if fk+1 < f then

10: f fk+1, µ µk+1 11: end if

12: end for

The time complexity per iteration of the SMs described above can be high in cases where p or m is large. This computational cost is due to the fact that the subgradient g(µk) is computed by evaluating vk = Fµk + b that requires pm multiplications. Such computation can be carried out in a signiﬁcantly more eﬃcient manner by exploiting the speciﬁc structure of the subgradient. The vector vk can be eﬃciently computed from vk 1 because we have that Fg(µk 1) = Fa + Fdiag(λ)sign(µk 1) + coli(µk 1)(FFT)

and Fa, FFT, and Fdiag(λ) can be computed only once. In addition, the sequence of vectors {dk = Fdiag(λ)sign(µk)} can be obtained as

dk = dk 1 + Fdiag(λ) k (33)

Minimax Risk Classifiers with 0 -1 Loss

where the vector k = sign(µk) sign(µk 1) takes values 2 in the components where the parameters µk 1 and µk change signs, takes value 0 in the components where they have the same sign, and would take values 1 only if some of the components of such parameters are exactly zero. Therefore, the computation vector dk in (33) can be carried out very eﬃciently in practice by adding or subtracting the columns of 2Fdiag(λ) corresponding to the components where the iterates change sign. Algorithm 2 shows the eﬃcient implementation of ASM that exploits the speciﬁc structure of the subgradient as described above. The result below describes how such implementation can result in signiﬁcant computational savings.

Algorithm 2 Eﬃcient ASM for MRC learning

Output: approximate solution µ

1: Initialize µ1 and take c1 = θ1 = 1, η1 = 0, α = Fa, G = FFT, H = 2Fdiag(λ)

2: y1 µ1, v1 Fµ1 + b, w1 v1, s1 sign(µ1), d1 (1/2)Hs1 3: i1 arg max v1, µ µ1, f a Tµ1 + λT|µ1| + v(i1) 1 4: for k = 1, 2, . . . do

5: gk a + λ sk + colik(FT)

6: yk+1 µk ckgk, µk+1 (1 + ηk)yk+1 ηkyk 7: uk α + dk + colik(G)

8: wk+1 vk ckuk, vk+1 (1 + ηk)wk+1 ηkwk, ik+1 arg max vk+1 9: sk+1 sign(µk+1), k sk+1 sk, dk+1 dk 10: for i = 1, 2, . . . m do

11: if (i) k = 2 then

12: dk+1 dk+1 + coli(H)

13: else if (i) k = 2 then

14: dk+1 dk+1 coli(H)

15: else if (i) k { 1, 1} then

16: dk+1 dk+1 + (1/2)sign( (i) k )coli(H)

18: end for

19: ck+1 (k + 1) 3/2, θk+1 2 k+1, ηk+1 θk+1 1 θk 1

20: fk+1 a Tµk+1 + λT|µk+1| + v(ik+1) k+1 21: if fk+1 < f then

22: f fk+1, µ µk+1 23: end if

24: end for

Theorem 10 The sequences {µk} generated by Algorithms 1 and 2 are identical, while the computational complexity of Algorithm 2 is signiﬁcantly smaller at iterations k where sign(µk) diﬀers from sign(µk 1) in few components. Speciﬁcally, if γK is the sparsity coeﬃcient given by the average fraction of non-zero components of k for k = 1, 2, . . . , K; then, the computational complexity of Algorithm 2 after K iterations is O(KpmγK), while that of Algorithm 1 is O(Kpm).

Mazuelas, Romero, and Gr unwald

Proof See Appendix I.

The computation in step 6 of Algorithm 1 that has cost O(pm) is eﬀectively replaced by steps 7-18 in Algorithm 2 that have cost O pmγ( k) where γ( k) denotes the fraction of non-zero components of k. As the algorithm progresses and the subgradient steps decrease to zero, it is expected to have few changes in the signs of µk and therefore to have a highly sparse vector k. The usage of an ASM also contributes to a reduced number of sign changes since such method provides more stable iterations than basic subgradient methods. The method in Nesterov (2014) also exploits the sparsity in subgradient methods for piecewise linear functions but it considers problems given only by a term max{Fµ + b} and with a sparse matrix F. In order to further decrease the number of iterations, we also use restarts similarly to other methods for non-smooth optimization (Yang and Lin, 2018). Our numerical results conﬁrm the eﬃciency of the implementation described in Algorithm 2 which resulted in a signiﬁcant reduction (more than 90%) of the running time per iteration in comparison with Algorithm 1.

6. Numerical results

This section shows four sets of numerical results that describe how the proposed techniques can enable to learn MRCs eﬃciently and to obtain reliable and tight performance bounds at learning. We utilize 12 common datasets from the UCI repository (Dua and Graﬀ, 2017) with characteristics given in Table 1. MRCs implementation is available in the open-source Python library MRCpy (Bondugula et al., 2023) https://Machine Learning BCAM.github. io/MRCpy/. In this section, MRCs are implemented using a feature mapping given by random features corresponding with a Gaussian kernel (Rahimi and Recht, 2008; Bach, 2017), that is

Φ(x, y) = ey Ψ(x) = ey [cos u T 1 x, sin u T 1 x, cos u T 2 x, sin u T 2 x, . . . , cos u T Dx, sin u T Dx] (34)

where x is a normalized instance and u1, u2, . . . , u D are i.i.d. samples from a zero-mean Gaussian distribution with covariance (1/σ2)I for a scaling parameter σ. In addition, for n training samples (x1, y1), (x2, y2), . . . (xn, yn) we obtain conﬁdence vectors as

where υ is the vector formed by the sample variances of the feature mapping components. In all the numerical experiments, if not stated otherwise, we take λ0 = 0.3 and use D = 500 random Fourier features corresponding with the scaling parameter σ = p

d/2 for d the number of instances components. We compare MRCs performance and bounds with those obtained by Wasserstein-based RRM and by PAC-Bayes methods. Speciﬁcally, we utilize the distributionally robust logistic regression (DRLR) method as described in Shaﬁeezadeh Abadeh et al. (2015) with uncertainty sets deﬁned by Wasserstein distances, and a support vector machine (SVM) with upper bounds given by PAC-Bayes as described in Langford (2005); Ambroladze et al. (2007). The ﬁrst set of numerical results shows that MRCs optimization problems can be eﬃciently addressed in practice by using a reduced set of instances. Speciﬁcally, for s i.i.d.

Minimax Risk Classifiers with 0 -1 Loss

Table 1: Characteristics of the datasets used for experimentation. Data set Number of samples Instances dimensionality % Majority class Adult 48,842 14 76 Pulsar 17,898 8 91 Credit 690 15 56 QSAR 1055 41 66 Mammographic 961 5 54 Haberman 306 3 74 Ion 351 34 64 Heart 270 13 56 Liver 583 10 71 Blood 748 4 76 Diabetes 768 8 65 Audit 776 26 61

Number of instances s

Classif cation error

Lower bound Rs(U)

Upper bound Rs(U) Error prob. R(hs) Lower bound R(U)

Upper bound R(U) Error prob. R(h)

1000 2000 3000 4000 5000

(a) Adult dataset

Number of instances s

Classif cation error

Lower bound Rs(U)

Upper bound Rs(U) Error prob. R(hs) Lower bound R(U)

Upper bound R(U) Error prob. R(h)

1000 2000 3000 4000 5000

(b) Pulsar dataset

Figure 1: Decrease of optimization error with the number of instances used.

instances Xs = {x1, x2, . . . , xs}, we solve optimization problems (6) and (20) taking X = Xs and we study how the error of such approximation decreases with increasing s. For these numerical results we use Adult and Pulsar datasets from UCI repository (Dua and Graﬀ, 2017) that contain a total of 48,842 and 17,898 instances, respectively. We obtain τ and λ using 500 training samples. Then, we solve (6) and (20) taking X composed by all the instances in the dataset and taking X composed by a subset of randomly selected instances Xs of size s.

In Figure 1, R(hs), Rs(U) and Rs(U) denote, as in Theorem 9, the error probability and bounds of the MRC obtained solving (6) and (20) taking X = Xs = {x1, x2, . . . , xs}. Solid curves describe the average results in 50 random repetitions, the shaded areas describe the intervals formed by the standard deviation around the averages, and dashed lines describe the results obtained taking X composed by all the instances. The ﬁgure shows that the

Mazuelas, Romero, and Gr unwald

Optimization error

Running time [s]

E-ASM E-ASM-R

(a) Credit dataset

Optimization error

Running time [s]

E-ASM E-ASM-R

(b) Haberman dataset

Optimization error

Running time [s]

E-ASM E-ASM-R

(c) Mammographic dataset

10 3 2 3 4 5

Optimization error

Running time [s]

E-ASM E-ASM-R

(d) Diabetes dataset

Figure 2: The proposed eﬃcient ASM s implementation that exploits the subgradients structure can signiﬁcantly reduce the running time required to accurately learn MRCs.

results obtained using a reduced set of instances quickly converge to those obtained using all the instances. These results agree with the theoretical guarantees for this approximation shown in Theorem 9 and show that the constants of the bounds in such result can be quite small in practice. In the remaining numerical results we solve optimization problems (6), (20), and (21) taking X as the n instances at training.

The second set of numerical results assesses the performance of the eﬃcient SMs presented in Section 5.2. We solve problem (6) for MRC learning using four methods: the basic SM (BSM) with step size ck = 1/(

k + 1 gk ); the eﬃcient BSM that exploits the subgradient structure as shown in Section 5.2 (E-BSM); the ASM described in Algorithm 1; the eﬃcient ASM that exploits the subgradient structure as detailed in Algorithm 2 (E-ASM); and the eﬃcient ASM described in Algorithm 2 combined with restarts every 10,000 iterations (E-ASM-R).

Minimax Risk Classifiers with 0 -1 Loss

Figure 2 shows the optimization error per running time averaged over 10 random partitions of four datasets. The results manifest the signiﬁcant computational savings obtained by the eﬃcient implementation presented in Section 5.2. Such eﬃcient implementation is especially advantageous for ASM since it results in a sparsity coeﬃcient around 10 3 while that achieved in BSM is around 10 1. The restart strategy achieves worse computing time per iteration since it results in sparsity coeﬃcients around 10 2. However, such strategy provides an overall improvement in running time since it requires signiﬁcantly less iterations.

The third set of numerical results shows how the MRCs performance and bounds change by varying the conﬁdence vectors used. We study the eﬀect of varying the conﬁdence vectors at learning and the performance guarantees obtained using conﬁdence vectors with high coverage probability. In addition, we compare the results obtained in the practical case where conﬁdence vectors are obtained from training samples with those obtained in the ideal case where the error in the mean vector estimates is known. Speciﬁcally, for the practical case we learn MRCs using λ as in (35) and obtain high-conﬁdence performance guarantees using λδ for δ = 0.05 given by the conﬁdence intervals proposed in Waudby-Smith and Ramdas (2023). For the ideal case, we learn MRCs using λ = λ0|τ τ|, and obtain high-conﬁdence performance guarantees using λδ = |τ τ|. In particular, a simple high-conﬁdence upper bound is obtained using (28), and a tighter high-conﬁdence upper bound is obtained using Theorem 5 with λ = λδ. In order to reproduce the ideal case, in these numerical results we use Adult and Pulsar datasets, the mean τ is calculated using all the samples while 1, 000 samples are randomly sampled for training in 50 repetitions.

Figure 3 shows the error probability R(h U) and upper bound R(U) of MRCs corresponding with diﬀerent values of λ0, together with the simple high-conﬁdence upper bound obtained using (28), and the tighter high-conﬁdence upper bound obtained using Theorem 5. Such ﬁgure shows that MRCs do not heavily rely on the choice of the conﬁdence vector. In particular, the error probability obtained using λ as in (35) is similar to the error probability that would be obtained using the actual diﬀerence |τ τ|, and the minimax risk R(U) optimized at learning is similar to the error probability for most values of λ0. In addition, the usage of conﬁdence vectors with high coverage probability can result in performance guarantees that are both valid with high probability and informative. Such numerical results are in agreement with the conclusions drawn from Theorem 7 in Section 4.2. In particular, Figure 3 shows that the smallest valid upper bound is obtained using λ = |τ τ| but more aggressive conﬁdence vectors can result in improved error probability and yet provide useful performance bounds.

We also compare the performance and bounds of MRCs for varying conﬁdence vectors with those obtained by DRLRs for varying Wasserstein radius. Figure 4 shows the error probability and performance bounds of MRCs and DRLRs obtained by varying 0 λ0 1 and the Wasserstein radius 10 5 ρ 1, respectively. In particular, for each value of λ0 and ρ we averaged over 10-fold stratiﬁed partitions the error and bounds of MRCs and DRLRs. Figure 4 shows that MRCs can provide tighter performance bounds than existing techniques based on RRM with Wasserstein distances. The ﬁgure also shows that, even for datasets of similar size, DRLR methods require to ﬁne-tune the Wasserstein radius ρ in order to obtain reliable and tight performance bounds while MRCs can utilize a conﬁdence parameter λ0 that does not strongly depend on the dataset. Figure 4 also shows that deterministic 0 -1

Mazuelas, Romero, and Gr unwald

Classif cation error

Upper bound Tight high conf. upper bound Simple high conf. upper bound Error probability 0.4

0 0 0.5 1 1.5 2

(a) Adult dataset, ideal case

Classif cation error

Upper bound Tight high conf. upper bound Simple high conf. upper bound Error probability 0.4

0 0 0.5 1 1.5 2

(b) Adult dataset, practical case

Classif cation error

Upper bound Tight high conf. upper bound Simple high conf. upper bound Error probability

0 0 0.5 1 1.5 2

(c) Pulsar dataset, ideal case (d) Pulsar dataset, practical case

Figure 3: Error probabilities and performance bounds of MRCs for diﬀerent choices of conﬁdence vectors in comparison with the ideal case that use the actual error in mean vector estimates.

Minimax Risk Classifiers with 0 -1 Loss

(a) MRC performance bounds and classiﬁcation error in credit dataset.

(b) DRLR performance bounds and classiﬁcation error in credit dataset.

(c) MRC performance bounds and classiﬁcation error in QSAR dataset.

(d) DRLR performance bounds and classiﬁcation error in QSAR dataset.

(e) MRC performance bounds and classiﬁcation error in mammographic dataset.

(f) DRLR performance bounds and classiﬁcation error in mammographic dataset.

Figure 4: MRCs performance bounds for varying conﬁdence vectors in comparison with those obtained by RRM techniques based on Wasserstein distances.

Mazuelas, Romero, and Gr unwald

Table 2: Classiﬁcation error of MRC with model selection based on the performance bounds in comparison with state-of-the-art techniques.

Data set SVM-CV SVM-PAC DRLR-UB MRC-UB Det. MRC-UB Haberman .26 .27 .32 .25 .25 Heart .17 .17 .23 .21 .18 Liver .28 .28 .33 .29 .28 Blood .22 .23 .25 .24 .22 Credit .14 .13 .20 .15 .14 Diabetes .23 .23 .30 .27 .24 Ion .08 .08 .07 .12 .07 QSAR .11 .12 .14 .18 .12 Mammographic .17 .17 .21 .19 .18 Audit .05 .06 .04 .07 .06

MRCs, h U d, can obtain improved classiﬁcation error in practice at the expenses of less tight performance guarantees. The fourth set of numerical results shows how MRCs performance bounds can be used for model selection. We compare the performance obtained by selecting the scaling parameter of a Gaussian kernel using four methods, one based on conventional cross-validation and three based on error upper bounds. Speciﬁcally, SVM-CV selects the scaling parameter for which a SVM classiﬁer achieves the smallest cross-validation error over 10-fold partitions of the training data. SVM-PAC, DRLR-UB, and MRC-UB select the scaling parameter with smallest error upper bound in training. Such upper bounds are obtained for SVM-PAC by using PAC-Bayes methods as in Langford (2005); Ambroladze et al. (2007), while for DRLR-UB and MRC-UB the upper bounds are obtained as the value the optimization problem solved at learning as shown in Shaﬁeezadeh-Abadeh et al. (2015) and in equation (6), respectively. For all datasets, MRCs utilize conﬁdence vectors as in (35) with λ0 = 0.3 and DRLRs utilize Wasserstein radious of ρ = 0.003 as in Shaﬁeezadeh-Abadeh et al. (2015). Table 2 shows the classiﬁcation error obtained by the methods compared in 10 datasets. Speciﬁcally, we generate 20 random stratiﬁed splits with 20% test samples. In each split, we utilize the training samples to learn a classiﬁer selecting a scaling parameter over 20 uniformly spaced candidates between the 10th and 90th percentiles of the Euclidean distances among normalized instances. Then, the error of the corresponding classiﬁer is estimated using the test samples and the ﬁnal values in the table are obtained by averaging the results over the 20 random splits of the data. Table 2 shows that the proposed MRCs as well as methods based on PAC-Bayes can obtain similar performance to that obtained by conventional methods based on cross-validation. However, MRCs and PAC-Bayes methods require a signiﬁcantly smaller complexity since they only need to carry out one optimization per parameter value while cross-validation methods require to carry out such optimization 10 times per parameter value. In addition, the performance bounds obtained by MRCs can be used not only for model selection but also to obtain tight estimates for the classiﬁcation error. Figure 5 shows the diﬀerences between performance bounds and error probabilities for all

Minimax Risk Classifiers with 0 -1 Loss

Figure 5: Boxplots with diﬀerences between error probability and bounds for PAC, DRLR, and MRC methods over multiple datasets and hyperparameter conﬁgurations.

the splits, datasets, and hyper-parameters in this fourth set of numerical results. The ﬁgure shows that the performance bounds provided by MRCs are much tighter than those based on PAC-Bayes and much more reliable that those based on RRM that use Wasserstein balls without a ﬁne-tuned radious. In particular, the diﬀerence between the classiﬁcation error and the performance bounds are around 0.05 for MRCs and around 0.1 for deterministic MRCs.

7. Conclusion

The paper presents minimax risk classiﬁers (MRCs) that minimize the worst-case 0 -1 loss over general classiﬁcation rules and provide tight performance guarantees. We show how the out-of-sample performance of MRCs can be reliably estimated at learning, and that the MRCs error due to ﬁnite training sizes is determined by the accuracy of expectation estimates. In addition, we show that MRCs are strongly universally consistent in situations analogous to those corresponding with kernel-based methods. The proposed methodology can oﬀer classiﬁcation techniques that do not rely on speciﬁc training samples and choices for surrogate losses/hypothesis classes. Instead, MRC learning is based on expectation estimates, and its inductive bias comes only from a feature mapping that determines which expectations are estimated. Therefore, the methods presented can provide techniques that are robust to practical situations that defy common assumptions, e.g., training samples that follow a diﬀerent distribution or display heavy tails.

Mazuelas, Romero, and Gr unwald

Acknowledgments

Funding in direct support of this work has been provided by projects PID2022-137063NB-I00, CNS2022-135203, and CEX2021-001142-S funded by MCIN/AEI/10.13039/501100011033 and the European Union Next Generation EU /PRTR, and programes ELKARTEK and BERC-2022-2025 funded by the Basque Government.

Appendix A.

Lemma 11 Let U be an uncertainty set given by (4). For any h T(X, Y), we have that

sup p U ℓ(h, p) inf µ Rm 1 τ Tµ + sup x X,y Y

Φ(x, y)Tµ h(y|x) + λT|µ| (36)

In addition, if U that satisﬁes R1 or R2, we have that

sup p U ℓ(h, p) = min µ Rm 1 τ Tµ + sup x X,y Y

Φ(x, y)Tµ h(y|x) + λT|µ| (37)

Proof In the ﬁrst step of the proof we show that the right hand side of (36) is equivalent to the Fenchel dual of the left hand side, then the second step of the proof shows that strong duality holds if U satisﬁes R1 or R2. Let 0 r m be the number of non-zero components of λ (without loss of generality we assume such components are the r ﬁrst components of λ), and M(X Y) be the set of signed Borel measures over X Y with bounded total variation. The set M(X Y) is a Banach space with the total variation norm (see e.g., Chapter 10 in Aliprantis and Border (1994)). If A is the linear mapping

A : M(X Y) Rr+m+1

p 7 [ R Φ1(x, y)dp(x, y), R Φ1(x, y)dp(x, y), R Φ2(x, y)dp(x, y), R dp(x, y)] (38)

where Φ1(x, y) Rr denotes the ﬁrst r components of Φ(x, y) and Φ2(x, y) Rm r denotes the last m r components of Φ(x, y). A is bounded and its adjoint operator is

A : Rr+m+1 F(X Y) (M(X Y))

µ1, µ2, µ3, ν 7 Φ1( , )T(µ1 µ2) + Φ2( , )Tµ3 + ν. (39)

where F(X Y) is the set of bounded Borel measurable functions over X Y. Then, we have that

sup p U ℓ(h, p) = sup p U 1 Z h(y|x)dp(x, y) = 1 inf p M(X Y) f(p) + g(A(p)) (40)

where f and g are the lower semi-continuous convex functions

g : Rr+m+1 R { }

(a1, a2, a3, b) 7 0 if a1 τ 1 + λ1, a2 τ 1 + λ1, a3 = τ 2, b = 1 otherwise (41)

Minimax Risk Classifiers with 0 -1 Loss

for τ 1, λ1 Rr (resp. τ 2, λ2 Rm r) given by the ﬁrst r (resp. last m r) components of τ and λ, and

f : M(X Y) R { }

p 7 R h(y|x)dp(x, y) if p is nonnegative otherwise. (42)

Then, the Fenchel dual (see e.g., Borwein and Zhu (2004)) of (40) is

1 sup µ1,µ2,µ3,ν f (A (µ1, µ2, µ3, ν)) g ( µ1, µ2, µ3, ν) (43)

where f and g are the conjugate functions of f and g. If w F(X Y), we have that

f (w) = sup p 0

Z (w(x, y) h(y|x))dp(x, y)

= 0 if w(x, y) h(y|x), x, y X Y otherwise

and g ( µ1, µ2, µ3, ν) is given by

sup a T 1 µ1 a T 2 µ2 τ T 2 µ3 ν s.t. a1 τ 1 + λ1, a2 τ 1 + λ1

= (τ 1 + λ1)Tµ1 + (τ 1 λ1)Tµ2 τ T 2 µ3 ν if µ1 0, µ2 0 otherwise.

Hence, the dual problem (43) becomes

inf µ1,µ2,µ3,ν 1 (τ 1 + λ1)Tµ1 ( τ 1 + λ1)Tµ2 τ T 2 µ3 ν

s.t. Φ1(x, y)T(µ1 µ2) + Φ2(x, y)Tµ3 + ν h(y|x), (x, y) X Y µ1 0, µ2 0

= inf µ1,µ2,µ3,ν 1 + (τ 1 + λ1)Tµ1 + ( τ 1 + λ1)Tµ2 τ 2µ3 ν

s.t. Φ1(x, y)T(µ2 µ1) + Φ2(x, y)Tµ3 + ν h(y|x), (x, y) X Y µ1 0, µ2 0 (44)

= inf µ,ν 1 τ Tµ + λT|µ| ν

s.t. Φ(x, y)Tµ + ν h(y|x), (x, y) X Y (45)

= inf µ 1 τ Tµ + sup (x,y) X Y

Φ(x, y)Tµ h(y|x) + λT|µ|. (46)

The expression in (45) is obtained taking µ = [(µ2 µ1)T, µT 3 ]T. Firstly, in (44) we can consider only pairs µ1, µ2 such that µ1 + µ2 = |µ1 µ2| because for any pair µ1, µ2 feasible in (44), we have that µ1 = (µ2 µ1)+, µ2 = (µ1 µ2)+ is a feasible pair because µ1 µ2 = µ1 µ2, and we also have that λT 1 | µ1 µ2| = λT 1 ( µ1 + µ2) λT 1 (µ1 + µ2). Then, we obtain (45) from (44) because

τ T 1 (µ2 µ1) + τ T 2 µ3 = τ Tµ

λT 1 |µ2 µ1| = λT|µ|, since λ2 = 0

Φ1(x, y)T(µ2 µ1) + Φ2(x, y)Tµ3 = Φ(x, y)Tµ.

Mazuelas, Romero, and Gr unwald

The expression in (46) is obtained since for any feasible (µ, ν) in (45) we have that (µ, ν) is feasible if ν = inf (x,y) X Y

h(y|x) Φ(x, y)Tµ

and ν ν. Then, the inequality in (36) follows by weak duality. For the second step of the proof, if U satisﬁes R1, we have that strong duality holds because the dual becomes the Lagrange dual and the constraints in (40) are linear aﬃne (see e.g., Chapter 5 in Boyd and Vandenberghe (2004)). If U satisﬁes R2, we show in the following that strong duality holds because 0 int(dom g Adom f) (see e.g., Chapter 4 in Borwein and Zhu (2004)), where dom denotes the set where an extended-valued function takes ﬁnite values, and int denotes the interior of a set. If U satisﬁes R2.1, there exists R > 0 such that R < λ(i) |EpΦ(i)(x, y) τ (i)| for i = 1, 2, . . . , m. Then, the second step of the proof is obtained by showing that if 0 < ε < R/(R + 1 + Ep{Φ} 2) < 1, we have that the ball with radius ε centered in 0 Rr+m+1, B(0, ε) satisﬁes B(0, ε) (dom g Adom f) Rr+m+1. We have that for any z B(0, ε) R2m+1, there exist ξ1, ξ2 B(0, R) Rm such that z(1), z(2), . . . , z(m) = Ep{Φ} + ξ1 (1 z(2m+1))Ep{Φ} z(m+1), z(m+2), . . . , z(2m) = Ep{Φ} + ξ2 + (1 z(2m+1))Ep{Φ}

z(2m+1) = 1 (1 z(2m+1))

because we have that

z(1), z(2), . . . , z(m) z(2m+1)Ep{Φ} 2 ε(1 + Ep{Φ} 2) < R

z(m+1), z(m+2), . . . , z(2m) + z(2m+1)Ep{Φ} 2 ε(1 + Ep{Φ} 2) < R.

Then, the result is obtained observing that Ep{Φ} + ξ1, Ep{Φ} + ξ2, 1 dom g

because R < λ(i) |EpΦ(i)(x, y) τ (i)| for i = 1, 2, . . . , m, and

(1 z(2m+1))Ep{Φ}, (1 z(2m+1))Ep{Φ}, (1 z(2m+1)) Adom f

because |z(2m+1)| ε < 1 and hence (1 z(2m+1))p is a nonnegative measure. If U satisﬁes R2.2, we have that Ep{Φ} int(Conv S), where S denotes the support of Φ(x, y) if (x, y) p, and Conv denotes the convex hull of a set. Firstly, int(Conv S) = because S and hence (Conv S) S are not contained in a proper aﬃne subspace. In the case that Ep{Φ} Conv S\Conv S, using the Hanh-Banach separation theorem (see e.g., Theorem 2, Sec 5.12 in Luenberger (1997)), there would exists a hyperplane Γ = {z; αTz = c} Rm

such that Ep{Φ} Γ and αTz c for all z Conv S. Then, the real-valued random variable αTΦ(x, y) c with (x, y) p is nonnegative with probability one and has expectation zero. Therefore, αTΦ(x, y) = c with probability one, and we would have that the S is contained in the hyperplane Γ, which leads to a contradiction. Let R > 0 be such that

Minimax Risk Classifiers with 0 -1 Loss

1. R < λ(i) |EpΦ(i)(x, y) τ (i)| for i = 1, 2, . . . , r, and

2. for any ξ in the Euclidean ball of Rm centered at 0 with radius R, B(0, R), we have that Ep{Φ}+ξ Conv S, that is, there exists pξ (X Y) such that Epξ{Φ} = Ep{Φ}+ξ.

As in the previous case, the second step of the proof is obtained by showing that if 0 < ε < R/(R + 1 + Ep{Φ} 2) < 1, we have that B(0, ε) (dom g Adom f) Rr+m+1. If r = 0, we have that for any z B(0, ε) Rm+1, there exist ξ3 B(0, R) such that z(1), z(2), . . . , z(m) = Ep{Φ} (1 z(m+1))(Ep{Φ} + ξ3)

z(m+1) = 1 (1 z(m+1))

because we have that

z(1), z(2), . . . , z(m) + z(m+1)Ep{Φ} 2 1 zm+1 ε 1 ε(1 + Ep{Φ} 2) < R.

Then, the result is obtained observing that Ep{Φ}, 1 dom g

and (1 z(m+1))(Ep{Φ} + ξ3), (1 z(m+1)) Adom f

because |z(m+1)| ε < 1 and hence (1 z(m+1))pξ3 is a nonnegative measure, for pξ3 satisfying Epξ3{Φ} = Ep{Φ} + ξ3. If 0 < r < m, we have that for any z B(0, ε) Rr+m+1, there exist ξ1, ξ2 B(0, R) Rr, and ξ3 B(0, R) Rm r such that z(1), z(2), . . . , z(r) = Ep{Φ1} + ξ1 (1 z(r+m+1))Ep{Φ1} z(r+1), z(2), . . . , z(2r) = Ep{Φ1} + ξ2 + (1 z(r+m+1))Ep{Φ1} z(2r+1), z(2r+2), . . . , z(r+m) = Ep{Φ2} (1 z(r+m+1))(Ep{Φ2} + ξ3)

z(r+m+1) = 1 (1 z(r+m+1))

because we have that

z(1), z(2), . . . , z(m) z(r+m+1)Ep{Φ1} 2 ε(1 + Ep{Φ} 2) < R

z(r+1), z(r+2), . . . , z(2r) + z(r+m+1)Ep{Φ1} 2 ε(1 + Ep{Φ} 2) < R

z(2r+1), z(2r+2), . . . , z(r+m) + z(r+m+1)Ep{Φ2} 2 1 zr+m+1 ε 1 ε(1 + Ep{Φ} 2) < R.

Then, the result is obtained observing that Ep{Φ1} + ξ1, Ep{Φ1} + ξ2, Ep{Φ2}, 1 dom g

because R < λ(i) |EpΦ(i)(x, y) τ (i)| for i = 1, 2, . . . , r, and

(1 z(r+m+1)) Ep{Φ1}, Ep{Φ1}, Ep{Φ2} + ξ3, 1 Adom f

Mazuelas, Romero, and Gr unwald

because |z(r+m+1)| ε < 1 and hence (1 z(r+m+1))pξ3 is a nonnegative measure, for pξ3 satisfying Epξ3{Φ1} = Ep{Φ1} and Epξ3{Φ2} = Ep{Φ2} + ξ3 that exists because (0T, ξT 3 )T 2 R. Finally, since strong duality holds and U is not empty we have that the optimal value in (37) is ﬁnite and hence the optimal in the dual is attained (Borwein and Zhu, 2004) and the inf in (36) becomes min .

Appendix B. Proof of Theorem 2

Proof Using Lemma 11 we have that

inf h T(X,Y) sup p U ℓ(h, p) = inf h,µ 1 τ Tµ + λT|µ| + sup x X,y Y {Φ(x, y)Tµ h(y|x)}

= min µ 1 τ Tµ + λT|µ| + inf h sup x X,y Y {Φ(x, y)Tµ h(y|x)}

inf h sup x X,y Y {Φ(x, y)Tµ h(y|x)} = inf h,ν ν

s.t. Φ(x, y)Tµ h(y|x) ν, x X, y Y.

Then, the result is obtained because

Φ(x, y)Tµ h(y|x) ν, x X, y Y h(y|x) Φ(x, y)Tµ ν, x X, y Y

y C (Φ(x, y)Tµ ν) 1, x X, C Y

P y C Φ(x, y)Tµ 1

|C| , x X, C Y

with ϕ(µ) given by (7). For each µ, there exist classiﬁcation rules h satisfying

h(y|x) Φ(x, y)Tµ ϕ(µ), x X, y Y

by deﬁnition of ϕ(µ). Hence, such classiﬁcation rules are solution of

inf h sup x X,y Y {Φ(x, y)Tµ h(y|x)}

with optimal value ϕ(µ).

Minimax Risk Classifiers with 0 -1 Loss

Appendix C. Proof of Theorem 3

Proof The result can be proven analogously of that in Theorem 2 shown in Appendix B. Firstly, for each h T(X, Y), we have that

supp V ℓ(h, p) = 1 min p h Tp + I+(p)

s.t. P y Y p(xi, y) = 1

n, i = 1, 2, . . . , n τ λ ΦTp τ + λ

where p, h, and Φ denote the vectors and matrix with rows p(xi, y), h(y|xi) and Φ(xi, y)T, respectively, for y Y, i = 1, 2, . . . , n, and

I+(p) = 0 if p 0 otherwise.

Optimization problem (47) has Fenchel (Lagrange) dual

1 max µ1,µ2,ν τ λ Tµ1 τ + λ Tµ2 1

n Pn i=1 ν(i) f (Φ(µ1 µ2) eν)

s.t. µ1, µ2 0

where eν is the vector in Rn|Y| with component corresponding with (xi, y) for i = 1, 2, . . . , n, y Y given by ν(i), and f is the conjugate function of f(p) = h Tp + I+(p) given by

f (w) = sup p 0 w Tp h Tp = 0 if w h otherwise.

Therefore, the Lagrange dual above becomes

1 max µ1,µ2,ν τ λ Tµ1 τ + λ Tµ2 1

n Pn i=1 ν(i)

s.t. µ1, µ2 0 Φ(xi, y)T(µ1 µ2) ν(i) h(y|xi), y Y, i = 1, 2, . . . , n.

It is easy to see that the solution of such optimization problem µ1, µ2 satisﬁes that µ(i) 1 µ(i) 2 = 0 for any i such that λi > 0. Then λT( µ1 + µ2) = λT| µ1 µ2| and taking µ = µ1 µ2 the Lagrange dual above is equivalent to

1 max µ,ν τ Tµ λT|µ| 1

n Pn i=1 ν(i)

Φ(xi, y)Tµ ν(i) h(y|xi), y Y, i = 1, 2, . . . , n

that has the same value as supp V ℓ(h, p) since the constraints in (47) are aﬃne and V is non-empty. Therefore,

inf h T(X,Y) sup p V ℓ(h, p) = inf h,µ,ν 1 τ Tµ + λT|µ| + 1

Φ(xi, y)Tµ ν(i) h(y|xi), y Y, i = 1, 2, . . . , n

Mazuelas, Romero, and Gr unwald

and, similarly to the proof for Theorem 2, we have that

Φ(xi, y)Tµ ν(i) h(y|xi), y Y, i = 1, 2, . . . , n

y C Φ(xi, y)Tµ ν(i) 1, C Y, i = 1, 2, . . . , n

P y C Φ(xi, y)Tµ 1

|C| , C Y, i = 1, 2, . . . , n

ν(i) ϕ(µ, xi), i = 1, 2, . . . , n.

Therefore, for each µ, we have that any classiﬁcation rule satisfying

h(y|x) Φ(x, y)Tµ ϕ(µ, x), x X, y Y

is solution of

inf h,ν 1 n

Φ(xi, y)Tµ ν(i) h(y|xi), y Y, i = 1, 2, . . . , n

that has optimal value 1

n Pn i=1 ϕ(µ, xi). Then, the result is obtained because for any x X, we have that X

Φ(x, y)Tµ ϕ(µ, x)

because otherwise there would exist νx < ϕ(µ, x) such that

Φ(x, y)Tµ νx

+ = max C Y

y C Φ(x, y)Tµ νx

which contradicts the deﬁnition of ϕ(µ, x).

Appendix D. Proof of Theorem 4

Proof The ﬁrst result is a direct consequence of Hoeﬀding s inequality and the union bound since each component of Φ is bounded by its corresponding scalar feature. Similarly, the second result is a consequence of the empirical Bernstein inequality in Maurer and Pontil (2009).

Minimax Risk Classifiers with 0 -1 Loss

For the last result, we have that for each j Y

i=1 ψ(xi)I{yi = j} E ψ(x)I{y = j}

i=1 ψ(xi)I{yi = j} p (y = j)E{ψ(x|y = j)}

i=1 ψ(xi)I{yi = j} E{ψ(x|y = j)}

+ E{ψ(x|y = j)} nj

n p (y = j)

i=1 ψ(xi)I{yi = j} E{ψ(x|y = j)}

n p (y = j)

For the ﬁrst term, we have that with probability at least 1 δ

i=1 ψ(xi)I{yi = j} E{ψ(x|y = j)}

2Rnj(F) + 2C p

log 2/δ p2nj

using the uniform concentration bound in terms of Rademacher complexities (see e.g., Mohri et al. (2018)). For the second term, we have that with probability at least 1 δ

n p y(y = j)

using Hoeﬀding s inequality. Therefore, the result is obtained using the union bound.

Appendix E. Proof of Theorem 5

Proof The result for R(U, h) is a direct consequence of Lemma 11, and the second result for R(U, h) is obtained analogously since

inf p U ℓ(h, p) = sup p U 1 Z h(y|x) dp(x, y)

inf µ 1 τ Tµ + λT|µ| + sup x X,y Y {Φ(x, y)Tµ + h(y|x)}

that leads to the expression in (20) changing the notation for the variable in the optimization from µ to µ.

Mazuelas, Romero, and Gr unwald

Appendix F. Proof of Theorem 7

Proof Let U be the uncertainty set in (22). It is clear that p U , then using Theorem 5

R(h U) R(U , h U) = min µ 1 τ T µ + sup x X,y Y {Φ(x, y)Tµ h U(y|x)}

1 τ T µ + sup x X,y Y {Φ(x, y)Tµ h U(y|x)}

1 τ T µ + sup x X,C Y

P y C Φ(x, y)Tµ 1

= R(U) + (τ τ )Tµ λT|µ |

where (48) is due to the deﬁnition of h U. Analogously,

R(h U) R(U , h U) = max µ 1 τ T µ + inf x X,y Y{Φ(x, y)Tµ h U(y|x)}

1 τ T µ + inf x X,y Y{Φ(x, y)Tµ h U(y|x)}

= R(U) + (τ τ )Tµ + λT|µ|.

For inequality (25), note that from (48) and using the deﬁnition of µ we have that

R(h U) 1 τ Tµ + sup x X,C Y

P y C Φ(x, y)Tµ 1

|C| + λT|µ | λT|µ | + (τ τ )µ

= RΦ + (τ τ)T(µ µ ) + λT(|µ | |µ |)

Finally, the results in (26) and (27) are obtained analogusly as the previous result using U instead of U .

Appendix G. Proof of Theorem 8

Proof In the ﬁrst step of the proof we show that if Vn is the uncertainty set

Vn = {p (X Y) : τ n λn Ep{Φn} τ n + λn}

with τ n = Ep {Φn} and λn satisfying condition (4) in the theorem s statement, then, we have that {R(Vn)} tends to RBayes with probability one for any underlying distribution p . In the second step of the proof we obtain the result using Theorem 7 and the Borel-Cantelli Lemma. For the ﬁrst step, we have that R(Vn) for n = 1, 2, . . . is a non-increasing sequence because for n = 1, 2, . . . the sequence Dn is non-decreasing and any component of λn is non-increasing. In addition, R(Vn) for n = 1, 2, . . . is lower bounded by RBayes since p Vn

for any n. Then, the ﬁrst step in the proof is obtained showing that RBayes is the largest lower bound and using the monotone convergence theorem. Speciﬁcally, in case there exists L R such that for all n R(Vn) L > RBayes = R({p })

Minimax Risk Classifiers with 0 -1 Loss

then, it would exist a distribution p = p with p Vn for all n because Vn1 Vn2 if n1 n2. Since p , p Vn for all n, we have that

2λn Ep {Φn} Ep {Φn} 2λn (49)

In particular, for all n Ep ey Ep ey 2 λn

so that p y = p y. Using again (49) and the deﬁnition of Φn, denoting Ψn(x) = [ψv1(x), ψv2(x), . . . , ψv Dn(x)]T we get that

1 Dn Ep {ey Ψn} Ep {ey Ψn} 2 2 4|Y| λn 2 n 0

because any component of λn tends to 0. Therefore, for all y Y we have that

x X Ψn(x)dp (x, y) Z

x X Ψn(x)dp (x, y) 2

so that for j Y such that p y(j) = 0

x X Ψn(x)dp x|y=j Z

x X Ψn(x)dp x|y=j 2

Ψn(x)TΨn(x )

Dn d(p x|y=j p x|y=j)d(p x |y=j p x |y=j) n 0.

Using the law of large numbers we have that with probability one

Ψn(x)TΨn(x )

Dn n k(x, x )

x X,x X k(x, x )d(p x|y=j p x|y=j)d(p x |y=j p x |y=j) = 0

Z k(x, )dp x|y=j Z k(x, )dp x|y=j

for H the RKHS given by kernel k. But equality (50) contradicts p = p because k is a characteristic kernel and p y = p y for all y Y. As a consequence of the previous result, we also get that RΦn converges with probability one to RBayes since the smallest minimax risk satisﬁes RΦn = R(Un ) with

Un = {p (X Y) : Ep{Φn(x, y)} = τ n }

that coincides with Vn above taking λn = 0 for all n. For the second step, if λn and Dn satisfy the two additional conditions in the theorem s statement, let N0 be an integer such that any component of λn is larger than

2 log(|Y|(Dn + 1)) + 2 log 2n2

Mazuelas, Romero, and Gr unwald

for any n N0. Such N0 exists since any component of λn p

n/ log n tends to and Dn = O(nk) for some k > 0. Then, for n N0 we have that p Un with probability at least 1 1/n2 because using Hoeﬀding s inequality we have that

2 log(|Y|(Dn + 1)) + 2 log(2/(1/n2))

with probability at least 1 1/n2. Therefore, since Vn satisﬁes R2.1, using Lemma 11 we have that with probability at least 1 1/n2

R(hn) R(Un) inf µ 1 τ T nµ + ϕ(µ) + λT n|µ|

1 τ T n µn + ϕ( µn) + λT n| µn| = R(Vn) + (τ n τ n)T µn

where µn is the solution of (6) for τ = τ n and λ = λn. If λn is the smallest component of λn, we have that µn 1 1/λn because

0 1 (τ n )T µn + ϕ( µn) + λT n| µn| = R(Vn) 1

1 (τ n )T µn + ϕ( µn) R(Un ) 0.

Therefore, with probability at least 1 1/n2

R(hn) R(Vn) + τ n τ n µn 1 R(Vn) + C

2 log(|Y|(Dn + 1)) + 2 log 2n2

Let ε > 0 and consider N N0 such that for any n N

R(Vn) < RBayes + ε

2 log(|Y|(Dn + 1)) + 2 log 2n2

Such N exists because

R(Vn) RBayes and nλn log n

when n tends to inﬁnity, and Dn = O(nk). Therefore, X

n 1 P{R(hn) RBayes ε} N 1 + X

so that the result follows using the Borel-Cantelli Lemma.

Minimax Risk Classifiers with 0 -1 Loss

Appendix H. Proof of Theorem 9

Proof We ﬁrst show that optimization problems (20) and (21) using Xs instead of X are equivalent to those using subsets of X that cover most of the probability mass of the underlying distribution. We then prove that the probabilities of error in such subsets are near the probabilities of error in all the set X. Let µu be a solution of the optimization problem

min µ 1 τ Tµ + max x Xs,y Y Φ(x, y)Tµ hs(y|x) + λT|µ| (51)

and µl be a solution of the optimization problem

max µ 1 τ Tµ + min x Xs,y Y Φ(x, y)Tµ hs(y|x) λT|µ| (52)

Both µu and µl exist because Xs is ﬁnite and Rs(U) is ﬁnite. If Xu and Xl are the sets

Xu = n x X : Φ(x, y)Tµu hs(y|x) max x Xs

Φ(x, y)Tµu hs(y|x) , y Y o

Xl = n x X : Φ(x, y)Tµl hs(y|x) min x Xs

Φ(x, y)Tµl hs(y|x) , y Y o

we have that optimization problem (51) is equivalent to that obtained substituting Xs by Xu because the objective function of the former problem is a lower bound for the latter and both coincide in µu as a direct consequence of the deﬁnition of Xu. Similarly, (52) is equivalent to that obtained substituting Xs by Xl. We next proof that with probability at least 1 δ over the randomness of Xs, the sets Xu and Xl have probability larger than 1 εs with respect to the probability distribution p (x). For each j Y, let µu,j and µl,j be the vectors of size |Y|(m+1) formed by concatenating the vectors [µT u , 1]TI{y = j} and [µT l , 1]TI{y = j} for y = 1, 2, . . . , |Y|, respectively. In addition, for each j Y, let du,j, dl,j R be given by

du,j = max x Xs

Φ(x, j)Tµu hs(j|x)

dl,j = min x Xs

Φ(x, j)Tµl hs(j|x) .

Then, denoting u(x) = [Φ(x, 1)T, hs(1|x), Φ(x, 2)T, hs(2|x), . . . , Φ(x, |Y|)Ths(|Y||x), ]T R|Y|(m+1) we have that

Xu = {x X : u(x)Tµu,j du,j, j Y}

Xl = {x X : u(x)Tµl,j dl,j, j Y}.

Hence, if A is the set of half spaces in R|Y|(m+1), and S(A, s) denotes the s-th shatter coeﬃcient of A (see e.g., Theorem 12.5 in Devroye et al. (1996)), we have that with probability at least 1 δ

E{I{u(x)Tµu,j du,j}} 1

i=1 I{u(xi)Tµu,j du,j}

32(log 8S(A, s) + log |Y|

Mazuelas, Romero, and Gr unwald

E{I{u(x)Tµl,j dl,j}} 1

i=1 I{u(xi)Tµl,j dl,j}

32(log 8S(A, s) + log |Y|

for all j Y. Therefore, using the union bound and the fact that S(A, s) 2(s 1)(m+1)|Y| + 2 4s(m+1)|Y| (see e.g., Corollary 13.1 in Devroye et al. (1996)) we get that with probability at least 1 δ over the randomness of Xs, the sets Xu and Xl have probability larger than 1 εs. For the last step of the proof, let R|Z(h) denote the risk of rule h restricted to Z X, that is,

x Z,y Y ℓ(h, (x, y))dp |Z(x, y)

for p |Z the probability measure corresponding to p restricted to Z, that is

p |Z(x, y) =

p (Z) if x Z 0 otherwise.

Then, with probability at least 1 δ, we have that

R|Xl(hs) εs R(hs) R|Xu(hs) + εs (53)

because for any rule h and set Z X

x X,y Y ℓ(h, (x, y))dp (x, y)

x Z,y Y ℓ(h, (x, y))dp |Z(x, y) + Z

x X\Z,y Y ℓ(h, (x, y))dp (x, y)

= R|Z(h) (1 p (Z))R|Z(h) + Z

x X\Z,y Y ℓ(h, (x, y))dp (x, y)

R(h) R|Z(h) + (1 p (Z))(1 R|Z(h)) R|Z(h) + (1 p (Z))

R(h) R|Z(h) (1 p (Z))

because 0 ℓ(h, (x, y)) 1 for any x and y. Taking U1 = {p (Xu Y) : |EpΦ τ| λ + 2Cεs1}

U2 = {p (Xl Y) : |EpΦ τ| λ + 2Cεs1}

we have that p U implies that p |Xu U1 and p |Xl U2 with probability at least 1 δ. Then, using Theorem 5 we have that with probability at least 1 2δ

R|Xu(hs) sup p U1 ℓ(hs, p) inf µ 1 τ Tµ + sup x Xu,y Y {Φ(x, y)Tµ hs(y|x)} + λT|µ| + 2Cεs µ 1

1 τ Tµu + sup x Xu,y Y {Φ(x, y)Tµu hs(y|x)} + λT|µu| + 2Cεs µu 1

= 1 τ Tµu + max x Xs,y Y{Φ(x, y)Tµu hs(y|x)} + λT|µu| + 2Cεs µu 1

= Rs(U) + εs2C µu 1 (54)

Minimax Risk Classifiers with 0 -1 Loss

using the deﬁnition of Xu and Corollary 6. In addition, we have that with probability at least 1 2δ

R|Xl(hs) inf p U2 ℓ(hs, p) sup µ 1 τ Tµ + inf x Xl,y Y{Φ(x, y)Tµ hs(y|x)} λT|µ| 2Cεs µ 1

1 τ Tµl + inf x Xl,y Y{Φ(x, y)Tµl hs(y|x)} λT|µl| 2Cεs µl 1

= 1 τ Tµl + min x Xs,y Y{Φ(x, y)Tµl hs(y|x)} λT|µl| 2Cεs µl 1

= Rs(U) εs2C µl 1 (55)

using the deﬁnition of Xl. Therefore, the result is obtained since inequalities (53), (54), (55) are simultaneously satisﬁed with probability at least 1 2δ.

Appendix I. Proof of Theorem 10

Proof Let α, G, and H be given as in Algorithm 2. If g(µ) is the subgradient given by (32), we have that

F (µ cg(µ)) + b = F µ c a + λ sign(µ) + coli(µ)(F) + b

= Fµ + b c α + Hsign(µ ) + coli(µ)(G) c H sign(µ) sign(µ )

for any µ Rm and c R. Then, using induction it is straightforward to show that in Algorithm 2 we have that

yk+1 = µk ckg(µk),

µk+1 = yk+1 + θk(θ 1 k 1)(yk+1 yk)

wk+1 = Fyk+1 + b

vk+1 = Fµk+1 + b

ik+1 = arg max Fµk+1 + b

for all k. Therefore the sequences {µk} generated by Algorithms 1 and 2 are identical. To prove the second claim, note that the computational complexity of Algorithm 1 in each iteration is given by the multiplication Fµk+1+b in step 6, which has a time complexity of O(pm). On the other hand, the computational complexity of Algorithm 2 in each iteration is determined by steps 7-14 which have a time complexity of O(pmγ( k)) where γ( k) denotes the fraction of non-zero components of vector k = sign(µk+1) sign(µk).

Mazuelas, Romero, and Gr unwald

Charalambos D. Aliprantis and Kim C. Border. Inﬁnite dimensional analysis. Springer Verlag, Berlin, 1994.

Ver onica Alvarez, Santiago Mazuelas, and Jose A. Lozano. Minimax classiﬁcation under concept drift with multidimensional adaptation and performance guarantees. In Proceedings of the 39th International Conference on Machine Learning, pages 486 499, 17 23 Jul 2022.

Amiran Ambroladze, Emilio Parrado-Hern andez, and John Shawe-Taylor. Tighter PACBayes bounds. In Advances in Neural Information Processing Systems, volume 19, pages 9 16, 2007.

Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classiﬁcation. In Conference on Uncertainty in Artiﬁcial Intelligence, pages 92 101, 2015.

Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1 38, 2017.

G okhan Bakir, Thomas Hofmann, Bernhard Sch olkopf, Alexander J. Smola, and Ben Taskar. Predicting structured data. MIT press, 2007.

Akshay Balsubramani and Yoav Freund. Optimally combining classiﬁers using unlabeled data. In Proceedings of The 28th Conference on Learning Theory, volume 40, pages 211 225, Paris, France, 03 06 Jul 2015.

Akshay Balsubramani and Yoav Freund. Optimal binary classiﬁer aggregation for general losses. In Advances in Neural Information Processing Systems, volume 29, 2016.

Peter L. Bartlett, Michael I. Jordan, and Jon D. Mc Auliﬀe. Convexity, classiﬁcation, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006.

Luc B egin, Pascal Germain, Fran cois Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based on the R enyi divergence. In Proceedings of the 19th International Conference on Artiﬁcial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 435 444, May 2016.

Shai Ben-David, Nadav Eiron, and Philip M. Long. On the diﬃculty of approximately maximizing agreements. J. Comput. Syst. Sci., 66(3):496 514, May 2003.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798 1828, March 2013.

Dimitri P. Bertsekas. Convex optimization algorithms. Athena Scientiﬁc, Belmont, MA, 2015.

Kartheek Bondugula, Ver onica Alvarez, Jose I. Segovia-Mart ın Santiago Mazuelas, and Aritz P erez. MRCpy: A library for minimax risk classiﬁers. ar Xiv preprint, ar Xiv:2108.01952, 2023.

Minimax Risk Classifiers with 0 -1 Loss

Jonathan M. Borwein and Qiji J. Zhu. Techniques of variational analysis. Springer, Berlin, 2004.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

Peter B uhlmann and Sara van de Geer. Statistics for high-dimensional data. Springer, Heidelberg, 2011.

Emmanuel J. Cand es, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted L1 minimization. Journal of Fourier Analysis and Applications, 14:877 905, 2008.

Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. Universal multi-task kernels. Journal of Machine Learning Research, 9(52):1615 1646, 2008.

Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Umar Syed. Structural maxent models. In Proceedings of the 32nd International Conference on Machine Learning, pages 391 399, July 2015.

Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(19):551 585, 2006.

Marie Devaine, Pierre Gaillard, Yannig Goude, and Gilles Stoltz. Forecasting electricity consumption by aggregating specialized experts - a review of the sequential aggregation of specialized experts, with an application to Slovakian and French country-wide one-dayahead (half-)hourly predictions. Machine Learning, 90(2):231 260, 2013.

Luc Devroye, L aszl o Gy orﬁ, and G abor Lugosi. A probabilistic theory of pattern recognition. Springer Verlag, Berlin, 1996.

Dheeru Dua and Casey Graﬀ. UCI Machine Learning Repository, 2017. URL http: //archive.ics.uci.edu/ml.

John Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. Journal of Machine Learning Research, 20(68):1 55, 2019.

John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 46(3): 946 969, 2021.

Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in computational mathematics, 13(1):1 50, 2000.

Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, volume 29, pages 4240 4248, 2016.

Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian D. Ziebart. Adversarial multiclass classiﬁcation: A risk minimization perspective. In Advances in Neural Information Processing Systems, volume 29, 2016.

Mazuelas, Romero, and Gr unwald

Vitaly Feldman, Venkatesan Guruswami, Prasad Raghavendra, and Yi Wu. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558 1590, 2012.

Charlie Frogner, Sebastian Claici, Edward Chien, and Justin Solomon. Incorporating unlabeled data into distributionally robust learning. Journal of Machine Learning Research, 22(56):1 46, 2021.

Pascal Germain, Alexandre Lacasse, Francois Laviolette, Mario March, and Jean-Francis Roy. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research, 16(26):787 860, 2015.

Peter Gr unwald, Thomas Steinke, and Lydia Zakynthinou. PAC-bayes, MAC-bayes and conditional mutual information: Fast rate bounds that handle general VC classes. In Proceedings of Thirty Fourth Conference on Learning Theory, pages 2217 2247, 2021.

Peter D. Gr unwald and A. Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367 1433, 2004.

Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2019.

Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1 40, April 2016.

John Langford. Tutorial on practical prediction theory for classiﬁcation. Journal of Machine Learning Research, 6(10):273 306, 2005.

Guy Lebanon and John Laﬀerty. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems, volume 14, pages 447 454, 2001.

Jaeho Lee and Maxim Raginsky. Minimax statistical learning with Wasserstein distances. In Advances in Neural Information Processing Systems, volume 31, pages 2692 2701, 2018.

David G. Luenberger. Optimization by vector space methods. John Wiley & Sons, New York, 1997.

G abor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145 1190, October 2019.

Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 73 83, 2009.

Santiago Mazuelas and Aritz Perez. General supervision via probabilistic transformations. In 24th European Conference on Artiﬁcial Intelligence-ECAI 2020, pages 1348 1354, August 2020.

Minimax Risk Classifiers with 0 -1 Loss

Santiago Mazuelas, Andrea Zanoni, and Aritz P erez. Minimax classiﬁcation with 0-1 loss and performance guarantees. In Advances in Neural Information Processing Systems, volume 33, pages 302 312, 2020.

Santiago Mazuelas, Yuan Shen, and Aritz P erez. Generalized maximum entropy for supervised classiﬁcation. IEEE Trans. Inf. Theory, 68(4):2530 2550, April 2022.

Zakaria Mhammedi, Peter Gr unwald, and Benjamin Guedj. PAC-Bayes un-expected Bernstein inequality. In Advances in Neural Information Processing Systems, volume 32, pages 12202 12213, 2019.

Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(95):2651 2667, 2006.

Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pages 124 138, 2012.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, Cambridge, MA, second edition, 2018.

Christine De Mol, Ernesto De Vito, and Lorenzo Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201 230, 2009.

Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Sch olkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1 141, 2017.

Yurii Nesterov. A method of solving a convex programming problem with convergence rate O 1

k2 . Soviet Math. Doklady, 269(3):543 547, 1983.

Yurii Nesterov. Subgradient methods for huge-scale optimization problems. Mathematical Programming, pages 275 297, 2014.

Yurii Nesterov and Vladimir Shikhman. Quasi-monotone subgradient methods for nonsmooth convex minimization. Journal of Optimization Theory and Applications, 165(3):917 940, June 2015.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 20, 2008.

Soroosh Shaﬁeezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, volume 28, pages 1576 1584, 2015.

Soroosh Shaﬁeezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1 68, 2019.

Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University press, New York, 2014.

Mazuelas, Romero, and Gr unwald

Matthew Staib and Stefanie Jegelka. Distributionally robust optimization and generalization in kernel methods. In Advances in Neural Information Processing Systems, volume 32, pages 9134 9144, 2019.

Ingo Steinwart. Consistency of support vector machines and other regularized kernel classiﬁers. IEEE Transactions on Information Theory, 51(1):128 142, 2005.

Wei Tao, Zhisong Pan, Gaowei Wu, and Qing Tao. The strength of Nesterov s extrapolation in the individual convergence of nonsmooth optimization. IEEE Transactions on Neural Networks and Learning Systems, 31(7):2557 2568, 2020.

Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classiﬁcation methods. Journal of Machine Learning Research, 8(36):1007 1025, 2007.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(50):1453 1484, 2005.

Vladimir Vapnik. Statistical learning theory. Wiley, New York, 1998.

Martin J. Wainwright. High-dimensional statistics: a non-asymptotic viewpoint. Cambridge University Press, Cambridge, United Kingdom, 2019.

Ian Waudby-Smith and Aaditya Ramdas. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, February 2023.

Tianbao Yang and Qihang Lin. RSG: Beating subgradient method without smoothness and strong convexity. Journal of Machine Learning Research, (19):1 33, 2018.

Tong Zhang. Statistical behavior and consistency of classiﬁcation methods based on convex risk minimization. The Annals of Statistics, 32(1):56 85, February 2004.

Hui Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418 1429, 2006.