# minimax_risk_classifiers_with_01_loss__8731037f.pdf Journal of Machine Learning Research 24 (2023) 1-48 Submitted 4/22; Revised 2/23; Published 7/23 Minimax Risk Classifiers with 0 -1 Loss Santiago Mazuelas smazuelas@bcamath.org Basque Center for Applied Mathematics (BCAM) IKERBASQUE-Basque Foundation for Science Bilbao 48009, Spain Mauricio Romero msicre@ufba.br Federal University of Bahia Ondina 40170, Brazil Peter Gr unwald Peter.Grunwald@cwi.nl National Research Institute for Mathematics and Computer Science (CWI) Amsterdam 94079, Netherlands Editor: Sivan Sabato Supervised classification techniques use training samples to learn a classification rule with small expected 0 -1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0 -1 loss and considering specific families of rules (hypothesis classes). This paper presents minimax risk classifiers (MRCs) that minimize the worst-case 0 -1 loss with respect to uncertainty sets of distributions that can include the underlying distribution, with a tunable confidence. We show that MRCs can provide tight performance guarantees at learning and are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes efficient optimization techniques for MRC learning and shows that the methods presented can provide accurate classification together with tight performance guarantees in practice. Keywords: Supervised Classification, Robust Risk Minimization, Performance Guarantees, Generalized Maximum Entropy 1. Introduction Supervised classification techniques use training samples to learn a classification rule that assigns labels to instances with small expected 0 -1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0 -1 loss and considering specific families of rules (hypothesis classes). The surrogate losses are usually taken to be convex upper bounds of the 0 -1 loss such as hinge loss, logistic loss, and exponential loss, see e.g., Bartlett et al. (2006). The families of rules considered are usually given by parametric functions such as those defined by neural networks (NNs) and those belonging to reproducing kernel Hilbert spaces (RKHSs), see e.g., Shalev-Shwartz and Ben-David (2014). Such techniques can result in strongly universally consistent methods using surrogate losses that are classification calibrated and Lipschitz (Bartlett et al., 2006; Tewari and Bartlett, 2007) together with rich families of rules such as c 2023 Santiago Mazuelas, Mauricio Romero, and Peter Gr unwald. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0339.html. Mazuelas, Romero, and Gr unwald those given by functions in RKHSs corresponding with universal kernels (Micchelli et al., 2006; Steinwart, 2005). Most learning methods are based on the empirical risk minimization (ERM) approach that minimizes the empirical expected loss of training samples (see e.g., Vapnik (1998); Mohri et al. (2018); Shalev-Shwartz and Ben-David (2014)). Other methods are based on the robust risk minimization (RRM) approach that minimizes the worst-case expected loss with respect to an uncertainty set of distributions (see e.g., Asif et al. (2015); Shafieezadeh-Abadeh et al. (2019); Duchi and Namkoong (2019)). These methods correspond with generalized maximum entropy techniques as shown in Mazuelas et al. (2022) using the theoretical framework in Gr unwald and Dawid (2004). RRM methods mainly differ in the type of uncertainty set considered. These sets are determined by constraints over probability distributions given in terms of metrics such as f-divergence (Duchi and Namkoong, 2019), Wasserstein distances (Shafieezadeh-Abadeh et al., 2019), moments fits (Asif et al., 2015), and maximum mean discrepancies (Staib and Jegelka, 2019). In addition, the uncertainty sets considered often only include probability distributions with instances marginal that coincides with the empirical marginal of training samples (Asif et al., 2015; Farnia and Tse, 2016; Fathony et al., 2016; Cortes et al., 2015). An important advantage of RRM methods is that the function minimized at learning can be an upper bound for the expected loss if the true underlying distribution is included in the uncertainty set considered. In such cases, RRM techniques automatically ensure out-ofsample generalization and provide performance guarantees at learning. The uncertainty sets considered by existing techniques can include the true underlying distribution in methods that address a transductive setting (Balsubramani and Freund, 2015, 2016) or utilize Wasserstein distances (Shafieezadeh-Abadeh et al., 2019; Lee and Raginsky, 2018; Frogner et al., 2021) and maximum mean discrepancies (Staib and Jegelka, 2019). However, uncertainty sets formed by distributions with instances marginal that coincides with the empirical do not include the true underlying distribution for finite sets of training samples. The original 0 -1 loss is utilized by certain RRM techniques (Asif et al., 2015; Farnia and Tse, 2016; Fathony et al., 2016; Balsubramani and Freund, 2015) which provided inspiration for the present work. Most of these pioneering techniques considered uncertainty sets of distributions with instances marginal that coincides with the empirical of training samples, and therefore they do not provide tight performance guarantees at learning. Such performance guarantees are provided by PAC-Bayes methods (Ambroladze et al., 2007; Germain et al., 2015; Mhammedi et al., 2019) and RRM techniques in transductive settings (Balsubramani and Freund, 2015) or based on Wasserstein distances (Shafieezadeh-Abadeh et al., 2019; Lee and Raginsky, 2018). PAC-Bayes methods consider specific classification rules such as margin-based and ensemble-based classifiers. In addition, the tightness of their performance bounds relies on that of multiple inequalities including Jensen s, Markov s, and Donsker-Varadhan s change of measure (B egin et al., 2016). Wasserstein-based RRM techniques utilize surrogate losses and consider specific families of rules. In addition, the tightness of their performance bounds relies heavily on the adequacy of the Wasserstein radius used (Shafieezadeh-Abadeh et al., 2019; Frogner et al., 2021). This paper presents minimax risk classifiers (MRCs) that minimize the worst-case 0 -1 loss over general classification rules and provide tight performance bounds at learning. Specifically, the main results presented in the paper are as follows. Minimax Risk Classifiers with 0 -1 Loss We develop learning methods that obtain classification rules with the smallest worstcase expected 0 -1 loss with respect to uncertainty sets that can include the underlying distribution, with a tunable confidence (Section 2.2). In addition, we detail how to determine uncertainty sets from training data by estimating the expectation of a feature mapping and obtaining confidence vectors for such estimates (Section 3). We characterize the performance guarantees of MRCs in terms of tight upper and lower bounds for the error probability and also in terms of generalization bounds with respect to the smallest minimax risk (Sections 4.1 and 4.2). In addition, we show that MRCs are strongly universally consistent using feature mappings given by characteristic kernels (Section 4.3). We present efficient optimization techniques for MRC learning that obtain the classifiers parameters and the tight performance bounds using reduced sets of instances and efficient accelerated subgradient methods (Section 5). In addition, we quantify MRCs performance with respect to existing techniques and show the suitability of the performance bounds presented (Section 6). Some of the results presented in this paper have appeared before in Mazuelas et al. (2020). The main new results presented in this paper include: extension to instances sets that are general Borel subsets; analysis of the generalization capabilities of the approach presented even in cases where the underlying distribution is not included in the uncertainty set considered; universal consistency of the methods proposed using rich feature mappings; and efficient optimization techniques based on accelerated subgradient methods. Specifically, the proofs of the theoretical results are extended to allow for infinite sets of instances instead of finite sets by using Fenchel duality instead of Lagrange duality. The MRCs generalization bounds are extended in new Theorem 7 to cases when the uncertainty set does not include the underlying distribution, and new Theorem 8 shows the universal consistency of MRCs that use feature mappings given by characteristic kernels. In addition, new Theorems 9 and 10 show that MRC learning can be efficiently addressed using reduced sets of instances and subgradient methods that exploit the specific structure of the optimization problems in MRC learning. Notation: calligraphic upper case letters denote sets, e.g., Z; |Z| denotes de cardinality of set Z; vectors and matrices are denoted by bold lower and upper case letters, respectively, e.g., v and M; for a vector v, v , v 1, and v denote its L2-, L1-, and L-infinity norms, respectively, v T denotes its transpose, v(i) denotes its i-th component, |v| and v+ denote the vector given by the component-wise absolute value and positive part of v, respectively, and diag(v) denotes the diagonal matrix with diagonal given by v; probability distributions and classification rules are denoted by upright fonts, e.g., p and h; Ez p or simply Ep denotes the expectation w.r.t. probability distribution p of random variable z; for a probability distribution p over X Y, we denote by px, py its corresponding marginals over X and Y, respectively, and by px|y the corresponding conditional distribution given y Y; and denote vector (component-wise) inequalities; 1 denotes a vector with all components equal to 1 and I{ } denotes the indicator function; finally, for a function of two variables f : X Y Z, f(x, ) denotes the function f(x, ) : Y Z obtained by fixing x X. Mazuelas, Romero, and Gr unwald 2. Minimax risk classifiers This section first states the problem of supervised classification and summarizes different learning approaches, then we describe MRCs with 0 -1 loss and their relationship with existing techniques. 2.1 Problem formulation and learning approaches Classification techniques assign instances in a set X to labels in a set Y, where X is a Borel subset of Rd and Y is a finite set represented by {1, 2, . . . , |Y|}. We denote by (X Y) the set of Borel probability measures p on X Y, also referred to as probability distributions. Both randomized and deterministic classification rules are given by functions from instances to probability distributions on labels (Markov transitions). We denote the set of all classification rules by T(X, Y), and for h T(X, Y) we denote by h(y|x) the probability with which instance x X is classified by label y Y (h(y|x) {0, 1} if labels are deterministically assigned to instances). Supervised classification techniques use n training instance-label pairs (x1, y1), (x2, y2), . . . , (xn, yn) from the underlying distribution p to find classification rules that assign labels to instances with small error probability. If p (X Y) is the true underlying distribution of instance-label pairs, the error probability of a classification rule h T(X, Y) is its expected 0 -1 loss denoted by R(h), that is R(h) = Ep {ℓ(h, (x, y))} with ℓ(h, (x, y)) = 1 h(y|x) the 0 -1 loss of rule h at instance-label pair (x, y). In the following, we overload the notation for the loss function and denote the expected loss of h with respect to probability distribution p as ℓ(h, p) := Ep{ℓ(h, (x, y))}. The optimal classification rule is the solution of the optimization problem P : inf h T(X,Y) ℓ(h, p ). (1) This solution is known as Bayes classification rule and the minimum value above is known as Bayes risk. Optimization problem P cannot be addressed in practice since the underlying distribution p is unknown and only training samples (x1, y1), (x2, y2), . . . , (xn, yn) from p are available. Existing supervised classification techniques can be interpreted as approximations of P by an optimization problem of the form P : inf h F sup p U L(h, p) (2) for F a family of classification rules, U an uncertainty set of distributions, and L a surrogate loss function. The ERM approach considers uncertainty sets of distributions that contain only the empirical distribution of training samples. In such an approach, the approximation of P by P is controlled by the choice of the family F T(X, Y) through a bias-complexity trade-offaddressed by structural risk minimization (SRM) (Vapnik, 1998). Families of rules reduced enough to provide uniform convergence of empirical averages can ensure that Minimax Risk Classifiers with 0 -1 Loss the objective function in P uniformly approximates that of P , i.e., empirically averaged losses accurately approximate expected losses for all the rules considered. If the family of rules is also general enough to contain a classification rule near the Bayes rule, then the minimum value of P is similar to the minimum value of P and the ERM approach leads to near-optimal performance. This trade-offfor the generality of the family of rules considered is usually controlled by selecting a parameter that determines the size of the family of rules, e.g., the radius of the RKHS ball of functions that defines the family of rules. The RRM approach considers uncertainty sets of distributions determined by constraints obtained from training samples. In such an approach, the approximation of P by P is controlled by the choice of the uncertainty set U (X Y). Uncertainty sets general enough to contain the underlying distribution can ensure that the objective function in P upper bounds that of P , i.e., the worst-case expected loss upper bounds the actual expected loss for any classification rule. If the uncertainty set is also reduced enough to provide a tight upper bound, then the minimum value of P is similar to the minimum value of P and the RRM approach leads to near-optimal performance. This trade-offfor the generality of the uncertainty set considered is usually controlled by selecting a parameter that determines the size of the uncertainty set, e.g., the radius of the Wasserstein ball that defines the uncertainty set. An important advantage of the RRM approach with respect to ERM is that the former does not require to constrain the classification rules considered in order to provide provable out-of-sample generalization. Such property is due to the fact that the optimum of P for RRM upper bounds the expected loss with respect to any distribution in the uncertainty set, independently of the family of rules considered. On the other hand, the optimum of P for ERM is ensured to be near the out-of-sample risk only if the family of classification rules considered provides uniform convergence of empirical averages. Optimization problem P not only has to reliably approximate P but also has to be tractable computationally. Such tractability is commonly achieved by 1) considering families of classification rules determined by certain parameters (e.g., weights of neurons in NNs or coefficients of linear classifiers), and 2) substituting the 0 -1 loss ℓby a surrogate loss L (e.g., hinge-loss or logistic-loss). The usage of non-parametric classification rules can lead to huge-scale optimization problems (a general rule h T(X, Y) is determined by |X||Y| values), and the minimization of 0 -1 loss is often NP-hard (see e.g., Ben-David et al. (2003); Feldman et al. (2012)). In the following we show how the proposed MRCs approximate the optimization problem P in (1) by an optimization problem of the form PMRC : inf h T(X,Y) sup p U ℓ(h, p) (3) for an uncertainty set U that can include the underlying distribution with a tunable confidence. MRCs do not rely on a choice of surrogate loss and family of rules. The only change in PMRC with respect to P consists on using an uncertainty set U instead of the underlying distribution p . Certain previous RRMs methods also address an optimization problem of the form (3) (Asif et al., 2015; Fathony et al., 2016). However, the uncertainty sets considered by such methods only include distributions with instances marginals that coincide with the Mazuelas, Romero, and Gr unwald empirical. With such an additional constraint for the uncertainty set, these approaches become equivalent to ERM with a surrogate loss referred to as adversarial zero-one loss. Therefore, their out-of-sample generalization properties are akin to those of other ERM methods and cannot exploit the RRM s benefits of approximating P by an upper bound. 2.2 MRCs with 0 -1 loss The classification loss ℓ(h, (x, y)) of rule h at instance-label pair (x, y) quantifies the loss of the rule evaluated at instance x X when the label is y Y. In this paper we consider 0 -1 loss that is given by ℓ(h, (x, y)) = 1 h(y|x), while MRCs for general loss functions are described in (Mazuelas et al., 2022). The usage of 0 -1 loss is specially suitable for discriminative approaches since it quantifies the classification error, while other loss functions such as logistic loss can be more suitable for conditional probability estimation since they score probability assessments. Specifically, if p is the true underlying distribution of instance-label pairs, the expected 0 -1 loss ℓ(h, p ), also referred to as the risk R(h), coincides with the error probability of classification rule h. The proposed MRCs minimize the worst-case expected 0 -1 loss with respect to distributions in uncertainty sets that can contain the true underlying distribution with a tunable confidence. These uncertainty sets are given by constraints on the expectations of a vector-valued bounded and Borel measurable function Φ : X Y Rm referred to as feature mapping, e.g., multiple polynomials on x and y or one-hot encodings of the last layers in an NN. Such mappings are commonly used in machine learning to represent instance-label pairs as real vectors (see e.g., Mohri et al. (2018); Bengio et al. (2013) and next Section 3.1). In the following we consider uncertainty sets U = n p (X, Y) : |Ep{Φ} τ| λ o (4) given by a feature mapping Φ together with mean and confidence vectors τ Rm and λ Rm, and satisfying one of the following regularity conditions. R1 The set X is finite and there exists a probability measure p in U. R2 There exists a probability measure p in U such that |Ep{Φ(x, y)(i)} τ (i)| < λ(i) for any i {1, 2, . . . , m} with λ(i) > 0 and R2.1 λ(i) > 0 for all i {1, 2, . . . , m} or R2.2 the support of the r.v. Φ(x, y) for (x, y) p is not contained in a proper affine subspace of Rm. These regularity conditions are utilized to ensure strong duality holds for the inner maximization in the minimax problems, and they are satisfied with wide generality. For instance, such conditions are satisfied if there exists a probability measure p such that Ep{Φ} = τ and {Φ(xi, yi), i = 1, 2, . . . , n} is not contained in a proper subspace of Rm. In addition, if those n points are contained in a proper affine subspace Γ Rm with dim Γ = m < m the feature mapping could be easily modified so that R2.2 is satisfied for instance by using Φ (x, y) = (Φ(x, y) z0)Tv1, (Φ(x, y) z0)Tv2, . . . , (Φ(x, y) z0)Tvm Rm Minimax Risk Classifiers with 0 -1 Loss where Γ = {z0 + α1v1 + α2v2 + . . . + αm vm ; α1, α2, . . . , αm R}. The mean vector τ Rm in (4) is an estimate of the feature mapping expectation Ep {Φ} with respect to the underlying distribution. In this paper, we consider expectation estimates obtained as the sample average i=1 Φ(xi, yi) (5) obtained from the n training samples (x1, y1), (x2, y2), . . . , (xn, yn) that are assumed to be independent samples from the underlying distribution p . Nevertheless, it is important to note that most of the results presented in the paper as well as the general methodology proposed can be utilized with general types of expectation estimates. Alternative estimators for the expectation can be preferred in several practical scenarios including cases with different distributions at training and test (Mohri and Medina, 2012; Mazuelas and Perez, 2020; Alvarez et al., 2022) and cases where the distribution of features has heavy tails (Lugosi and Mendelson, 2019; Hsu and Sabato, 2016). The confidence vector λ Rm in (4) is an estimate of the mean vector component-wise accuracy |Ep {Φ} τ|, and controls the size of the uncertainty set considered. In particular, if λ corresponds to the length of confidence intervals at level 1 δ centered at τ, then the true underlying distribution is included in the uncertainty set (4) with probability at least 1 δ. Section 3.2 describes how to choose such confidence vectors for different feature mappings. Classification rules that minimize the worst-case error probability over uncertainty sets U given by (4) are referred to as MRCs. Definition 1 We say that a classification rule h U is a 0 -1 MRC for uncertainty set U if h U arg inf h T(X,Y) sup p U ℓ(h, p) and we denote by R(U) the minimax risk against U, i.e., R(U) = inf h T(X,Y) sup p U ℓ(h, p). The following result shows how 0 -1 MRCs can be determined by a linear-affine combination of the feature mapping. The coefficients of such combination can be obtained at learning by solving the convex optimization problem Pτ,λ : min µ 1 τ Tµ + ϕ(µ) + λT|µ| (6) ϕ(µ) = sup x X,C Y P y C Φ(x, y)Tµ 1 1. Here, as in the sequel, we implicitly assume that the set C Y over which we take the maximum excludes the empty set. Mazuelas, Romero, and Gr unwald Theorem 2 Let U be an uncertainty set as in (4) that satisfies R1 or R2 and µ be a solution of optimization problem Pτ,λ. If a classification rule h U T(X, Y) satisfies h U(y|x) Φ(x, y)Tµ ϕ(µ ), x X, y Y (8) then h U is a 0 -1 MRC for U. In addition, we have that R(U) = 1 τ Tµ + ϕ(µ ) + λT|µ |. (9) Proof See Appendix B. A classification rule satisfying (8) always exists because for every x X the sum over y Y of the positive part of the right hand side of (8) is not larger than one. Specifically, for any x X, the number Φ(x, y)Tµ ϕ(µ ) is zero or equal to max C Y y C Φ(x, y)Tµ |C|ϕ(µ ) that is not larger than one by definition of ϕ( ) in (7). The characterization of MRCs in terms of the inequality in (8) may seem counter-intuitive because for some pairs (x, y) such classification rules are not uniquely determined. However, such pairs do not affect the worst-case error probability. Specifically, the set of pairs (x, y) where (8) is not satisfied with equality has zero probability under a worst-case distribution in U because for p U U inf h T(X,Y) sup p U ℓ(h, p) = ℓ(h U, p U) 1 Z h U(y|x)dp U(x, y) = 1 τ Tµ + λT|µ | + ϕ(µ ) Z h U(y|x) Φ(x, y)Tµ ϕ(µ ) dp U(x, y) = τ Ep UΦ(x, y) Tµ λT|µ | Z h U(y|x) Φ(x, y)Tµ ϕ(µ ) dp U(x, y) = 0. since p U U τ Ep UΦ(x, y) Tµ λT|µ | 0. In order to avoid ambiguities, we will refer to 0 -1 MRC for uncertainty set U as the classification rule h U obtained by normalizing the positive part of the right hand side of (8). Specifically, let for x X h U(y|x) = (Φ(x, y)Tµ ϕ(µ ))+/cx if cx = 0 1/|Y| if cx = 0 (11) with cx given by (10). Such classification rule is univocally determined by µ and satisfies (8) because cx 1 for any x X. In addition, we denote by h U d the deterministic classification rule corresponding to h U, that is h U d(y|x) = I{y arg max h U( |x)} = I{y arg max Φ(x, )Tµ } (12) Minimax Risk Classifiers with 0 -1 Loss where a tie in the arg max above can be resolved arbitrarily. In the following, we will refer to such classification rule h U d as the deterministic 0 -1 MRC for U. The above Theorem 2 provides a representer theorem for MRCs. The classical representer theorem for ERM over RKHSs (see e.g., Mohri et al. (2018); Evgeniou et al. (2000)) states that the solution of ERM over the set of classification rules given by functions in an RKHS ball is determined by a linear combination of n functions corresponding with the training samples. Such result enables to address the minimization of empirical loss over an infinitedimensional RKHS because it becomes equivalent to a minimization over n parameters. Analogously, Theorem 2 states that the solution of PMRC over the set of all classification rules is determined by a linear-affine combination of the m components of the feature mapping. Even if the optimization problem addressed by MRCs does not impose constraints on the classification rules considered, the feature mapping utilized determines the parametric form of its solutions. Theorem 2 enables to address the minimization of the worst-case error probability over general classification rules since it becomes equivalent to a minimization over the m parameters µ. As shown in Section 5 below, optimization (6) can be efficiently addressed in practice. In particular, Theorem 9 shows that the set X in (7) can be substituted by a reduced set of instances such as that formed by the instances at training. Note also that MRCs are often given by sparse combinations of the components of the feature mapping since the last term in optimization problem (6) imposes an L1-type regularization for parameters. L1-norm regularization is broadly used in machine learning (see e.g., Hastie et al. (2019); Mohri et al. (2018); Mol et al. (2009)) and the regularization parameter used to weight the L1-norm is commonly obtained by cross-validation methods. The result in Theorem 2 can directly provide appropriate regularization parameters from the length of the expectations interval estimates. In addition, the regularization term λT|µ| = Pm i=1 λ(i)|µ(i)| in (6) allows to penalize differently each component of µ. Such type of L1-norm regularization is usually referred to as adaptive or weighted, and has shown to significantly improve performance (Zou, 2006; Cand es et al., 2008). For MRCs, the regularization term causes that feature components with poorly estimated expectations (i.e., components Φ(i) with large λ(i)) have a reduced or null influence on the final classification rule. 2.3 MRCs with 0 -1 loss and fixed marginals The following result shows how existing RRM methods that utilize 0-1 loss correspond to MRCs that use uncertainty sets of distributions with instances marginal given by the empirical distribution. In particular, such MRCs correspond to L1-regularized ERM with a loss referred to as adversarial zero-one in Fathony et al. (2016) or as minimax hinge in Farnia and Tse (2016). For the following result we consider uncertainty sets of the form V = n p (X, Y) : |Ep{Φ} τ| λ and px = pn x o (13) where pn x is the uniform distribution over instances with support {x1, x2, . . . , xn} X for n specific instances. Mazuelas, Romero, and Gr unwald Theorem 3 Let τ, λ Rm be such that the uncertainty set V in (13) is not empty. If µ is a solution of the optimization problem min µ 1 τ Tµ + 1 i=1 ϕ(µ, xi) + λT|µ| (14) ϕ(µ, x) = max C Y P y C Φ(x, y)Tµ 1 and h V is the classification rule h V(y|x) = Φ(x, y)Tµ ϕ(µ , x) +, x X, y Y (15) then, h V is a 0 -1 MRC for V, that is h V arg inf h T(X,Y) sup p V ℓ(h, p). Proof See Appendix C. If the instances {x1, x2, . . . , xn} are those obtained at training, the 0 -1 MRCs for uncertainty sets given by both expectations and marginals constrains as in (13) correspond to existing techniques known as maximum entropy machines (Farnia and Tse, 2016) or zero-one adversarial (Fathony et al., 2016). Specifically, if {(xi, yi)}n i=1 are training samples and τ is given by (5), then optimization problem (14) becomes i=1 max C Y P y C(Φ(xi, y) Φ(xi, yi))Tµ + |C| 1 |C| + λT|µ| that corresponds to L1-regularized ERM with a loss referred to as minimax hinge in Farnia and Tse (2016) or as adversarial zero-one in Fathony et al. (2016). Uncertainty sets V given by (13) do not contain the true underlying distribution for a finite number of training samples since such sets only contain distributions with instances marginal that is uniform over the training instances. Hence, the usage of such uncertainty sets does not allow to obtain the performance guarantees shown in Section 4 below for uncertainty sets given by (4). 3. Uncertainty sets of distributions As described above, MRC classification rules have the smallest worst-case error probability over distributions in an uncertainty set. This set is determined by expectations estimates of a feature mapping. In this section we first describe common feature mappings and then characterize the accuracy of expectation estimates obtained from training samples. Minimax Risk Classifiers with 0 -1 Loss 3.1 Feature mappings Most of the results presented in the paper are valid for general feature mappings Φ : X Y Rm. In this section we describe feature mappings that are common in supervised classification and will be used later in the paper. Such feature mappings are given by real-valued functions over the set of instances ψ : X R, referred to as scalar features. The most common and simple way to define feature mappings over X and Y is to use multiple scalar features over X together with a one-hot encoding of the elements of Y (Tsochantaridis et al., 2005; Crammer et al., 2006; Mohri et al., 2018) as follows Φ(x, y) = [I{y = 1}Ψ(x)T, I{y = 2}Ψ(x)T, . . . , I{y = |Y|}Ψ(x)T]T = ey Ψ(x) (16) where Ψ(x) = [ψ1(x), ψ2(x), . . . , ψD(x)]T for D scalar features ψ1, ψ2, . . . , ψD, ey is the y-th vector in the standard basis of R|Y|, and denotes the Kronecker product. The feature mapping Φ represents each instance-label pair (x, y) by an m-dimensional real vector with m = |Y|D, so that Φ(x, y) is composed by |Y| D-dimensional blocks with values Ψ(x) in the block corresponding to y and zero otherwise. Other feature mappings can exploit relationships among classes as described for instance in Bakir et al. (2007); Caponnetto et al. (2008). Multiple types of scalar features are commonly used in supervised classification including those given by thresholds/decision stumps (Lebanon and Lafferty, 2001), last layers in an NN (Bengio et al., 2013), and random features corresponding to a RKHS (Rahimi and Recht, 2008). Most of the results shown in this paper are valid for general features, while for those showing the universal consistency of MRCs we use random features that embed instances into rich feature spaces corresponding to RKHSs. 3.2 Confidence vectors for expectation estimates In this section, we describe conditions for confidence vectors λ that lead to uncertainty sets given by (4) that contain the true underlying distribution with high probability. Specifically, we consider uncertainty sets U given by feature mappings defined by (16) using scalar features in a family F together with mean vectors obtained as in (5) using the sample average of n independent samples from the underlying distribution p . We denote by λδ any confidence vector that has coverage probability at least 1 δ, i.e., P{|Ep {Φ} τ| λδ} 1 δ, so that the underlying distribution p is included with probability at least 1 δ in the uncertainty set given by λδ. The following result describes such confidence vectors in terms of the cardinality |F|, the empirical variance of the feature mapping, and the Rademacher complexity Rn(F). In most of the results in the paper, we utilize feature mappings given by scalar features chosen before the training samples are observed. In such cases, appropriate confidence vectors are given in terms of the cardinality |F|, i.e., the number of scalar features used. However, the methods proposed can also be used with data-dependent feature mappings. In such cases, appropriate confidence vectors are given in terms of the Rademacher complexity Rn(F) where F is the family of candidate scalar features. For instance, F can be family of all decision stumps and the feature mapping can be defined using the decision stumps found by one-dimensional decision trees learned using the training samples, as in Mazuelas et al. (2020). Mazuelas, Romero, and Gr unwald Theorem 4 Let the mean vector be given by τ = τ n as in (5) for n training samples, and F be a family of bounded scalar features, that is |ψ(x)| C for all ψ F and x X. If |F| < , λ(i) δ for i = 1, 2, . . . , m can be taken as 2 log 2|F||Y|/δ In addition, if υ(i) n is the sample variance of the i-th component of Φ, λ(i) δ for i = 1, 2, . . . , m can be taken as λ(i) δ = 2C 2υ(i) n log 4|F||Y|/δ n + 14C log 4|F||Y|/δ 3(n 1) . (18) If the Rademacher complexity of F satisfies Rn(F) R/ n, λ(i) δ for i = 1, 2, . . . , m can be taken as where j(i) denotes the label for which the i-th component of Φ is non-zero, and nj denotes the number of samples with label j Y. Proof See Appendix D. The previous result shows how to obtain uncertainty sets U as in (4) that contain the true underlying distribution with high probability. Data-based tight confidence vectors can be obtained using sample standard deviations as shown in (18). Note that sample standard deviations can also be used to determine confidence vectors for infinite families of scalar features by using their growth function (Maurer and Pontil, 2009). Theorem 4 shows that confidence vectors λ can decrease at a rate O(1/ n) using general bounded features. Such rate can be obtained similarly with unbounded sub-Gaussian features (Wainwright, 2019) and also with heavy-tailed features by using robust estimators for expectations (Lugosi and Mendelson, 2019). In addition, the order O( p log(|F|)/n) for λ in the above result is consistent with that shown to provide consistent estimators using L1-regularization (B uhlmann and van de Geer, 2011). 4. Performance guarantees This section characterizes the out-of-sample performance of 0 -1 MRCs. We first present techniques that provide tight performance bounds at learning, then we show finite-sample generalization bounds for MRCs. In particular, we show that the proposed techniques can provide strong universal consistency using rich feature representations. 4.1 Tight performance bounds The following result shows that the proposed approach allows to obtain upper and lower bounds for expected losses by solving convex optimization problems. Minimax Risk Classifiers with 0 -1 Loss Theorem 5 Let U be an uncertainty set given by (4), h be any classification rule, and R(U, h) and R(U, h) be given by R(U, h) = sup µ 1 τ Tµ + inf x X,y Y{Φ(x, y)Tµ h(y|x)} λT|µ| (20) R(U, h) = inf µ 1 τ Tµ + sup x X,y Y {Φ(x, y)Tµ h(y|x)} + λT|µ|. (21) Then, for any p U we have that R(U, h) ℓ(h, p) R(U, h). In addition, if U satisfies R1 or R2 the optimal values in (20) and (21) are attained. Proof See Appendix E. The techniques proposed in (Duchi et al., 2021; Shafieezadeh-Abadeh et al., 2015, 2019) obtain upper and lower bounds corresponding with RRM methods that use uncertainty sets defined in terms of f-divergences and Wasserstein distances. Such methods obtain classification rules by minimizing the upper bound of a surrogate expected loss while MRCs minimize the upper bound of the 0 -1 expected loss (error probability). The bounds in (Duchi et al., 2021; Shafieezadeh-Abadeh et al., 2015, 2019) as well as those in Theorem 5 become bounds for the true expected loss (risk) if the uncertainty set considered includes the true underlying distribution. Such situation can be attained with a tunable confidence using uncertainty sets defined by Wasserstein distances as in (Shafieezadeh-Abadeh et al., 2015, 2019) or using the proposed uncertainty sets in (4) with expectation confidence intervals. The above theorem provides lower and upper bounds for the error probability of any classification rule. These bounds can be determined by solving the two convex optimization problems (20) and (21). As shown below, the upper bound for MRCs is obtained as a by-product of the learning process that solves optimization (6). Corollary 6 Let h U be a 0 -1 MRC for an uncertainty set U that satisfies R1 or R2, then R(U, h U) = R(U) given by (9). Proof Straightforward consequence of the above results since R(U, h U) = sup p U ℓ(h U, p) = inf h T(X,Y) sup p U ℓ(h, p) = R(U) In order to simplify notation, we will denote R(U) := R(U, h U) for an MRC h U given by (11). Tight performance bounds can also be obtained at learning for deterministic 0 -1 MRCs. Specifically, such bounds are given by R(U, h U d) and R(U, h U d) that are obtained by solving the two optimization problems (20) and (21). The next section provides generalization bounds for 0 -1 MRCs. Such bounds also ensure generalization for deterministic 0 -1 MRCs because R(h U d) 2R(h U) as a direct consequence of the fact that 1 h U d(y|x) 2(1 h U(y|x)), x, y X Y. Mazuelas, Romero, and Gr unwald 4.2 Finite-sample generalization bounds This section provides MRCs generalization bounds in terms of optimal minimax rules, i.e., MRCs with the smallest minimax risk. Such rules correspond with uncertainty sets given by the true expectation of the feature mapping Φ, that is U = {p (X Y) : Ep{Φ(x, y)} = τ } (22) where the mean vector τ = Ep {Φ(x, y)} is the exact expectation with respect to the underlying distribution. The minimum wost-case error probability (minimax risk) over distributions in U is given by RΦ = min µ 1 τ T µ + ϕ(µ) = 1 τ T µ + ϕ(µ ) corresponding with the MRC given by parameters µ . Such classification rule is referred to as the optimal minimax rule for feature mapping Φ because for any uncertainty set U given by (4) that contains the underlying distribution p , we have that U U and hence RΦ R(U). The optimal minimax rule could only be obtained by an exact estimation of the expectation of the feature mapping Φ that in turn would require an infinite amount of training samples. The following result provides generalization bounds for MRCs in terms of excess error probability with respect to the minimax risk at learning and the smallest minimax risk for the feature mapping used. Theorem 7 Let U and U be uncertainty sets given by (4) and (22), respectively, that satisfy R1 or R2. If h U is a 0 -1 MRC for uncertainty set U, we have that R(h U) R(U) + (|τ τ| λ)T|µ | (23) R(h U) R(U) (|τ τ| λ)T|µ| (24) R(h U) RΦ + |τ τ|T|µ µ | + λT(|µ | |µ |) (25) with µ and µ solutions to (6) and (20), respectively. In particular, if λ is a confidence vector with coverage probability 1 δ, i.e., P{|τ τ| λ} 1 δ, then, we have that R(U) R(h U) R(U) (26) R(h U) RΦ + 2λT|µ | RΦ + 2 λ µ 1. (27) with probability at least 1 δ. Proof See Appendix F. Inequalities (23), (24), and (26) bound MRCs probabilities of error w.r.t. the corresponding minimax error probability R(U) and lower bound R(U). Such quantities can be obtained at learning, R(U) is given as a byproduct of the learning process that solves optimization Pτ,λ in (6) while R(U) requires to solve an additional convex optimization problem given by (20). Inequalities (25) and (27) bound MRCs probabilities of error w.r.t. Minimax Risk Classifiers with 0 -1 Loss the smallest minimax risk RΦ. As shown in the next section, this smallest minimax risk becomes the Bayes risk using rich feature mappings so that such generalization bounds enable to prove universal consistency results for MRCs. The generalization bounds in the above result depend on the accuracy of mean vector estimates and on the confidence vector used. Several important conclusions can be drawn from such bounds: The excess error probability with respect to minimax risks decreases at the same rate as the error of the mean vector estimates. As shown in Theorem 4, with wide generality the bounds in the above theorem show differences that decrease with n at a rate O(1/ n) using mean vector estimates given by sample averages as in (5). Methods that utilize 0 -1 loss but consider uncertainty sets with fixed marginals provide significantly coarser performance guarantees. In particular, the bounds in Theorem 3 of Farnia and Tse (2016) are O(1/ε n) where ε describes the norm of the error in sample mean estimates, which is commonly ε = O(1/ n). In addition, RRM methods based on Wasserstein distances achieve generalization bounds that decrease with n at a rate O(1/n1/d) that deteriorates with the instances dimensionality d (Shafieezadeh-Abadeh et al., 2019; Frogner et al., 2021). MRCs do not heavily rely on the choice of confidence vectors. For a given mean vector τ, the upper bound in (23) takes its smallest value for the MRC corresponding with the confidence vector λ given by the error |τ τ| in the mean vector estimate. Specifically, if Ue is the uncertainty set given by (4) with mean vector τ and confidence vector λ = |τ τ|, the upper bound in (23) becomes R(Ue), and, for any uncertainty set U given by (4) with mean vector τ, we have that R(Ue) = min µ 1 τ Tµ + ϕ(µ) + |τ τ|T|µ| 1 τ Tµ + ϕ(µ ) + λT|µ | + (|τ τ| λ)T|µ | = R(U) + (|τ τ| λ)T|µ | with µ solution of (6) for uncertainty set U. Near-optimal generalization bounds can be obtained using confidence vectors that approximate |τ τ|. Specifically, for any uncertainty set U as above, we have that R(Ue) R(U) + (|τ τ| λ)T|µ | = 1 τ Tµ + ϕ(µ ) + |τ τ|T|µ | 1 τ Tµe + ϕ(µe) + λT|µe| + (|τ τ| λ)T|µ | = R(Ue) + (|τ τ| λ)T(|µ | |µe|) with µe solution of (6) for uncertainty set Ue. Therefore, the difference between the upper bound in (23) for uncertainty set U and the smallest upper bound R(Ue) is bounded by (|τ τ| λ)T(|µ | |µe|), and both terms in such scalar product are small when λ |τ τ|. The minimax risk optimized at learning can offer an adequate assessment of the MRC s error probability even if the uncertainty set used does not include the underlying Mazuelas, Romero, and Gr unwald distribution. Such minimax risk R(U) obtained as a byproduct of the learning process provides valid upper bounds for the error probability of MRCs in cases where |τ τ| λ, i.e., p U. As shown in (23) and (24), R(U) and R(U) still provide approximate bounds in other cases as long as the confidence vector is not significantly smaller than the error in the mean vector estimates. High-confidence upper and lower bounds for error probabilities can be obtained using confidence vectors with high coverage probability. Such confidence vectors can be obtained using the expressions in Theorem 4 or numerical methods such as those proposed in Waudby-Smith and Ramdas (2023). Those vectors may not be adequate for MRC learning since they are often much larger than |τ τ|. Notice that a confidence vector λδ that ensures {λδ |τ τ|} occurs with high probability for any underlying distribution is likely to be much larger than |τ τ| for a particular training set drawn from a specific underlying distribution. However, confidence vectors with high coverage probability can be used to obtain error probability bounds for general MRCs that are valid with probability at least 1 δ. A simple way to get such bounds from R(U) and R(U) corresponding with a confidence vector λ follows by noticing that (23) and (24) imply that R(U) (λδ λ)T|µ| R(h U) R(U) + (λδ λ)T|µ | (28) with probability at least 1 δ because λδ |τ τ| with that probability. Another way to obtain high-confidence bounds follows by solving the convex optimization problems in Theorem 5 for the MRC corresponding to λ and the uncertainty set corresponding to λδ. Theorem 7 and the above discussion exhibit the different roles played by confidence vectors that aim to bound the error in mean vector estimates with high probability for any probability distribution versus those that only aim to approximate such error. The former can be used to obtain provably valid performance guarantees while the later can be used to obtain small error probability and approximate performance bounds. Such roles are further studied numerically in Section 6 using multiple real datasets. The improved performance of methods that use aggressively chosen parameters has been observed in multiple fields of machine learning. For instance, in on-line learning methods, the theoretical prescriptions for learning rates that lead to valid prediction guarantees are routinely outperformed by choices that minimize prediction error on the data seen so far but have worse guarantees (see e.g., Devaine et al. (2013)). This phenomenon is related to the fact that in-expectation generalization bounds can be significantly better than in-probability generalization bounds, and it is further explored in a PAC-Bayes setting by Gr unwald et al. (2021). The generalization bounds in Theorem 7 shed light on such phenomenon for the proposed MRCs. 4.3 Universal consistency We show next that MRCs can be strongly universally consistent using rich feature mappings, that is, as the training size grows, the MRC s error probability tends to the Bayes risk with probability one for any underlying distribution. Minimax Risk Classifiers with 0 -1 Loss The theorem below shows universal consistency for MRCs given by feature mappings determined by random features corresponding with rich RKHSs (see e.g., Rahimi and Recht (2008); Bach (2017)). Specifically, let v1, v2, . . . be a sequence of i.i.d. samples from distribution p(v) generating random features ψv : X R for kernel k : X X R, that is Ep(v){ψv(x)ψv(x )} = k(x, x ), x, x X. In addition, for n = 1, 2, . . . the feature map Φn is defined by the Dn random features ψv1, ψv2, . . . , ψv Dn as Φn(x, y) = ey [1, ψv1(x), ψv2(x), . . . , ψv Dn(x)]T. (29) Using such feature mappings, we have the following result. Theorem 8 For n = 1, 2, . . ., let hn be an MRC for uncertainty set Un given by (4) with Φ = Φn as in (29), τ = τ n as in (5) and λ = λn 0. If we have that (1) k : X X R is a characteristic kernel, (2) scalar features are bounded, i.e., there exists C > 0 such that |ψv(x)| < C for all v and x, (3) the number of random features Dn defining the feature mapping Φn is non-decreasing and tends to infinity, and (4) any component of {λn} is non-increasing and tends to 0. Then, the sequence of smallest minimax risks {RΦn} tends to the Bayes risk RBayes with probability one for any underlying distribution p . If, in addition to (1)-(4), we have that (5) any component of {λn p n/ log n} tends to , and (6) Dn = O(nk) for some k > 0. Then, the sequence of MRCs probabilities of error {R(hn)} tends to the Bayes risk RBayes with probability one for any underlying distribution p . Proof See Appendix G. The conditions under which MRCs are strongly universally consistent are analogous to those corresponding to conventional ERM methods based on SRM and RKHSs (Steinwart, 2005; Zhang, 2004), e.g., regularization parameters that tend to zero not very quickly and broad RKHSs. The universal consistency in ERM techniques is achieved using RKHSs given by universal kernels while the theorem above uses characteristic kernels, which is in general a slightly weaker condition (Muandet et al., 2017). In addition, the result above provides MRCs universal consistency for binary and multiclass cases and does not rely on surrogate losses. Mazuelas, Romero, and Gr unwald Notice that for ERM methods based on SRM, the decrease of the regularization parameter for increasing number of samples corresponds to consider broader families of rules (balls in the RKHS with increased radius). On the other hand, for the proposed approach, such decrease corresponds to consider reduced uncertainty sets. This fact illustrates how the bias-complexity trade-offaddressed in the SRM approach by controlling the generality of the family of classification rules is analogous to controlling the generality of the uncertainty set of probability distributions in the proposed approach (see also discussion in Section 2.1 above). In paticular, the conditions (5) and (6) in Theorem 8 result in uncertainty sets that shrink as we get more samples but contain the underlying distribution with high probability. 5. Efficient learning of MRCs The learning stage for MRCs consists on solving optimization problem (6) that obtains the parameters for h U and h U d as well as the minimax risk R(U). This stage can be complemented by solving optimization problems (20) and (21) that can provide tight performance guarantees. In this section we first show that such optimization problems can be simplified by using a reduced instances set given by instances samples. We then propose efficient subgradient methods that take advantage of the specific structure of the above mentioned optimization problems. 5.1 Reduced instances set Solving the optimization problems that learn MRCs and obtain their performance bounds can be inefficient in cases where the feature mapping range {Φ(x, y) : x X, y Y} has a large cardinality, since the evaluation of objective functions in (6), (20) and (21) may require to search over such a range. The next result shows that this difficulty can be avoided by using instances samples instead of the whole set X, e.g., using the instances obtained at training. Theorem 9 Let Xs = {x1, x2, . . . , xs} be s i.i.d. instances from the underlying distribution p (x), hs, Rs(U), and Rs(U) be the MRC and bounds obtained by solving (6) and (20) using Xs instead of X. If Rs(U) is finite and p U with probability at least 1 δ, we have that Rs(U) εs(2C µl 1 + 1) R(hs) Rs(U) + εs(2C µu 1 + 1) R(U) + εs(2C µu 1 + 1) with probability at least 1 2δ, where µl and µu are the solutions of (20) and (21), respectively, using Xs instead of X, Φ(x, y) C, x X, y Y, and εs is given by 4 + |Y|(m + 1) log s + log |Y|/δ Proof See Appendix H. The above result shows that a sufficiently large set of instances samples Xs can be safely used instead of the whole set X in optimization problems (6) and (20) since the subsequent approximation error is O( p log(s)/s). An analogous result can be obtained for deterministic MRCs and (21) using the same arguments as in Appendix H for the above result. Minimax Risk Classifiers with 0 -1 Loss The condition |Rs(U)| < can be easily satisfied in practice. Such condition is equivalent to Us = n p (Xs Y) : |Ep{Φ} τ| λ o = that is directly achieved if τ is obtained as the sample mean of samples with instances Xs. In other cases, it can be ensured that such uncertainty set is not empty by increasing the confidence vector and shifting the mean vector. In particular, if λ 1 and λ 2 are solutions of the linear optimization problem min p,λ1,λ2 1T(λ1 + λ2) s.t. τ λ1 X x Xs,y Y p(x, y)Φ(x, y) τ + λ2 λ1 λ, λ2 λ, p (Xs Y) with 2m + s|Y| variables and 4m + s|Y| + 1 constraints. Then, taking eτ = τ + λ 2 λ 1 2 and eλ = λ 1+λ 2 2 λ we have that Us e Us = n p (Xs Y) : |Ep{Φ} eτ| eλ o = . The numerical results of next Section 6 show that the optimization problems for MRC learning can be accurately solved in practice using quite reduced sets of instances. In particular, such results show that the set of instances X in optimization problems (6), (20), and (21) can be taken as the set Xn = {x1, x2, . . . , xn} of instances obtained at training. 5.2 Efficient optimization This section presents efficient optimization techniques for MRC learning that comprises to solve problems (6), (20) and (21). These three problems can be written as: min µ f(µ) := a Tµ + λT|µ| + max{Fµ + b} (30) with a Rm, b Rp and F Rp m. The size of vector a and the number of columns of matrix F, m, equals the number of components of the feature mapping Φ : X Y Rm, while the size of vector b and the number of rows of matrix F, p, is given by the number of instances Xs used to evaluate ϕ, e.g., the number of instances at training. Specifically, p equals s(2|Y| 1) in problem (6), and s|Y| in problems (20) and (21). The optimization problem (30) can be reformulated as the linear program (LP) min µ1,µ2 a T(µ1 µ2) + λT(µ1 + µ2) + ν s.t. Fµ1 Fµ2 + b ν1 µ1, µ2 0 by introducing new variables ν R and µ1, µ2 Rm with µ = µ1 µ2. Such an LP has p + 2m constraints and 2m + 1 variables, and can be solved with high accuracy by Mazuelas, Romero, and Gr unwald off-the-shelf solvers at the expenses of a high computational time in cases where p or m is large. Problem (30) belongs to the class of nondifferentiable convex optimization problems with subgradients that are uniformly bounded. The subgradient methods (SMs) are often an attractive option to solve this class of problems since they can efficiently provide solutions with adequate accuracy (Bertsekas, 2015). A subgradient of the objective function f in (30) at point µ can be directly obtained from the maximum of vector v = Fµ + b. Specifically, if i(µ) arg max v we have that a subgradient of f at µ is given by g(µ) = a + λ sign(µ) + coli(µ)(FT) (32) where denotes the Hadamard product, sign(µ) denotes the vector given by the signs of the components of µ, and coli( ) denotes the i-th column of the argument. Often, the main limitation of SMs is the large number of iterations required. The accelerated subgradient methods (ASMs) (Nesterov and Shikhman, 2015; Tao et al., 2020) have been developed to reduce the number of iterations by using the Nesterov s extrapolation strategy (Nesterov, 1983). In particular, Algorithm 1 describes the application of the ASM proposed in Tao et al. (2020) to problem (30) for MRC learning. Algorithm 1 ASM for MRC learning Output: approximate solution µ 1: Initialize µ1 and take c1 = θ1 = 1, η1 = 0 2: y1 µ1, v1 Fµ1 + b, i1 arg max v1, µ µ1, f a Tµ1 + λT|µ1| + v(i1) 1 3: for k = 1, 2, . . . do 4: gk a + λ sign(µk) + colik(FT) 5: yk+1 µk ckgk, µk+1 (1 + ηk)yk+1 ηkyk 6: vk+1 Fµk+1 + b, ik+1 arg max vk+1 7: ck+1 (k + 1) 3/2, θk+1 2 k+1, ηk+1 θk+1 1 θk 1 8: fk+1 a Tµk+1 + λT|µk+1| + v(ik+1) k+1 9: if fk+1 < f then 10: f fk+1, µ µk+1 11: end if 12: end for The time complexity per iteration of the SMs described above can be high in cases where p or m is large. This computational cost is due to the fact that the subgradient g(µk) is computed by evaluating vk = Fµk + b that requires pm multiplications. Such computation can be carried out in a significantly more efficient manner by exploiting the specific structure of the subgradient. The vector vk can be efficiently computed from vk 1 because we have that Fg(µk 1) = Fa + Fdiag(λ)sign(µk 1) + coli(µk 1)(FFT) and Fa, FFT, and Fdiag(λ) can be computed only once. In addition, the sequence of vectors {dk = Fdiag(λ)sign(µk)} can be obtained as dk = dk 1 + Fdiag(λ) k (33) Minimax Risk Classifiers with 0 -1 Loss where the vector k = sign(µk) sign(µk 1) takes values 2 in the components where the parameters µk 1 and µk change signs, takes value 0 in the components where they have the same sign, and would take values 1 only if some of the components of such parameters are exactly zero. Therefore, the computation vector dk in (33) can be carried out very efficiently in practice by adding or subtracting the columns of 2Fdiag(λ) corresponding to the components where the iterates change sign. Algorithm 2 shows the efficient implementation of ASM that exploits the specific structure of the subgradient as described above. The result below describes how such implementation can result in significant computational savings. Algorithm 2 Efficient ASM for MRC learning Output: approximate solution µ 1: Initialize µ1 and take c1 = θ1 = 1, η1 = 0, α = Fa, G = FFT, H = 2Fdiag(λ) 2: y1 µ1, v1 Fµ1 + b, w1 v1, s1 sign(µ1), d1 (1/2)Hs1 3: i1 arg max v1, µ µ1, f a Tµ1 + λT|µ1| + v(i1) 1 4: for k = 1, 2, . . . do 5: gk a + λ sk + colik(FT) 6: yk+1 µk ckgk, µk+1 (1 + ηk)yk+1 ηkyk 7: uk α + dk + colik(G) 8: wk+1 vk ckuk, vk+1 (1 + ηk)wk+1 ηkwk, ik+1 arg max vk+1 9: sk+1 sign(µk+1), k sk+1 sk, dk+1 dk 10: for i = 1, 2, . . . m do 11: if (i) k = 2 then 12: dk+1 dk+1 + coli(H) 13: else if (i) k = 2 then 14: dk+1 dk+1 coli(H) 15: else if (i) k { 1, 1} then 16: dk+1 dk+1 + (1/2)sign( (i) k )coli(H) 18: end for 19: ck+1 (k + 1) 3/2, θk+1 2 k+1, ηk+1 θk+1 1 θk 1 20: fk+1 a Tµk+1 + λT|µk+1| + v(ik+1) k+1 21: if fk+1 < f then 22: f fk+1, µ µk+1 23: end if 24: end for Theorem 10 The sequences {µk} generated by Algorithms 1 and 2 are identical, while the computational complexity of Algorithm 2 is significantly smaller at iterations k where sign(µk) differs from sign(µk 1) in few components. Specifically, if γK is the sparsity coefficient given by the average fraction of non-zero components of k for k = 1, 2, . . . , K; then, the computational complexity of Algorithm 2 after K iterations is O(KpmγK), while that of Algorithm 1 is O(Kpm). Mazuelas, Romero, and Gr unwald Proof See Appendix I. The computation in step 6 of Algorithm 1 that has cost O(pm) is effectively replaced by steps 7-18 in Algorithm 2 that have cost O pmγ( k) where γ( k) denotes the fraction of non-zero components of k. As the algorithm progresses and the subgradient steps decrease to zero, it is expected to have few changes in the signs of µk and therefore to have a highly sparse vector k. The usage of an ASM also contributes to a reduced number of sign changes since such method provides more stable iterations than basic subgradient methods. The method in Nesterov (2014) also exploits the sparsity in subgradient methods for piecewise linear functions but it considers problems given only by a term max{Fµ + b} and with a sparse matrix F. In order to further decrease the number of iterations, we also use restarts similarly to other methods for non-smooth optimization (Yang and Lin, 2018). Our numerical results confirm the efficiency of the implementation described in Algorithm 2 which resulted in a significant reduction (more than 90%) of the running time per iteration in comparison with Algorithm 1. 6. Numerical results This section shows four sets of numerical results that describe how the proposed techniques can enable to learn MRCs efficiently and to obtain reliable and tight performance bounds at learning. We utilize 12 common datasets from the UCI repository (Dua and Graff, 2017) with characteristics given in Table 1. MRCs implementation is available in the open-source Python library MRCpy (Bondugula et al., 2023) https://Machine Learning BCAM.github. io/MRCpy/. In this section, MRCs are implemented using a feature mapping given by random features corresponding with a Gaussian kernel (Rahimi and Recht, 2008; Bach, 2017), that is Φ(x, y) = ey Ψ(x) = ey [cos u T 1 x, sin u T 1 x, cos u T 2 x, sin u T 2 x, . . . , cos u T Dx, sin u T Dx] (34) where x is a normalized instance and u1, u2, . . . , u D are i.i.d. samples from a zero-mean Gaussian distribution with covariance (1/σ2)I for a scaling parameter σ. In addition, for n training samples (x1, y1), (x2, y2), . . . (xn, yn) we obtain confidence vectors as where υ is the vector formed by the sample variances of the feature mapping components. In all the numerical experiments, if not stated otherwise, we take λ0 = 0.3 and use D = 500 random Fourier features corresponding with the scaling parameter σ = p d/2 for d the number of instances components. We compare MRCs performance and bounds with those obtained by Wasserstein-based RRM and by PAC-Bayes methods. Specifically, we utilize the distributionally robust logistic regression (DRLR) method as described in Shafieezadeh Abadeh et al. (2015) with uncertainty sets defined by Wasserstein distances, and a support vector machine (SVM) with upper bounds given by PAC-Bayes as described in Langford (2005); Ambroladze et al. (2007). The first set of numerical results shows that MRCs optimization problems can be efficiently addressed in practice by using a reduced set of instances. Specifically, for s i.i.d. Minimax Risk Classifiers with 0 -1 Loss Table 1: Characteristics of the datasets used for experimentation. Data set Number of samples Instances dimensionality % Majority class Adult 48,842 14 76 Pulsar 17,898 8 91 Credit 690 15 56 QSAR 1055 41 66 Mammographic 961 5 54 Haberman 306 3 74 Ion 351 34 64 Heart 270 13 56 Liver 583 10 71 Blood 748 4 76 Diabetes 768 8 65 Audit 776 26 61 Number of instances s Classif cation error Lower bound Rs(U) Upper bound Rs(U) Error prob. R(hs) Lower bound R(U) Upper bound R(U) Error prob. R(h) 1000 2000 3000 4000 5000 (a) Adult dataset Number of instances s Classif cation error Lower bound Rs(U) Upper bound Rs(U) Error prob. R(hs) Lower bound R(U) Upper bound R(U) Error prob. R(h) 1000 2000 3000 4000 5000 (b) Pulsar dataset Figure 1: Decrease of optimization error with the number of instances used. instances Xs = {x1, x2, . . . , xs}, we solve optimization problems (6) and (20) taking X = Xs and we study how the error of such approximation decreases with increasing s. For these numerical results we use Adult and Pulsar datasets from UCI repository (Dua and Graff, 2017) that contain a total of 48,842 and 17,898 instances, respectively. We obtain τ and λ using 500 training samples. Then, we solve (6) and (20) taking X composed by all the instances in the dataset and taking X composed by a subset of randomly selected instances Xs of size s. In Figure 1, R(hs), Rs(U) and Rs(U) denote, as in Theorem 9, the error probability and bounds of the MRC obtained solving (6) and (20) taking X = Xs = {x1, x2, . . . , xs}. Solid curves describe the average results in 50 random repetitions, the shaded areas describe the intervals formed by the standard deviation around the averages, and dashed lines describe the results obtained taking X composed by all the instances. The figure shows that the Mazuelas, Romero, and Gr unwald Optimization error Running time [s] E-ASM E-ASM-R (a) Credit dataset Optimization error Running time [s] E-ASM E-ASM-R (b) Haberman dataset Optimization error Running time [s] E-ASM E-ASM-R (c) Mammographic dataset 10 3 2 3 4 5 Optimization error Running time [s] E-ASM E-ASM-R (d) Diabetes dataset Figure 2: The proposed efficient ASM s implementation that exploits the subgradients structure can significantly reduce the running time required to accurately learn MRCs. results obtained using a reduced set of instances quickly converge to those obtained using all the instances. These results agree with the theoretical guarantees for this approximation shown in Theorem 9 and show that the constants of the bounds in such result can be quite small in practice. In the remaining numerical results we solve optimization problems (6), (20), and (21) taking X as the n instances at training. The second set of numerical results assesses the performance of the efficient SMs presented in Section 5.2. We solve problem (6) for MRC learning using four methods: the basic SM (BSM) with step size ck = 1/( k + 1 gk ); the efficient BSM that exploits the subgradient structure as shown in Section 5.2 (E-BSM); the ASM described in Algorithm 1; the efficient ASM that exploits the subgradient structure as detailed in Algorithm 2 (E-ASM); and the efficient ASM described in Algorithm 2 combined with restarts every 10,000 iterations (E-ASM-R). Minimax Risk Classifiers with 0 -1 Loss Figure 2 shows the optimization error per running time averaged over 10 random partitions of four datasets. The results manifest the significant computational savings obtained by the efficient implementation presented in Section 5.2. Such efficient implementation is especially advantageous for ASM since it results in a sparsity coefficient around 10 3 while that achieved in BSM is around 10 1. The restart strategy achieves worse computing time per iteration since it results in sparsity coefficients around 10 2. However, such strategy provides an overall improvement in running time since it requires significantly less iterations. The third set of numerical results shows how the MRCs performance and bounds change by varying the confidence vectors used. We study the effect of varying the confidence vectors at learning and the performance guarantees obtained using confidence vectors with high coverage probability. In addition, we compare the results obtained in the practical case where confidence vectors are obtained from training samples with those obtained in the ideal case where the error in the mean vector estimates is known. Specifically, for the practical case we learn MRCs using λ as in (35) and obtain high-confidence performance guarantees using λδ for δ = 0.05 given by the confidence intervals proposed in Waudby-Smith and Ramdas (2023). For the ideal case, we learn MRCs using λ = λ0|τ τ|, and obtain high-confidence performance guarantees using λδ = |τ τ|. In particular, a simple high-confidence upper bound is obtained using (28), and a tighter high-confidence upper bound is obtained using Theorem 5 with λ = λδ. In order to reproduce the ideal case, in these numerical results we use Adult and Pulsar datasets, the mean τ is calculated using all the samples while 1, 000 samples are randomly sampled for training in 50 repetitions. Figure 3 shows the error probability R(h U) and upper bound R(U) of MRCs corresponding with different values of λ0, together with the simple high-confidence upper bound obtained using (28), and the tighter high-confidence upper bound obtained using Theorem 5. Such figure shows that MRCs do not heavily rely on the choice of the confidence vector. In particular, the error probability obtained using λ as in (35) is similar to the error probability that would be obtained using the actual difference |τ τ|, and the minimax risk R(U) optimized at learning is similar to the error probability for most values of λ0. In addition, the usage of confidence vectors with high coverage probability can result in performance guarantees that are both valid with high probability and informative. Such numerical results are in agreement with the conclusions drawn from Theorem 7 in Section 4.2. In particular, Figure 3 shows that the smallest valid upper bound is obtained using λ = |τ τ| but more aggressive confidence vectors can result in improved error probability and yet provide useful performance bounds. We also compare the performance and bounds of MRCs for varying confidence vectors with those obtained by DRLRs for varying Wasserstein radius. Figure 4 shows the error probability and performance bounds of MRCs and DRLRs obtained by varying 0 λ0 1 and the Wasserstein radius 10 5 ρ 1, respectively. In particular, for each value of λ0 and ρ we averaged over 10-fold stratified partitions the error and bounds of MRCs and DRLRs. Figure 4 shows that MRCs can provide tighter performance bounds than existing techniques based on RRM with Wasserstein distances. The figure also shows that, even for datasets of similar size, DRLR methods require to fine-tune the Wasserstein radius ρ in order to obtain reliable and tight performance bounds while MRCs can utilize a confidence parameter λ0 that does not strongly depend on the dataset. Figure 4 also shows that deterministic 0 -1 Mazuelas, Romero, and Gr unwald Classif cation error Upper bound Tight high conf. upper bound Simple high conf. upper bound Error probability 0.4 0 0 0.5 1 1.5 2 (a) Adult dataset, ideal case Classif cation error Upper bound Tight high conf. upper bound Simple high conf. upper bound Error probability 0.4 0 0 0.5 1 1.5 2 (b) Adult dataset, practical case Classif cation error Upper bound Tight high conf. upper bound Simple high conf. upper bound Error probability 0 0 0.5 1 1.5 2 (c) Pulsar dataset, ideal case (d) Pulsar dataset, practical case Figure 3: Error probabilities and performance bounds of MRCs for different choices of confidence vectors in comparison with the ideal case that use the actual error in mean vector estimates. Minimax Risk Classifiers with 0 -1 Loss (a) MRC performance bounds and classification error in credit dataset. (b) DRLR performance bounds and classification error in credit dataset. (c) MRC performance bounds and classification error in QSAR dataset. (d) DRLR performance bounds and classification error in QSAR dataset. (e) MRC performance bounds and classification error in mammographic dataset. (f) DRLR performance bounds and classification error in mammographic dataset. Figure 4: MRCs performance bounds for varying confidence vectors in comparison with those obtained by RRM techniques based on Wasserstein distances. Mazuelas, Romero, and Gr unwald Table 2: Classification error of MRC with model selection based on the performance bounds in comparison with state-of-the-art techniques. Data set SVM-CV SVM-PAC DRLR-UB MRC-UB Det. MRC-UB Haberman .26 .27 .32 .25 .25 Heart .17 .17 .23 .21 .18 Liver .28 .28 .33 .29 .28 Blood .22 .23 .25 .24 .22 Credit .14 .13 .20 .15 .14 Diabetes .23 .23 .30 .27 .24 Ion .08 .08 .07 .12 .07 QSAR .11 .12 .14 .18 .12 Mammographic .17 .17 .21 .19 .18 Audit .05 .06 .04 .07 .06 MRCs, h U d, can obtain improved classification error in practice at the expenses of less tight performance guarantees. The fourth set of numerical results shows how MRCs performance bounds can be used for model selection. We compare the performance obtained by selecting the scaling parameter of a Gaussian kernel using four methods, one based on conventional cross-validation and three based on error upper bounds. Specifically, SVM-CV selects the scaling parameter for which a SVM classifier achieves the smallest cross-validation error over 10-fold partitions of the training data. SVM-PAC, DRLR-UB, and MRC-UB select the scaling parameter with smallest error upper bound in training. Such upper bounds are obtained for SVM-PAC by using PAC-Bayes methods as in Langford (2005); Ambroladze et al. (2007), while for DRLR-UB and MRC-UB the upper bounds are obtained as the value the optimization problem solved at learning as shown in Shafieezadeh-Abadeh et al. (2015) and in equation (6), respectively. For all datasets, MRCs utilize confidence vectors as in (35) with λ0 = 0.3 and DRLRs utilize Wasserstein radious of ρ = 0.003 as in Shafieezadeh-Abadeh et al. (2015). Table 2 shows the classification error obtained by the methods compared in 10 datasets. Specifically, we generate 20 random stratified splits with 20% test samples. In each split, we utilize the training samples to learn a classifier selecting a scaling parameter over 20 uniformly spaced candidates between the 10th and 90th percentiles of the Euclidean distances among normalized instances. Then, the error of the corresponding classifier is estimated using the test samples and the final values in the table are obtained by averaging the results over the 20 random splits of the data. Table 2 shows that the proposed MRCs as well as methods based on PAC-Bayes can obtain similar performance to that obtained by conventional methods based on cross-validation. However, MRCs and PAC-Bayes methods require a significantly smaller complexity since they only need to carry out one optimization per parameter value while cross-validation methods require to carry out such optimization 10 times per parameter value. In addition, the performance bounds obtained by MRCs can be used not only for model selection but also to obtain tight estimates for the classification error. Figure 5 shows the differences between performance bounds and error probabilities for all Minimax Risk Classifiers with 0 -1 Loss Figure 5: Boxplots with differences between error probability and bounds for PAC, DRLR, and MRC methods over multiple datasets and hyperparameter configurations. the splits, datasets, and hyper-parameters in this fourth set of numerical results. The figure shows that the performance bounds provided by MRCs are much tighter than those based on PAC-Bayes and much more reliable that those based on RRM that use Wasserstein balls without a fine-tuned radious. In particular, the difference between the classification error and the performance bounds are around 0.05 for MRCs and around 0.1 for deterministic MRCs. 7. Conclusion The paper presents minimax risk classifiers (MRCs) that minimize the worst-case 0 -1 loss over general classification rules and provide tight performance guarantees. We show how the out-of-sample performance of MRCs can be reliably estimated at learning, and that the MRCs error due to finite training sizes is determined by the accuracy of expectation estimates. In addition, we show that MRCs are strongly universally consistent in situations analogous to those corresponding with kernel-based methods. The proposed methodology can offer classification techniques that do not rely on specific training samples and choices for surrogate losses/hypothesis classes. Instead, MRC learning is based on expectation estimates, and its inductive bias comes only from a feature mapping that determines which expectations are estimated. Therefore, the methods presented can provide techniques that are robust to practical situations that defy common assumptions, e.g., training samples that follow a different distribution or display heavy tails. Mazuelas, Romero, and Gr unwald Acknowledgments Funding in direct support of this work has been provided by projects PID2022-137063NB-I00, CNS2022-135203, and CEX2021-001142-S funded by MCIN/AEI/10.13039/501100011033 and the European Union Next Generation EU /PRTR, and programes ELKARTEK and BERC-2022-2025 funded by the Basque Government. Appendix A. Lemma 11 Let U be an uncertainty set given by (4). For any h T(X, Y), we have that sup p U ℓ(h, p) inf µ Rm 1 τ Tµ + sup x X,y Y Φ(x, y)Tµ h(y|x) + λT|µ| (36) In addition, if U that satisfies R1 or R2, we have that sup p U ℓ(h, p) = min µ Rm 1 τ Tµ + sup x X,y Y Φ(x, y)Tµ h(y|x) + λT|µ| (37) Proof In the first step of the proof we show that the right hand side of (36) is equivalent to the Fenchel dual of the left hand side, then the second step of the proof shows that strong duality holds if U satisfies R1 or R2. Let 0 r m be the number of non-zero components of λ (without loss of generality we assume such components are the r first components of λ), and M(X Y) be the set of signed Borel measures over X Y with bounded total variation. The set M(X Y) is a Banach space with the total variation norm (see e.g., Chapter 10 in Aliprantis and Border (1994)). If A is the linear mapping A : M(X Y) Rr+m+1 p 7 [ R Φ1(x, y)dp(x, y), R Φ1(x, y)dp(x, y), R Φ2(x, y)dp(x, y), R dp(x, y)] (38) where Φ1(x, y) Rr denotes the first r components of Φ(x, y) and Φ2(x, y) Rm r denotes the last m r components of Φ(x, y). A is bounded and its adjoint operator is A : Rr+m+1 F(X Y) (M(X Y)) µ1, µ2, µ3, ν 7 Φ1( , )T(µ1 µ2) + Φ2( , )Tµ3 + ν. (39) where F(X Y) is the set of bounded Borel measurable functions over X Y. Then, we have that sup p U ℓ(h, p) = sup p U 1 Z h(y|x)dp(x, y) = 1 inf p M(X Y) f(p) + g(A(p)) (40) where f and g are the lower semi-continuous convex functions g : Rr+m+1 R { } (a1, a2, a3, b) 7 0 if a1 τ 1 + λ1, a2 τ 1 + λ1, a3 = τ 2, b = 1 otherwise (41) Minimax Risk Classifiers with 0 -1 Loss for τ 1, λ1 Rr (resp. τ 2, λ2 Rm r) given by the first r (resp. last m r) components of τ and λ, and f : M(X Y) R { } p 7 R h(y|x)dp(x, y) if p is nonnegative otherwise. (42) Then, the Fenchel dual (see e.g., Borwein and Zhu (2004)) of (40) is 1 sup µ1,µ2,µ3,ν f (A (µ1, µ2, µ3, ν)) g ( µ1, µ2, µ3, ν) (43) where f and g are the conjugate functions of f and g. If w F(X Y), we have that f (w) = sup p 0 Z (w(x, y) h(y|x))dp(x, y) = 0 if w(x, y) h(y|x), x, y X Y otherwise and g ( µ1, µ2, µ3, ν) is given by sup a T 1 µ1 a T 2 µ2 τ T 2 µ3 ν s.t. a1 τ 1 + λ1, a2 τ 1 + λ1 = (τ 1 + λ1)Tµ1 + (τ 1 λ1)Tµ2 τ T 2 µ3 ν if µ1 0, µ2 0 otherwise. Hence, the dual problem (43) becomes inf µ1,µ2,µ3,ν 1 (τ 1 + λ1)Tµ1 ( τ 1 + λ1)Tµ2 τ T 2 µ3 ν s.t. Φ1(x, y)T(µ1 µ2) + Φ2(x, y)Tµ3 + ν h(y|x), (x, y) X Y µ1 0, µ2 0 = inf µ1,µ2,µ3,ν 1 + (τ 1 + λ1)Tµ1 + ( τ 1 + λ1)Tµ2 τ 2µ3 ν s.t. Φ1(x, y)T(µ2 µ1) + Φ2(x, y)Tµ3 + ν h(y|x), (x, y) X Y µ1 0, µ2 0 (44) = inf µ,ν 1 τ Tµ + λT|µ| ν s.t. Φ(x, y)Tµ + ν h(y|x), (x, y) X Y (45) = inf µ 1 τ Tµ + sup (x,y) X Y Φ(x, y)Tµ h(y|x) + λT|µ|. (46) The expression in (45) is obtained taking µ = [(µ2 µ1)T, µT 3 ]T. Firstly, in (44) we can consider only pairs µ1, µ2 such that µ1 + µ2 = |µ1 µ2| because for any pair µ1, µ2 feasible in (44), we have that µ1 = (µ2 µ1)+, µ2 = (µ1 µ2)+ is a feasible pair because µ1 µ2 = µ1 µ2, and we also have that λT 1 | µ1 µ2| = λT 1 ( µ1 + µ2) λT 1 (µ1 + µ2). Then, we obtain (45) from (44) because τ T 1 (µ2 µ1) + τ T 2 µ3 = τ Tµ λT 1 |µ2 µ1| = λT|µ|, since λ2 = 0 Φ1(x, y)T(µ2 µ1) + Φ2(x, y)Tµ3 = Φ(x, y)Tµ. Mazuelas, Romero, and Gr unwald The expression in (46) is obtained since for any feasible (µ, ν) in (45) we have that (µ, ν) is feasible if ν = inf (x,y) X Y h(y|x) Φ(x, y)Tµ and ν ν. Then, the inequality in (36) follows by weak duality. For the second step of the proof, if U satisfies R1, we have that strong duality holds because the dual becomes the Lagrange dual and the constraints in (40) are linear affine (see e.g., Chapter 5 in Boyd and Vandenberghe (2004)). If U satisfies R2, we show in the following that strong duality holds because 0 int(dom g Adom f) (see e.g., Chapter 4 in Borwein and Zhu (2004)), where dom denotes the set where an extended-valued function takes finite values, and int denotes the interior of a set. If U satisfies R2.1, there exists R > 0 such that R < λ(i) |EpΦ(i)(x, y) τ (i)| for i = 1, 2, . . . , m. Then, the second step of the proof is obtained by showing that if 0 < ε < R/(R + 1 + Ep{Φ} 2) < 1, we have that the ball with radius ε centered in 0 Rr+m+1, B(0, ε) satisfies B(0, ε) (dom g Adom f) Rr+m+1. We have that for any z B(0, ε) R2m+1, there exist ξ1, ξ2 B(0, R) Rm such that z(1), z(2), . . . , z(m) = Ep{Φ} + ξ1 (1 z(2m+1))Ep{Φ} z(m+1), z(m+2), . . . , z(2m) = Ep{Φ} + ξ2 + (1 z(2m+1))Ep{Φ} z(2m+1) = 1 (1 z(2m+1)) because we have that z(1), z(2), . . . , z(m) z(2m+1)Ep{Φ} 2 ε(1 + Ep{Φ} 2) < R z(m+1), z(m+2), . . . , z(2m) + z(2m+1)Ep{Φ} 2 ε(1 + Ep{Φ} 2) < R. Then, the result is obtained observing that Ep{Φ} + ξ1, Ep{Φ} + ξ2, 1 dom g because R < λ(i) |EpΦ(i)(x, y) τ (i)| for i = 1, 2, . . . , m, and (1 z(2m+1))Ep{Φ}, (1 z(2m+1))Ep{Φ}, (1 z(2m+1)) Adom f because |z(2m+1)| ε < 1 and hence (1 z(2m+1))p is a nonnegative measure. If U satisfies R2.2, we have that Ep{Φ} int(Conv S), where S denotes the support of Φ(x, y) if (x, y) p, and Conv denotes the convex hull of a set. Firstly, int(Conv S) = because S and hence (Conv S) S are not contained in a proper affine subspace. In the case that Ep{Φ} Conv S\Conv S, using the Hanh-Banach separation theorem (see e.g., Theorem 2, Sec 5.12 in Luenberger (1997)), there would exists a hyperplane Γ = {z; αTz = c} Rm such that Ep{Φ} Γ and αTz c for all z Conv S. Then, the real-valued random variable αTΦ(x, y) c with (x, y) p is nonnegative with probability one and has expectation zero. Therefore, αTΦ(x, y) = c with probability one, and we would have that the S is contained in the hyperplane Γ, which leads to a contradiction. Let R > 0 be such that Minimax Risk Classifiers with 0 -1 Loss 1. R < λ(i) |EpΦ(i)(x, y) τ (i)| for i = 1, 2, . . . , r, and 2. for any ξ in the Euclidean ball of Rm centered at 0 with radius R, B(0, R), we have that Ep{Φ}+ξ Conv S, that is, there exists pξ (X Y) such that Epξ{Φ} = Ep{Φ}+ξ. As in the previous case, the second step of the proof is obtained by showing that if 0 < ε < R/(R + 1 + Ep{Φ} 2) < 1, we have that B(0, ε) (dom g Adom f) Rr+m+1. If r = 0, we have that for any z B(0, ε) Rm+1, there exist ξ3 B(0, R) such that z(1), z(2), . . . , z(m) = Ep{Φ} (1 z(m+1))(Ep{Φ} + ξ3) z(m+1) = 1 (1 z(m+1)) because we have that z(1), z(2), . . . , z(m) + z(m+1)Ep{Φ} 2 1 zm+1 ε 1 ε(1 + Ep{Φ} 2) < R. Then, the result is obtained observing that Ep{Φ}, 1 dom g and (1 z(m+1))(Ep{Φ} + ξ3), (1 z(m+1)) Adom f because |z(m+1)| ε < 1 and hence (1 z(m+1))pξ3 is a nonnegative measure, for pξ3 satisfying Epξ3{Φ} = Ep{Φ} + ξ3. If 0 < r < m, we have that for any z B(0, ε) Rr+m+1, there exist ξ1, ξ2 B(0, R) Rr, and ξ3 B(0, R) Rm r such that z(1), z(2), . . . , z(r) = Ep{Φ1} + ξ1 (1 z(r+m+1))Ep{Φ1} z(r+1), z(2), . . . , z(2r) = Ep{Φ1} + ξ2 + (1 z(r+m+1))Ep{Φ1} z(2r+1), z(2r+2), . . . , z(r+m) = Ep{Φ2} (1 z(r+m+1))(Ep{Φ2} + ξ3) z(r+m+1) = 1 (1 z(r+m+1)) because we have that z(1), z(2), . . . , z(m) z(r+m+1)Ep{Φ1} 2 ε(1 + Ep{Φ} 2) < R z(r+1), z(r+2), . . . , z(2r) + z(r+m+1)Ep{Φ1} 2 ε(1 + Ep{Φ} 2) < R z(2r+1), z(2r+2), . . . , z(r+m) + z(r+m+1)Ep{Φ2} 2 1 zr+m+1 ε 1 ε(1 + Ep{Φ} 2) < R. Then, the result is obtained observing that Ep{Φ1} + ξ1, Ep{Φ1} + ξ2, Ep{Φ2}, 1 dom g because R < λ(i) |EpΦ(i)(x, y) τ (i)| for i = 1, 2, . . . , r, and (1 z(r+m+1)) Ep{Φ1}, Ep{Φ1}, Ep{Φ2} + ξ3, 1 Adom f Mazuelas, Romero, and Gr unwald because |z(r+m+1)| ε < 1 and hence (1 z(r+m+1))pξ3 is a nonnegative measure, for pξ3 satisfying Epξ3{Φ1} = Ep{Φ1} and Epξ3{Φ2} = Ep{Φ2} + ξ3 that exists because (0T, ξT 3 )T 2 R. Finally, since strong duality holds and U is not empty we have that the optimal value in (37) is finite and hence the optimal in the dual is attained (Borwein and Zhu, 2004) and the inf in (36) becomes min . Appendix B. Proof of Theorem 2 Proof Using Lemma 11 we have that inf h T(X,Y) sup p U ℓ(h, p) = inf h,µ 1 τ Tµ + λT|µ| + sup x X,y Y {Φ(x, y)Tµ h(y|x)} = min µ 1 τ Tµ + λT|µ| + inf h sup x X,y Y {Φ(x, y)Tµ h(y|x)} inf h sup x X,y Y {Φ(x, y)Tµ h(y|x)} = inf h,ν ν s.t. Φ(x, y)Tµ h(y|x) ν, x X, y Y. Then, the result is obtained because Φ(x, y)Tµ h(y|x) ν, x X, y Y h(y|x) Φ(x, y)Tµ ν, x X, y Y y C (Φ(x, y)Tµ ν) 1, x X, C Y P y C Φ(x, y)Tµ 1 |C| , x X, C Y with ϕ(µ) given by (7). For each µ, there exist classification rules h satisfying h(y|x) Φ(x, y)Tµ ϕ(µ), x X, y Y by definition of ϕ(µ). Hence, such classification rules are solution of inf h sup x X,y Y {Φ(x, y)Tµ h(y|x)} with optimal value ϕ(µ). Minimax Risk Classifiers with 0 -1 Loss Appendix C. Proof of Theorem 3 Proof The result can be proven analogously of that in Theorem 2 shown in Appendix B. Firstly, for each h T(X, Y), we have that supp V ℓ(h, p) = 1 min p h Tp + I+(p) s.t. P y Y p(xi, y) = 1 n, i = 1, 2, . . . , n τ λ ΦTp τ + λ where p, h, and Φ denote the vectors and matrix with rows p(xi, y), h(y|xi) and Φ(xi, y)T, respectively, for y Y, i = 1, 2, . . . , n, and I+(p) = 0 if p 0 otherwise. Optimization problem (47) has Fenchel (Lagrange) dual 1 max µ1,µ2,ν τ λ Tµ1 τ + λ Tµ2 1 n Pn i=1 ν(i) f (Φ(µ1 µ2) eν) s.t. µ1, µ2 0 where eν is the vector in Rn|Y| with component corresponding with (xi, y) for i = 1, 2, . . . , n, y Y given by ν(i), and f is the conjugate function of f(p) = h Tp + I+(p) given by f (w) = sup p 0 w Tp h Tp = 0 if w h otherwise. Therefore, the Lagrange dual above becomes 1 max µ1,µ2,ν τ λ Tµ1 τ + λ Tµ2 1 n Pn i=1 ν(i) s.t. µ1, µ2 0 Φ(xi, y)T(µ1 µ2) ν(i) h(y|xi), y Y, i = 1, 2, . . . , n. It is easy to see that the solution of such optimization problem µ1, µ2 satisfies that µ(i) 1 µ(i) 2 = 0 for any i such that λi > 0. Then λT( µ1 + µ2) = λT| µ1 µ2| and taking µ = µ1 µ2 the Lagrange dual above is equivalent to 1 max µ,ν τ Tµ λT|µ| 1 n Pn i=1 ν(i) Φ(xi, y)Tµ ν(i) h(y|xi), y Y, i = 1, 2, . . . , n that has the same value as supp V ℓ(h, p) since the constraints in (47) are affine and V is non-empty. Therefore, inf h T(X,Y) sup p V ℓ(h, p) = inf h,µ,ν 1 τ Tµ + λT|µ| + 1 Φ(xi, y)Tµ ν(i) h(y|xi), y Y, i = 1, 2, . . . , n Mazuelas, Romero, and Gr unwald and, similarly to the proof for Theorem 2, we have that Φ(xi, y)Tµ ν(i) h(y|xi), y Y, i = 1, 2, . . . , n y C Φ(xi, y)Tµ ν(i) 1, C Y, i = 1, 2, . . . , n P y C Φ(xi, y)Tµ 1 |C| , C Y, i = 1, 2, . . . , n ν(i) ϕ(µ, xi), i = 1, 2, . . . , n. Therefore, for each µ, we have that any classification rule satisfying h(y|x) Φ(x, y)Tµ ϕ(µ, x), x X, y Y is solution of inf h,ν 1 n Φ(xi, y)Tµ ν(i) h(y|xi), y Y, i = 1, 2, . . . , n that has optimal value 1 n Pn i=1 ϕ(µ, xi). Then, the result is obtained because for any x X, we have that X Φ(x, y)Tµ ϕ(µ, x) because otherwise there would exist νx < ϕ(µ, x) such that Φ(x, y)Tµ νx + = max C Y y C Φ(x, y)Tµ νx which contradicts the definition of ϕ(µ, x). Appendix D. Proof of Theorem 4 Proof The first result is a direct consequence of Hoeffding s inequality and the union bound since each component of Φ is bounded by its corresponding scalar feature. Similarly, the second result is a consequence of the empirical Bernstein inequality in Maurer and Pontil (2009). Minimax Risk Classifiers with 0 -1 Loss For the last result, we have that for each j Y i=1 ψ(xi)I{yi = j} E ψ(x)I{y = j} i=1 ψ(xi)I{yi = j} p (y = j)E{ψ(x|y = j)} i=1 ψ(xi)I{yi = j} E{ψ(x|y = j)} + E{ψ(x|y = j)} nj n p (y = j) i=1 ψ(xi)I{yi = j} E{ψ(x|y = j)} n p (y = j) For the first term, we have that with probability at least 1 δ i=1 ψ(xi)I{yi = j} E{ψ(x|y = j)} 2Rnj(F) + 2C p log 2/δ p2nj using the uniform concentration bound in terms of Rademacher complexities (see e.g., Mohri et al. (2018)). For the second term, we have that with probability at least 1 δ n p y(y = j) using Hoeffding s inequality. Therefore, the result is obtained using the union bound. Appendix E. Proof of Theorem 5 Proof The result for R(U, h) is a direct consequence of Lemma 11, and the second result for R(U, h) is obtained analogously since inf p U ℓ(h, p) = sup p U 1 Z h(y|x) dp(x, y) inf µ 1 τ Tµ + λT|µ| + sup x X,y Y {Φ(x, y)Tµ + h(y|x)} that leads to the expression in (20) changing the notation for the variable in the optimization from µ to µ. Mazuelas, Romero, and Gr unwald Appendix F. Proof of Theorem 7 Proof Let U be the uncertainty set in (22). It is clear that p U , then using Theorem 5 R(h U) R(U , h U) = min µ 1 τ T µ + sup x X,y Y {Φ(x, y)Tµ h U(y|x)} 1 τ T µ + sup x X,y Y {Φ(x, y)Tµ h U(y|x)} 1 τ T µ + sup x X,C Y P y C Φ(x, y)Tµ 1 = R(U) + (τ τ )Tµ λT|µ | where (48) is due to the definition of h U. Analogously, R(h U) R(U , h U) = max µ 1 τ T µ + inf x X,y Y{Φ(x, y)Tµ h U(y|x)} 1 τ T µ + inf x X,y Y{Φ(x, y)Tµ h U(y|x)} = R(U) + (τ τ )Tµ + λT|µ|. For inequality (25), note that from (48) and using the definition of µ we have that R(h U) 1 τ Tµ + sup x X,C Y P y C Φ(x, y)Tµ 1 |C| + λT|µ | λT|µ | + (τ τ )µ = RΦ + (τ τ)T(µ µ ) + λT(|µ | |µ |) Finally, the results in (26) and (27) are obtained analogusly as the previous result using U instead of U . Appendix G. Proof of Theorem 8 Proof In the first step of the proof we show that if Vn is the uncertainty set Vn = {p (X Y) : τ n λn Ep{Φn} τ n + λn} with τ n = Ep {Φn} and λn satisfying condition (4) in the theorem s statement, then, we have that {R(Vn)} tends to RBayes with probability one for any underlying distribution p . In the second step of the proof we obtain the result using Theorem 7 and the Borel-Cantelli Lemma. For the first step, we have that R(Vn) for n = 1, 2, . . . is a non-increasing sequence because for n = 1, 2, . . . the sequence Dn is non-decreasing and any component of λn is non-increasing. In addition, R(Vn) for n = 1, 2, . . . is lower bounded by RBayes since p Vn for any n. Then, the first step in the proof is obtained showing that RBayes is the largest lower bound and using the monotone convergence theorem. Specifically, in case there exists L R such that for all n R(Vn) L > RBayes = R({p }) Minimax Risk Classifiers with 0 -1 Loss then, it would exist a distribution p = p with p Vn for all n because Vn1 Vn2 if n1 n2. Since p , p Vn for all n, we have that 2λn Ep {Φn} Ep {Φn} 2λn (49) In particular, for all n Ep ey Ep ey 2 λn so that p y = p y. Using again (49) and the definition of Φn, denoting Ψn(x) = [ψv1(x), ψv2(x), . . . , ψv Dn(x)]T we get that 1 Dn Ep {ey Ψn} Ep {ey Ψn} 2 2 4|Y| λn 2 n 0 because any component of λn tends to 0. Therefore, for all y Y we have that x X Ψn(x)dp (x, y) Z x X Ψn(x)dp (x, y) 2 so that for j Y such that p y(j) = 0 x X Ψn(x)dp x|y=j Z x X Ψn(x)dp x|y=j 2 Ψn(x)TΨn(x ) Dn d(p x|y=j p x|y=j)d(p x |y=j p x |y=j) n 0. Using the law of large numbers we have that with probability one Ψn(x)TΨn(x ) Dn n k(x, x ) x X,x X k(x, x )d(p x|y=j p x|y=j)d(p x |y=j p x |y=j) = 0 Z k(x, )dp x|y=j Z k(x, )dp x|y=j for H the RKHS given by kernel k. But equality (50) contradicts p = p because k is a characteristic kernel and p y = p y for all y Y. As a consequence of the previous result, we also get that RΦn converges with probability one to RBayes since the smallest minimax risk satisfies RΦn = R(Un ) with Un = {p (X Y) : Ep{Φn(x, y)} = τ n } that coincides with Vn above taking λn = 0 for all n. For the second step, if λn and Dn satisfy the two additional conditions in the theorem s statement, let N0 be an integer such that any component of λn is larger than 2 log(|Y|(Dn + 1)) + 2 log 2n2 Mazuelas, Romero, and Gr unwald for any n N0. Such N0 exists since any component of λn p n/ log n tends to and Dn = O(nk) for some k > 0. Then, for n N0 we have that p Un with probability at least 1 1/n2 because using Hoeffding s inequality we have that 2 log(|Y|(Dn + 1)) + 2 log(2/(1/n2)) with probability at least 1 1/n2. Therefore, since Vn satisfies R2.1, using Lemma 11 we have that with probability at least 1 1/n2 R(hn) R(Un) inf µ 1 τ T nµ + ϕ(µ) + λT n|µ| 1 τ T n µn + ϕ( µn) + λT n| µn| = R(Vn) + (τ n τ n)T µn where µn is the solution of (6) for τ = τ n and λ = λn. If λn is the smallest component of λn, we have that µn 1 1/λn because 0 1 (τ n )T µn + ϕ( µn) + λT n| µn| = R(Vn) 1 1 (τ n )T µn + ϕ( µn) R(Un ) 0. Therefore, with probability at least 1 1/n2 R(hn) R(Vn) + τ n τ n µn 1 R(Vn) + C 2 log(|Y|(Dn + 1)) + 2 log 2n2 Let ε > 0 and consider N N0 such that for any n N R(Vn) < RBayes + ε 2 log(|Y|(Dn + 1)) + 2 log 2n2 Such N exists because R(Vn) RBayes and nλn log n when n tends to infinity, and Dn = O(nk). Therefore, X n 1 P{R(hn) RBayes ε} N 1 + X so that the result follows using the Borel-Cantelli Lemma. Minimax Risk Classifiers with 0 -1 Loss Appendix H. Proof of Theorem 9 Proof We first show that optimization problems (20) and (21) using Xs instead of X are equivalent to those using subsets of X that cover most of the probability mass of the underlying distribution. We then prove that the probabilities of error in such subsets are near the probabilities of error in all the set X. Let µu be a solution of the optimization problem min µ 1 τ Tµ + max x Xs,y Y Φ(x, y)Tµ hs(y|x) + λT|µ| (51) and µl be a solution of the optimization problem max µ 1 τ Tµ + min x Xs,y Y Φ(x, y)Tµ hs(y|x) λT|µ| (52) Both µu and µl exist because Xs is finite and Rs(U) is finite. If Xu and Xl are the sets Xu = n x X : Φ(x, y)Tµu hs(y|x) max x Xs Φ(x, y)Tµu hs(y|x) , y Y o Xl = n x X : Φ(x, y)Tµl hs(y|x) min x Xs Φ(x, y)Tµl hs(y|x) , y Y o we have that optimization problem (51) is equivalent to that obtained substituting Xs by Xu because the objective function of the former problem is a lower bound for the latter and both coincide in µu as a direct consequence of the definition of Xu. Similarly, (52) is equivalent to that obtained substituting Xs by Xl. We next proof that with probability at least 1 δ over the randomness of Xs, the sets Xu and Xl have probability larger than 1 εs with respect to the probability distribution p (x). For each j Y, let µu,j and µl,j be the vectors of size |Y|(m+1) formed by concatenating the vectors [µT u , 1]TI{y = j} and [µT l , 1]TI{y = j} for y = 1, 2, . . . , |Y|, respectively. In addition, for each j Y, let du,j, dl,j R be given by du,j = max x Xs Φ(x, j)Tµu hs(j|x) dl,j = min x Xs Φ(x, j)Tµl hs(j|x) . Then, denoting u(x) = [Φ(x, 1)T, hs(1|x), Φ(x, 2)T, hs(2|x), . . . , Φ(x, |Y|)Ths(|Y||x), ]T R|Y|(m+1) we have that Xu = {x X : u(x)Tµu,j du,j, j Y} Xl = {x X : u(x)Tµl,j dl,j, j Y}. Hence, if A is the set of half spaces in R|Y|(m+1), and S(A, s) denotes the s-th shatter coefficient of A (see e.g., Theorem 12.5 in Devroye et al. (1996)), we have that with probability at least 1 δ E{I{u(x)Tµu,j du,j}} 1 i=1 I{u(xi)Tµu,j du,j} 32(log 8S(A, s) + log |Y| Mazuelas, Romero, and Gr unwald E{I{u(x)Tµl,j dl,j}} 1 i=1 I{u(xi)Tµl,j dl,j} 32(log 8S(A, s) + log |Y| for all j Y. Therefore, using the union bound and the fact that S(A, s) 2(s 1)(m+1)|Y| + 2 4s(m+1)|Y| (see e.g., Corollary 13.1 in Devroye et al. (1996)) we get that with probability at least 1 δ over the randomness of Xs, the sets Xu and Xl have probability larger than 1 εs. For the last step of the proof, let R|Z(h) denote the risk of rule h restricted to Z X, that is, x Z,y Y ℓ(h, (x, y))dp |Z(x, y) for p |Z the probability measure corresponding to p restricted to Z, that is p |Z(x, y) = p (Z) if x Z 0 otherwise. Then, with probability at least 1 δ, we have that R|Xl(hs) εs R(hs) R|Xu(hs) + εs (53) because for any rule h and set Z X x X,y Y ℓ(h, (x, y))dp (x, y) x Z,y Y ℓ(h, (x, y))dp |Z(x, y) + Z x X\Z,y Y ℓ(h, (x, y))dp (x, y) = R|Z(h) (1 p (Z))R|Z(h) + Z x X\Z,y Y ℓ(h, (x, y))dp (x, y) R(h) R|Z(h) + (1 p (Z))(1 R|Z(h)) R|Z(h) + (1 p (Z)) R(h) R|Z(h) (1 p (Z)) because 0 ℓ(h, (x, y)) 1 for any x and y. Taking U1 = {p (Xu Y) : |EpΦ τ| λ + 2Cεs1} U2 = {p (Xl Y) : |EpΦ τ| λ + 2Cεs1} we have that p U implies that p |Xu U1 and p |Xl U2 with probability at least 1 δ. Then, using Theorem 5 we have that with probability at least 1 2δ R|Xu(hs) sup p U1 ℓ(hs, p) inf µ 1 τ Tµ + sup x Xu,y Y {Φ(x, y)Tµ hs(y|x)} + λT|µ| + 2Cεs µ 1 1 τ Tµu + sup x Xu,y Y {Φ(x, y)Tµu hs(y|x)} + λT|µu| + 2Cεs µu 1 = 1 τ Tµu + max x Xs,y Y{Φ(x, y)Tµu hs(y|x)} + λT|µu| + 2Cεs µu 1 = Rs(U) + εs2C µu 1 (54) Minimax Risk Classifiers with 0 -1 Loss using the definition of Xu and Corollary 6. In addition, we have that with probability at least 1 2δ R|Xl(hs) inf p U2 ℓ(hs, p) sup µ 1 τ Tµ + inf x Xl,y Y{Φ(x, y)Tµ hs(y|x)} λT|µ| 2Cεs µ 1 1 τ Tµl + inf x Xl,y Y{Φ(x, y)Tµl hs(y|x)} λT|µl| 2Cεs µl 1 = 1 τ Tµl + min x Xs,y Y{Φ(x, y)Tµl hs(y|x)} λT|µl| 2Cεs µl 1 = Rs(U) εs2C µl 1 (55) using the definition of Xl. Therefore, the result is obtained since inequalities (53), (54), (55) are simultaneously satisfied with probability at least 1 2δ. Appendix I. Proof of Theorem 10 Proof Let α, G, and H be given as in Algorithm 2. If g(µ) is the subgradient given by (32), we have that F (µ cg(µ)) + b = F µ c a + λ sign(µ) + coli(µ)(F) + b = Fµ + b c α + Hsign(µ ) + coli(µ)(G) c H sign(µ) sign(µ ) for any µ Rm and c R. Then, using induction it is straightforward to show that in Algorithm 2 we have that yk+1 = µk ckg(µk), µk+1 = yk+1 + θk(θ 1 k 1)(yk+1 yk) wk+1 = Fyk+1 + b vk+1 = Fµk+1 + b ik+1 = arg max Fµk+1 + b for all k. Therefore the sequences {µk} generated by Algorithms 1 and 2 are identical. To prove the second claim, note that the computational complexity of Algorithm 1 in each iteration is given by the multiplication Fµk+1+b in step 6, which has a time complexity of O(pm). On the other hand, the computational complexity of Algorithm 2 in each iteration is determined by steps 7-14 which have a time complexity of O(pmγ( k)) where γ( k) denotes the fraction of non-zero components of vector k = sign(µk+1) sign(µk). Mazuelas, Romero, and Gr unwald Charalambos D. Aliprantis and Kim C. Border. Infinite dimensional analysis. Springer Verlag, Berlin, 1994. Ver onica Alvarez, Santiago Mazuelas, and Jose A. Lozano. Minimax classification under concept drift with multidimensional adaptation and performance guarantees. In Proceedings of the 39th International Conference on Machine Learning, pages 486 499, 17 23 Jul 2022. Amiran Ambroladze, Emilio Parrado-Hern andez, and John Shawe-Taylor. Tighter PACBayes bounds. In Advances in Neural Information Processing Systems, volume 19, pages 9 16, 2007. Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In Conference on Uncertainty in Artificial Intelligence, pages 92 101, 2015. Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1 38, 2017. G okhan Bakir, Thomas Hofmann, Bernhard Sch olkopf, Alexander J. Smola, and Ben Taskar. Predicting structured data. MIT press, 2007. Akshay Balsubramani and Yoav Freund. Optimally combining classifiers using unlabeled data. In Proceedings of The 28th Conference on Learning Theory, volume 40, pages 211 225, Paris, France, 03 06 Jul 2015. Akshay Balsubramani and Yoav Freund. Optimal binary classifier aggregation for general losses. In Advances in Neural Information Processing Systems, volume 29, 2016. Peter L. Bartlett, Michael I. Jordan, and Jon D. Mc Auliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. Luc B egin, Pascal Germain, Fran cois Laviolette, and Jean-Francis Roy. PAC-Bayesian bounds based on the R enyi divergence. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 435 444, May 2016. Shai Ben-David, Nadav Eiron, and Philip M. Long. On the difficulty of approximately maximizing agreements. J. Comput. Syst. Sci., 66(3):496 514, May 2003. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798 1828, March 2013. Dimitri P. Bertsekas. Convex optimization algorithms. Athena Scientific, Belmont, MA, 2015. Kartheek Bondugula, Ver onica Alvarez, Jose I. Segovia-Mart ın Santiago Mazuelas, and Aritz P erez. MRCpy: A library for minimax risk classifiers. ar Xiv preprint, ar Xiv:2108.01952, 2023. Minimax Risk Classifiers with 0 -1 Loss Jonathan M. Borwein and Qiji J. Zhu. Techniques of variational analysis. Springer, Berlin, 2004. Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004. Peter B uhlmann and Sara van de Geer. Statistics for high-dimensional data. Springer, Heidelberg, 2011. Emmanuel J. Cand es, Michael B. Wakin, and Stephen P. Boyd. Enhancing sparsity by reweighted L1 minimization. Journal of Fourier Analysis and Applications, 14:877 905, 2008. Andrea Caponnetto, Charles A. Micchelli, Massimiliano Pontil, and Yiming Ying. Universal multi-task kernels. Journal of Machine Learning Research, 9(52):1615 1646, 2008. Corinna Cortes, Vitaly Kuznetsov, Mehryar Mohri, and Umar Syed. Structural maxent models. In Proceedings of the 32nd International Conference on Machine Learning, pages 391 399, July 2015. Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(19):551 585, 2006. Marie Devaine, Pierre Gaillard, Yannig Goude, and Gilles Stoltz. Forecasting electricity consumption by aggregating specialized experts - a review of the sequential aggregation of specialized experts, with an application to Slovakian and French country-wide one-dayahead (half-)hourly predictions. Machine Learning, 90(2):231 260, 2013. Luc Devroye, L aszl o Gy orfi, and G abor Lugosi. A probabilistic theory of pattern recognition. Springer Verlag, Berlin, 1996. Dheeru Dua and Casey Graff. UCI Machine Learning Repository, 2017. URL http: //archive.ics.uci.edu/ml. John Duchi and Hongseok Namkoong. Variance-based regularization with convex objectives. Journal of Machine Learning Research, 20(68):1 55, 2019. John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 46(3): 946 969, 2021. Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Advances in computational mathematics, 13(1):1 50, 2000. Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, volume 29, pages 4240 4248, 2016. Rizal Fathony, Anqi Liu, Kaiser Asif, and Brian D. Ziebart. Adversarial multiclass classification: A risk minimization perspective. In Advances in Neural Information Processing Systems, volume 29, 2016. Mazuelas, Romero, and Gr unwald Vitaly Feldman, Venkatesan Guruswami, Prasad Raghavendra, and Yi Wu. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558 1590, 2012. Charlie Frogner, Sebastian Claici, Edward Chien, and Justin Solomon. Incorporating unlabeled data into distributionally robust learning. Journal of Machine Learning Research, 22(56):1 46, 2021. Pascal Germain, Alexandre Lacasse, Francois Laviolette, Mario March, and Jean-Francis Roy. Risk bounds for the majority vote: From a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research, 16(26):787 860, 2015. Peter Gr unwald, Thomas Steinke, and Lydia Zakynthinou. PAC-bayes, MAC-bayes and conditional mutual information: Fast rate bounds that handle general VC classes. In Proceedings of Thirty Fourth Conference on Learning Theory, pages 2217 2247, 2021. Peter D. Gr unwald and A. Philip Dawid. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32(4):1367 1433, 2004. Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2019. Daniel Hsu and Sivan Sabato. Loss minimization and parameter estimation with heavy tails. Journal of Machine Learning Research, 17(18):1 40, April 2016. John Langford. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(10):273 306, 2005. Guy Lebanon and John Lafferty. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems, volume 14, pages 447 454, 2001. Jaeho Lee and Maxim Raginsky. Minimax statistical learning with Wasserstein distances. In Advances in Neural Information Processing Systems, volume 31, pages 2692 2701, 2018. David G. Luenberger. Optimization by vector space methods. John Wiley & Sons, New York, 1997. G abor Lugosi and Shahar Mendelson. Mean estimation and regression under heavy-tailed distributions: A survey. Foundations of Computational Mathematics, 19(5):1145 1190, October 2019. Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 73 83, 2009. Santiago Mazuelas and Aritz Perez. General supervision via probabilistic transformations. In 24th European Conference on Artificial Intelligence-ECAI 2020, pages 1348 1354, August 2020. Minimax Risk Classifiers with 0 -1 Loss Santiago Mazuelas, Andrea Zanoni, and Aritz P erez. Minimax classification with 0-1 loss and performance guarantees. In Advances in Neural Information Processing Systems, volume 33, pages 302 312, 2020. Santiago Mazuelas, Yuan Shen, and Aritz P erez. Generalized maximum entropy for supervised classification. IEEE Trans. Inf. Theory, 68(4):2530 2550, April 2022. Zakaria Mhammedi, Peter Gr unwald, and Benjamin Guedj. PAC-Bayes un-expected Bernstein inequality. In Advances in Neural Information Processing Systems, volume 32, pages 12202 12213, 2019. Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(95):2651 2667, 2006. Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In International Conference on Algorithmic Learning Theory, pages 124 138, 2012. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, Cambridge, MA, second edition, 2018. Christine De Mol, Ernesto De Vito, and Lorenzo Rosasco. Elastic-net regularization in learning theory. Journal of Complexity, 25(2):201 230, 2009. Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Sch olkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1 141, 2017. Yurii Nesterov. A method of solving a convex programming problem with convergence rate O 1 k2 . Soviet Math. Doklady, 269(3):543 547, 1983. Yurii Nesterov. Subgradient methods for huge-scale optimization problems. Mathematical Programming, pages 275 297, 2014. Yurii Nesterov and Vladimir Shikhman. Quasi-monotone subgradient methods for nonsmooth convex minimization. Journal of Optimization Theory and Applications, 165(3):917 940, June 2015. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, volume 20, 2008. Soroosh Shafieezadeh-Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, volume 28, pages 1576 1584, 2015. Soroosh Shafieezadeh-Abadeh, Daniel Kuhn, and Peyman Mohajerin Esfahani. Regularization via mass transportation. Journal of Machine Learning Research, 20(103):1 68, 2019. Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University press, New York, 2014. Mazuelas, Romero, and Gr unwald Matthew Staib and Stefanie Jegelka. Distributionally robust optimization and generalization in kernel methods. In Advances in Neural Information Processing Systems, volume 32, pages 9134 9144, 2019. Ingo Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1):128 142, 2005. Wei Tao, Zhisong Pan, Gaowei Wu, and Qing Tao. The strength of Nesterov s extrapolation in the individual convergence of nonsmooth optimization. IEEE Transactions on Neural Networks and Learning Systems, 31(7):2557 2568, 2020. Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(36):1007 1025, 2007. Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(50):1453 1484, 2005. Vladimir Vapnik. Statistical learning theory. Wiley, New York, 1998. Martin J. Wainwright. High-dimensional statistics: a non-asymptotic viewpoint. Cambridge University Press, Cambridge, United Kingdom, 2019. Ian Waudby-Smith and Aaditya Ramdas. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, February 2023. Tianbao Yang and Qihang Lin. RSG: Beating subgradient method without smoothness and strong convexity. Journal of Machine Learning Research, (19):1 33, 2018. Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56 85, February 2004. Hui Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418 1429, 2006.