# Discrete Rényi Classifiers

Meisam Razaviyayn (meisamr@stanford.edu), Farzan Farnia (farnia@stanford.edu), David Tse (dntse@stanford.edu)
Department of Electrical Engineering, Stanford University, Stanford, CA 94305.

**Abstract.** Consider the binary classification problem of predicting a target variable $Y$ from a discrete feature vector $X = (X_1, \dots, X_d)$. When the probability distribution $P(X, Y)$ is known, the optimal classifier, leading to the minimum misclassification rate, is given by the Maximum A-posteriori Probability (MAP) decision rule. However, in practice, estimating the complete joint distribution $P(X, Y)$ is computationally and statistically impossible for large values of $d$. Therefore, an alternative approach is to first estimate some low order marginals of the joint probability distribution $P(X, Y)$ and then design the classifier based on the estimated low order marginals. This approach is also helpful when the complete training data instances are not available due to privacy concerns. In this work, we consider the problem of finding the optimum classifier based on some estimated low order marginals of $(X, Y)$. We prove that for a given set of marginals, the minimum Hirschfeld-Gebelein-Rényi (HGR) correlation principle introduced in [1] leads to a randomized classification rule which is shown to have a misclassification rate no larger than twice the misclassification rate of the optimal classifier. Then, under a separability condition, it is shown that the proposed algorithm is equivalent to a randomized linear regression approach. In addition, this method naturally results in a robust feature selection method selecting a subset of features having the maximum worst case HGR correlation with the target variable. Our theoretical upper bound is similar to that of the recent Discrete Chebyshev Classifier (DCC) approach [2], while the proposed algorithm has significant computational advantages since it only requires solving a least squares optimization problem. Finally, we numerically compare our proposed algorithm with the DCC classifier and show that the proposed algorithm achieves a better misclassification rate over various UCI data repository datasets.

## 1 Introduction

Statistical classification, a core task in many modern data processing and prediction problems, is the problem of predicting labels for a given feature vector based on a set of training data instances containing feature vectors and their corresponding labels. From a probabilistic point of view, this problem can be formulated as follows: given data samples $(X^1, Y^1), \dots, (X^n, Y^n)$ from a probability distribution $P(X, Y)$, predict the target label $y^{\text{test}}$ for a given test point $X = x^{\text{test}}$.

Many modern classification problems involve high dimensional categorical features. For example, in genome-wide association studies (GWAS), the classification task is to predict a trait of interest based on observations of the SNPs in the genome. In this problem, the feature vector $X = (X_1, \dots, X_d)$ is categorical with $X_i \in \{0, 1, 2\}$. What is the optimal classifier leading to the minimum misclassification rate for such a classification problem with high dimensional categorical feature vectors? When the joint probability distribution of the random vector $(X, Y)$ is known, the MAP decision rule, defined by $\delta^{\mathrm{MAP}}(x) \triangleq \arg\max_y P(Y = y \mid X = x)$, achieves the minimum misclassification rate. However, in practice the joint probability distribution $P(X, Y)$ is not known.
Moreover, estimating the complete joint probability distribution is not possible due to the curse of dimensionality. For example, in the above GWAS problem, the dimension of the feature vector $X$ is $d \approx 3{,}000{,}000$, which leads to an alphabet size of $3^{3{,}000{,}000}$ for the feature vector $X$. Hence, a practical approach is to first estimate some low order marginals of $P(X, Y)$, and then use these low order marginals to build a classifier with a low misclassification rate. This approach, which is in the spirit of various machine learning and statistical methods [2-6], is also useful when the complete data instances are not available due to privacy concerns in applications such as medical informatics.

In this work, we consider the above problem of building a classifier from a given set of low order marginals. First, we formally state the problem of finding the robust classifier with the minimum worst case misclassification rate. Our goal is to find a (possibly randomized) decision rule which has the minimum worst case misclassification rate over all probability distributions satisfying the given low order marginals. Then a surrogate objective function, obtained from the minimum HGR correlation principle [1], is used to propose a randomized classification rule. The proposed classification method has a worst case misclassification rate no more than twice the misclassification rate of the optimal classifier. When only pairwise marginals are estimated, we show that this classifier is in fact a randomized linear regression classifier on indicator variables under a separability condition. Then, we formulate a feature selection problem based on the knowledge of pairwise marginals which leads to the minimum misclassification rate. Our analysis provides a theoretical justification for using the group lasso objective function for feature selection over a discrete set of features. Finally, we conclude by presenting numerical experiments comparing the proposed classifier with the discrete Chebyshev classifier [2], Tree Augmented Naive Bayes [3], and the Minimax Probabilistic Machine [4]. In short, the contributions of this work are as follows:

- Providing a rigorous theoretical justification for using the minimum HGR correlation principle for the binary classification problem.
- Proposing a randomized classifier with a misclassification rate no larger than twice the misclassification rate of the optimal classifier.
- Introducing a computationally efficient method for calculating the proposed randomized classifier when pairwise marginals are estimated and a separability condition is satisfied.
- Providing a mathematical justification, based on maximal correlation, for using the group lasso problem for feature selection in categorical data.

**Related Work:** The idea of learning structures in data through low order marginals/moments is popular in machine learning and statistics. For example, the maximum entropy principle [7], which is the spirit of the variational method in graphical models [5] and tree augmented naive Bayes [3], is based on the idea of fixing the marginal distributions and fitting a probabilistic model which maximizes the Shannon entropy. Although these methods fit a probabilistic model satisfying the low order marginals, they do not directly optimize the misclassification rate of the resulting classifier. Another related information theoretic approach is the minimum mutual information principle [8], which finds the probability distribution with the minimum mutual information between the feature vector and the target variable.
This approach is closely related to the framework of this paper; however, unlike the minimum HGR principle, there is no known computationally efficient approach for calculating the probability distribution with the minimum mutual information. In the continuous setting, the idea of minimizing the worst case misclassification rate leads to the minimax probability machine [4]. This algorithm and its analysis are not easily extendible to the discrete scenario. The algorithm most closely related to this work is the recent Discrete Chebyshev Classifier (DCC) algorithm [2]. The DCC is based on the minimization of the worst case misclassification rate over the class of probability distributions with given marginals of the form $(X_i, X_j, Y)$. Similar to our framework, the DCC method achieves a misclassification rate no larger than twice the misclassification rate of the optimum classifier. However, computing the DCC classifier requires solving a non-separable non-smooth optimization problem, which is computationally demanding, while the proposed algorithm leads to a least squares optimization problem with a closed form solution. Furthermore, in contrast to [2], which only considers deterministic decision rules, in this work we consider the class of randomized decision rules. Finally, it is worth noting that the algorithm in [2] requires a tree structure for its bound to be tight, while our proposed algorithm works on non-tree structures as long as the separability condition is satisfied.

## 2 Problem Formulation

Consider the binary classification problem with $d$ discrete features $X_1, X_2, \dots, X_d \in \mathcal{X}$ and a target variable $Y \in \mathcal{Y} \triangleq \{0, 1\}$. Without loss of generality, let us assume that $\mathcal{X} = \{1, 2, \dots, m\}$ and the data points $(X, Y)$ come from an underlying probability distribution $P_{X,Y}(x, y)$. If the joint probability distribution $P(x, y)$ is known, the optimal classifier is given by the maximum a posteriori probability (MAP) estimator, i.e., $\delta^{\mathrm{MAP}}(x) \triangleq \arg\max_{y \in \{0,1\}} P(Y = y \mid X = x)$. However, the joint probability distribution $P(x, y)$ is often not known in practice. Therefore, in order to utilize the MAP rule, one should first estimate $P(x, y)$ using the training data instances. Unfortunately, estimating the joint probability distribution requires estimating the value of $P(X = x, Y = y)$ for all $(x, y) \in \mathcal{X}^d \times \mathcal{Y}$, which is intractable for large values of $d$. Therefore, as mentioned earlier, our approach is to first estimate some low order marginals of the joint probability distribution $P(\cdot)$, and then utilize the minimax criterion for classification.

Let $\mathcal{C}$ be the class of probability distributions satisfying the estimated marginals. For example, when only pairwise marginals of the ground-truth distribution $P$ are estimated, the set $\mathcal{C}$ is the class of distributions satisfying the given pairwise marginals, i.e.,

$$\mathcal{C}_{\text{pairwise}} \triangleq \left\{ P_{X,Y}(\cdot,\cdot) \,\middle|\, P_{X_i,X_j}(x_i, x_j) = \widehat{P}_{X_i,X_j}(x_i, x_j),\; P_{X_i,Y}(x_i, y) = \widehat{P}_{X_i,Y}(x_i, y),\ \forall x_i, x_j \in \mathcal{X},\ \forall y \in \mathcal{Y},\ \forall i, j \right\}, \tag{1}$$

where $\widehat{P}$ denotes the estimated marginals. In general, $\mathcal{C}$ could be any class of probability distributions satisfying a set of estimated low order marginals. Let us also define $\delta$ to be a randomized classification rule with

$$\delta(x) = \begin{cases} 0 & \text{with probability } q^x_\delta \\ 1 & \text{with probability } 1 - q^x_\delta, \end{cases}$$

for some $q^x_\delta \in [0, 1]$, $\forall x \in \mathcal{X}^d$. Given a randomized decision rule $\delta$ and a joint probability distribution $P_{X,Y}(x, y)$, we can extend $P(\cdot)$ to include our randomized decision rule. Then the misclassification rate of the decision rule $\delta$, under the probability distribution $P(\cdot)$, is given by $P(\delta(X) \neq Y)$. Hence, under the minimax criterion, we are looking for a decision rule $\delta^*$ which minimizes the worst case misclassification rate. In other words, the robust decision rule is given by

$$\delta^* \in \arg\min_{\delta \in \mathcal{D}} \; \max_{P \in \mathcal{C}} \; P\big(\delta(X) \neq Y\big), \tag{2}$$

where $\mathcal{D}$ is the set of all randomized decision rules. Notice that the optimal decision rule $\delta^*$ may not be unique in general.
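To make the constraint set (1) concrete, the estimated pairwise marginals $\widehat{P}_{X_i,X_j}$ and $\widehat{P}_{X_i,Y}$ can be obtained from training data by simple counting. The following is a minimal NumPy sketch of this estimation step; the function name and array layout are our own illustration, not part of the paper:

```python
import numpy as np

def pairwise_marginals(X, y, m):
    """Empirical marginals defining C_pairwise in (1).

    X: (n, d) integer array with entries in {1, ..., m}; y: (n,) in {0, 1}.
    Returns
      P_xx[i, j]: (m, m) array estimating P(X_i = k, X_j = l),
      P_xy[i]:    (m, 2) array estimating P(X_i = k, Y = y).
    """
    n, d = X.shape
    P_xx = np.zeros((d, d, m, m))
    P_xy = np.zeros((d, m, 2))
    for i in range(d):
        for k in range(m):
            mask = (X[:, i] == k + 1)
            P_xy[i, k, 0] = np.mean(mask & (y == 0))
            P_xy[i, k, 1] = np.mean(mask & (y == 1))
            for j in range(d):
                for l in range(m):
                    P_xx[i, j, k, l] = np.mean(mask & (X[:, j] == l + 1))
    return P_xx, P_xy
```

The quadruple loop is written for clarity; in practice these counts would be vectorized, but the point is that only $O(d^2 m^2)$ numbers need to be estimated, rather than the $2m^d$ entries of the full joint distribution.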
## 3 Worst Case Error Minimization

In this section, we propose a surrogate objective for (2) which leads to a decision rule with a misclassification rate no larger than twice that of the optimal decision rule $\delta^*$. Later we show that the proposed surrogate objective is connected to the minimum HGR principle [1].

Let us start by rewriting (2) as an optimization problem over real valued variables. Notice that each probability distribution $P_{X,Y}(\cdot,\cdot)$ can be represented by a probability vector $\mathbf{p} = [p_{x,y} \mid (x, y) \in \mathcal{X}^d \times \mathcal{Y}] \in \mathbb{R}^{2m^d}$ with $p_{x,y} = P(X = x, Y = y)$ and $\sum_{x,y} p_{x,y} = 1$. Similarly, every randomized rule $\delta$ can be represented by a vector $\mathbf{q}_\delta = [q^x_\delta \mid x \in \mathcal{X}^d] \in \mathbb{R}^{m^d}$. Adopting these notations, the set $\mathcal{C}$ can be rewritten in terms of the probability vector $\mathbf{p}$ as $\mathcal{C} = \{\mathbf{p} \mid A\mathbf{p} = \mathbf{b},\ \mathbf{1}^T\mathbf{p} = 1,\ \mathbf{p} \geq 0\}$, where the system of linear equations $A\mathbf{p} = \mathbf{b}$ represents all the low order marginal constraints defining $\mathcal{C}$, and the notation $\mathbf{1}$ denotes the vector of all ones. Therefore, problem (2) can be reformulated as

$$\mathbf{q}_{\delta^*} \in \arg\min_{0 \leq \mathbf{q}_\delta \leq 1} \; \max_{\mathbf{p} \in \mathcal{C}} \; \sum_x \left( q^x_\delta\, p_{x,1} + (1 - q^x_\delta)\, p_{x,0} \right), \tag{3}$$

where $p_{x,0}$ and $p_{x,1}$ denote the elements of the vector $\mathbf{p}$ corresponding to the probability values $P(X = x, Y = 0)$ and $P(X = x, Y = 1)$, respectively. A simple application of the minimax theorem [9] implies that the saddle point of the above optimization problem exists and, moreover, the optimal decision rule is a MAP rule for a certain probability distribution $P^* \in \mathcal{C}$. In other words, there exists a pair $(\delta^*, P^*)$ for which

$$P(\delta^*(X) \neq Y) \leq P^*(\delta^*(X) \neq Y),\ \forall P \in \mathcal{C} \quad \text{and} \quad P^*(\delta(X) \neq Y) \geq P^*(\delta^*(X) \neq Y),\ \forall \delta \in \mathcal{D}.$$

Although the above observation characterizes the optimal decision rule to some extent, it does not provide a computationally efficient approach for finding it. Notice that it is NP-hard even to verify the existence of a probability distribution satisfying a given set of low order marginals [10]. Based on this observation and the result in [11], we conjecture that in general, solving (2) is NP-hard in the number of variables and the alphabet size, even when the set $\mathcal{C}$ is nonempty. Hence, here we focus on developing a framework to find an approximate solution of (2).

Let us continue by utilizing the minimax theorem [9] and obtaining the worst case probability distribution in (3) by $\mathbf{p}^* \in \arg\max_{\mathbf{p} \in \mathcal{C}} \min_{0 \leq \mathbf{q}_\delta \leq 1} \sum_x (q^x_\delta\, p_{x,1} + (1 - q^x_\delta)\, p_{x,0})$, or equivalently,

$$\mathbf{p}^* \in \arg\max_{\mathbf{p} \in \mathcal{C}} \; \sum_x \min\{p_{x,0},\, p_{x,1}\}. \tag{4}$$

Despite the convexity of the above problem, there are two sources of hardness which make the problem intractable for moderate and large values of $d$. Firstly, the objective function is non-smooth. Secondly, the number of optimization variables is $2m^d$, which grows exponentially with the dimension $d$. To deal with the first issue, notice that the function inside the summation is the max-min fairness objective between the two quantities $p_{x,1}$ and $p_{x,0}$. Replacing this objective with the harmonic average leads to the following smooth convex optimization problem:

$$\widetilde{\mathbf{p}} \in \arg\max_{\mathbf{p} \in \mathcal{C}} \; \sum_x \frac{p_{x,1}\, p_{x,0}}{p_{x,1} + p_{x,0}}. \tag{5}$$

It is worth noting that the harmonic mean of the two quantities is intuitively a reasonable surrogate for the original objective function, since

$$\frac{p_{x,1}\, p_{x,0}}{p_{x,1} + p_{x,0}} \;\leq\; \min\{p_{x,0},\, p_{x,1}\} \;\leq\; \frac{2\, p_{x,1}\, p_{x,0}}{p_{x,1} + p_{x,0}}. \tag{6}$$
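Inequality (6) is a standard sandwich between the harmonic-type term and the minimum, and it can be checked numerically in a couple of lines (a purely illustrative sanity check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p0, p1 = rng.random(1000) + 1e-9, rng.random(1000) + 1e-9
harm = p0 * p1 / (p0 + p1)
assert np.all(harm <= np.minimum(p0, p1) + 1e-12)      # left side of (6)
assert np.all(np.minimum(p0, p1) <= 2 * harm + 1e-12)  # right side of (6)
```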
Although this inequality suggests that the objective functions in (5) and (4) are close to each other, it is not clear whether the distribution $\widetilde{\mathbf{p}}$ leads to a classification rule with a low misclassification rate for all distributions in $\mathcal{C}$. In order to obtain a classification rule from $\widetilde{\mathbf{p}}$, the first naive approach is to use the MAP decision rule based on $\widetilde{\mathbf{p}}$. However, the following result shows that this decision rule does not achieve the factor-two misclassification rate guarantee obtained in [2].

**Theorem 1** Let us define $\widetilde{\delta}^{\mathrm{map}}(x) \triangleq \arg\max_{y \in \mathcal{Y}} \widetilde{p}_{x,y}$ with the worst case error probability $\widetilde{e}^{\,\mathrm{map}} \triangleq \max_{P \in \mathcal{C}} P\big(\widetilde{\delta}^{\mathrm{map}}(X) \neq Y\big)$. Then,

$$e^* \;\leq\; \widetilde{e}^{\,\mathrm{map}} \;\leq\; 4 e^*,$$

where $e^*$ is the worst case misclassification rate of the optimal decision rule $\delta^*$, that is, $e^* \triangleq \max_{P \in \mathcal{C}} P(\delta^*(X) \neq Y)$.

**Proof** The proof is similar to the proof of the next theorem and hence omitted here.

Next we show that, surprisingly, one can obtain a randomized decision rule based on the solution of (5) which has a misclassification rate no larger than twice that of the optimal decision rule $\delta^*$. Given $\widetilde{\mathbf{p}}$ as the optimal solution of (5), define the random decision rule $\widetilde{\delta}$ as

$$\widetilde{\delta}(x) = \begin{cases} 0 & \text{with probability } \dfrac{\widetilde{p}^{\,2}_{x,0}}{\widetilde{p}^{\,2}_{x,0} + \widetilde{p}^{\,2}_{x,1}} \\[6pt] 1 & \text{with probability } \dfrac{\widetilde{p}^{\,2}_{x,1}}{\widetilde{p}^{\,2}_{x,0} + \widetilde{p}^{\,2}_{x,1}}. \end{cases} \tag{7}$$

Let $\widetilde{e}$ be the worst case classification error of the decision rule $\widetilde{\delta}$, i.e., $\widetilde{e} \triangleq \max_{P \in \mathcal{C}} P\big(\widetilde{\delta}(X) \neq Y\big)$. Clearly, $e^* \leq \widetilde{e}$ according to the definition of the optimal decision rule $\delta^*$. The following theorem shows that $\widetilde{e}$ is also upper-bounded by twice the optimal misclassification rate $e^*$.

**Theorem 2** Define

$$\theta \triangleq \max_{\mathbf{p} \in \mathcal{C}} \; \sum_x \frac{p_{x,1}\, p_{x,0}}{p_{x,1} + p_{x,0}}. \tag{8}$$

Then, $\theta \leq \widetilde{e} \leq 2\theta \leq 2 e^*$. In other words, the worst case misclassification rate of the decision rule $\widetilde{\delta}$ is at most twice that of the optimal decision rule $\delta^*$.

**Proof** The proof is relegated to the supplementary materials.

So far, we have resolved the non-smoothness issue in solving (4) by using a surrogate objective function. In the next section, we resolve the second issue by establishing the connection between problem (5) and the minimum HGR correlation principle [1]. Then, we use the existing result in [1] to develop a computationally efficient approach for calculating the decision rule $\widetilde{\delta}(\cdot)$ for $\mathcal{C}_{\text{pairwise}}$.

## 4 Connection to Hirschfeld-Gebelein-Rényi Correlation

A commonplace approach to infer models from data is to employ the maximum entropy principle [7]. This principle states that, given a set of constraints on the ground-truth distribution, the distribution with the maximum (Shannon) entropy under those constraints is a proper representer of the class. To extend this rule to the classification problem, the authors in [8] suggest picking the distribution maximizing the entropy of the target conditioned on the features, or equivalently minimizing the mutual information between target and features. Unfortunately, this approach does not lead to a computationally efficient method for model fitting, and there is no guarantee on the misclassification rate of the resulting classifier.

Here we study an alternative approach: the minimum HGR correlation principle [1]. This principle suggests picking the distribution in $\mathcal{C}$ minimizing the HGR correlation between the target variable and the features. The HGR correlation coefficient between two random objects $X$ and $Y$, which was first introduced by Hirschfeld and Gebelein [12, 13] and then studied by Rényi [14], is defined as

$$\rho(X, Y) \triangleq \sup_{f,\, g} \; \mathbb{E}\left[ f(X)\, g(Y) \right],$$

where the supremum is taken over the class of all measurable functions $f(\cdot)$ and $g(\cdot)$ with $\mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0$ and $\mathbb{E}[f^2(X)] = \mathbb{E}[g^2(Y)] = 1$.
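For finite alphabets, the HGR coefficient has a well-known linear-algebraic characterization: it equals the second largest singular value of the matrix $B_{x,y} = P(x, y)/\sqrt{P_X(x)\, P_Y(y)}$. The sketch below uses this fact to compute $\rho(X, Y)$ numerically; it is our own illustration (assuming strictly positive marginals), not an algorithm from the paper:

```python
import numpy as np

def hgr_correlation(P):
    """HGR maximal correlation of two discrete random variables.

    P: 2-D array with P[x, y] = P(X = x, Y = y), entries summing to 1.
    Assumes every row and column marginal is strictly positive.
    """
    px = P.sum(axis=1)                      # marginal distribution of X
    py = P.sum(axis=0)                      # marginal distribution of Y
    B = P / np.sqrt(np.outer(px, py))       # B[x,y] = P(x,y)/sqrt(P(x)P(y))
    s = np.linalg.svd(B, compute_uv=False)  # singular values, descending
    return s[1]                             # s[0] = 1 (constant functions)

# Example: an independent pair has zero HGR correlation.
P_indep = np.outer([0.3, 0.7], [0.5, 0.5])
print(hgr_correlation(P_indep))  # ~0.0
```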
The HGR correlation coefficient has many desirable properties. For example, it is normalized to be between 0 and 1. Furthermore, this coefficient is zero if and only if the two random variables are independent, and it is one if there is a strict dependence between $X$ and $Y$. For other properties of the HGR correlation coefficient, see [14, 15] and the references therein.

**Lemma 1** Assume the random variable $Y$ is binary and define $q \triangleq P(Y = 0)$. Then,

$$\rho(X, Y) = \sqrt{1 - \frac{1}{q(1 - q)} \sum_x \frac{P_{X,Y}(x, 0)\, P_{X,Y}(x, 1)}{P_{X,Y}(x, 0) + P_{X,Y}(x, 1)}}.$$

**Proof** The proof is relegated to the supplementary material.

This lemma leads to the following observation.

**Observation:** Assume the marginal distributions $P(Y = 0)$ and $P(Y = 1)$ are fixed for every distribution $P \in \mathcal{C}$. Then, the distribution in $\mathcal{C}$ with the minimum HGR correlation between $X$ and $Y$ is the distribution $\widetilde{P}$ obtained by solving (5). In other words, $\rho(X, Y; \widetilde{P}) \leq \rho(X, Y; P),\ \forall P \in \mathcal{C}$, where $\rho(X, Y; P)$ denotes the HGR correlation coefficient under the probability distribution $P$.

Based on the above observation, from now on we call the classifier $\widetilde{\delta}(\cdot)$ in (7) the Rényi classifier. In the next section, we use the result of the recent work [1] to compute the Rényi classifier $\widetilde{\delta}(\cdot)$ for a special class of marginals $\mathcal{C} = \mathcal{C}_{\text{pairwise}}$.

## 5 Computing the Rényi Classifier Based on Pairwise Marginals

In many practical problems, the number of features $d$ is large and therefore it is only computationally tractable to estimate marginals of order at most two. Hence, hereafter, we restrict ourselves to the case where only the first and second order marginals of the distribution $P$ are estimated, i.e., $\mathcal{C} = \mathcal{C}_{\text{pairwise}}$. In this scenario, in order to predict the output of the Rényi classifier for a given data point $x$, one needs to find the values of $\widetilde{p}_{x,0}$ and $\widetilde{p}_{x,1}$. Next, we state a result from [1] which sheds light on the computation of $\widetilde{p}_{x,0}$ and $\widetilde{p}_{x,1}$.

To state the theorem, we need the following definitions. Let the matrix $\mathbf{Q} \in \mathbb{R}^{dm \times dm}$ and the vector $\mathbf{d} \in \mathbb{R}^{dm \times 1}$ be defined through their entries as

$$Q_{mi+k,\, mj+\ell} = P(X_{i+1} = k,\ X_{j+1} = \ell), \qquad d_{mi+k} = P(X_{i+1} = k,\ Y = 1) - P(X_{i+1} = k,\ Y = 0),$$

for every $i, j = 0, \dots, d - 1$ and $k, \ell = 1, \dots, m$. Also define the function $h(\mathbf{z}): \mathbb{R}^{md \times 1} \mapsto \mathbb{R}$ as $h(\mathbf{z}) \triangleq \sum_{i=1}^d \max\{z_{mi-m+1}, z_{mi-m+2}, \dots, z_{mi}\}$. Then, we have the following theorem.

**Theorem 3 (Rephrased from [1])** Assume $\mathcal{C}_{\text{pairwise}} \neq \emptyset$. Let

$$\gamma^* \triangleq \min_{\mathbf{z} \in \mathbb{R}^{md \times 1}} \; \mathbf{z}^T \mathbf{Q} \mathbf{z} - \mathbf{d}^T \mathbf{z} + \frac{1}{4}. \tag{9}$$

Then,

$$\sqrt{1 - \frac{\gamma^*}{q(1 - q)}} \;\leq\; \min_{P \in \mathcal{C}_{\text{pairwise}}} \rho(X, Y; P),$$

where the inequality holds with equality if and only if there exists a solution $\mathbf{z}^*$ of (9) such that $h(\mathbf{z}^*) \leq \frac{1}{2}$ and $h(-\mathbf{z}^*) \leq \frac{1}{2}$; or equivalently, if and only if the following separability condition is satisfied for some $P \in \mathcal{C}_{\text{pairwise}}$:

$$\mathbb{E}_P[Y \mid X = x] = \sum_{i=1}^d \zeta_i(x_i), \quad \forall x \in \mathcal{X}^d, \tag{10}$$

for some functions $\zeta_1, \dots, \zeta_d$. Moreover, when the separability condition holds,

$$\widetilde{P}\big(Y = y \mid X = (x_1, \dots, x_d)\big) = \frac{1}{2} - (-1)^y \sum_{i=1}^d z^*_{(i-1)m + x_i}. \tag{11}$$

Combining the above theorem with the equality

$$\frac{\widetilde{P}^2(Y = 0,\, X = x)}{\widetilde{P}^2(Y = 0,\, X = x) + \widetilde{P}^2(Y = 1,\, X = x)} = \frac{\widetilde{P}^2(Y = 0 \mid X = x)}{\widetilde{P}^2(Y = 0 \mid X = x) + \widetilde{P}^2(Y = 1 \mid X = x)}$$

shows that the decision rules $\widetilde{\delta}$ and $\widetilde{\delta}^{\mathrm{map}}$ can be computed in a computationally efficient manner under the separability condition. Notice that when the separability condition is not satisfied, the approach proposed in this section still provides a classification rule whose error rate is bounded by $2\gamma^*$. However, this error rate no longer enjoys the 2-factor approximation guarantee. It is also worth mentioning that the separability condition is a property of the class of distributions $\mathcal{C}_{\text{pairwise}}$ and is independent of the classifier at hand. Moreover, this condition is satisfied with a positive measure over the simplex of all probability distributions, as discussed in [1].
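Assuming $\mathbf{Q}$ and $\mathbf{d}$ have been formed from the (estimated) pairwise marginals as defined above, problem (9) is an unconstrained convex quadratic, so a minimizer follows directly from the first-order condition $2\mathbf{Q}\mathbf{z} = \mathbf{d}$. The sketch below is our own illustration of this step, including the separability check from Theorem 3; the paper itself does not prescribe a particular solver:

```python
import numpy as np

def solve_renyi_quadratic(Q, dvec):
    """Minimize z^T Q z - d^T z + 1/4, i.e., problem (9).

    The first-order condition is 2 Q z = d; the pseudo-inverse is used
    because Q built from pairwise marginals is typically rank-deficient.
    Returns a minimizer z_star and the optimal value gamma_star.
    """
    z_star = 0.5 * np.linalg.pinv(Q) @ dvec
    gamma_star = z_star @ Q @ z_star - dvec @ z_star + 0.25
    return z_star, gamma_star

def h(z, m):
    """h(z): sum over features i of the max of the i-th block of m entries."""
    return z.reshape(-1, m).max(axis=1).sum()

def separability_holds(z_star, m):
    """Equality test from Theorem 3: h(z*) <= 1/2 and h(-z*) <= 1/2."""
    return h(z_star, m) <= 0.5 and h(-z_star, m) <= 0.5
```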
Two remarks are in order.

**Inexact knowledge of the marginal distribution:** The optimization problem (9) is equivalent to solving the stochastic optimization problem

$$\mathbf{z}^* = \arg\min_{\mathbf{z}} \; \mathbb{E}\left[ \left( \mathbf{W}^T \mathbf{z} - C \right)^2 \right],$$

where $\mathbf{W} \in \{0, 1\}^{md \times 1}$ is a random vector with $W_{m(i-1)+k} = 1$ if $X_i = k$ and $W_{m(i-1)+k} = 0$ otherwise; and $C \in \{-\frac{1}{2}, \frac{1}{2}\}$ is the random variable with $C = \frac{1}{2}$ if $Y = 1$ and $C = -\frac{1}{2}$ otherwise. Here the expectation can be calculated with respect to any distribution in $\mathcal{C}$. Hence, in practice, the above optimization problem can be approximated using the Sample Average Approximation (SAA) method [16, 17] through the optimization problem

$$\widehat{\mathbf{z}} = \arg\min_{\mathbf{z}} \; \frac{1}{n} \sum_{i=1}^n \left( (\mathbf{w}^i)^T \mathbf{z} - c^i \right)^2,$$

where $(\mathbf{w}^i, c^i)$ corresponds to the $i$-th training data point $(\mathbf{x}^i, y^i)$. Clearly, this is a least squares problem with a closed form solution. Notice that in order to bound the SAA error and avoid overfitting, one could restrict the search space for $\widehat{\mathbf{z}}$ [18]. This could also be done using regularizers such as ridge regression, i.e., by solving

$$\widehat{\mathbf{z}}^{\,\text{ridge}} = \arg\min_{\mathbf{z}} \; \frac{1}{n} \sum_{i=1}^n \left( (\mathbf{w}^i)^T \mathbf{z} - c^i \right)^2 + \lambda_{\text{ridge}} \|\mathbf{z}\|_2^2.$$

**Beyond pairwise marginals:** When $d$ is small, one might be interested in estimating higher order marginals for predicting $Y$. In this scenario, a simple modification of the algorithm is to define the new set of feature random variables $\{\widetilde{X}_{ij} = (X_i, X_j) \mid i \neq j\}$ and apply the algorithm to this new set of feature variables. It is not hard to see that this approach utilizes the marginal information $P(X_i, X_j, X_k, X_\ell)$ and $P(X_i, X_j, Y)$.
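The SAA problem in the first remark above is an ordinary ridge regression on one-hot indicator variables, and the resulting $\widehat{\mathbf{z}}$ plugs into (11) and (7) to produce predictions. A minimal sketch of this pipeline follows; the function names, the clipping safeguard, and the default regularization value are our own choices, not the paper's:

```python
import numpy as np

def fit_renyi_classifier(X, y, lam_ridge=1e-3):
    """Ridge-regularized SAA fit of z (Section 5, first remark).

    X: (n, d) integer array with entries in {1, ..., m}; y: (n,) in {0, 1}.
    Returns the fitted vector z_hat and the alphabet size m.
    """
    n, d = X.shape
    m = int(X.max())
    W = np.zeros((n, d * m))                     # one-hot indicator matrix
    rows = np.repeat(np.arange(n), d)
    cols = (np.arange(d) * m + (X - 1)).ravel()
    W[rows, cols] = 1.0
    c = y - 0.5                                  # C = +1/2 if Y=1, else -1/2
    A = W.T @ W / n + lam_ridge * np.eye(d * m)
    z_hat = np.linalg.solve(A, W.T @ c / n)      # closed-form ridge solution
    return z_hat, m

def predict_renyi(X, z_hat, m, randomized=True, rng=None):
    """Predict via (11), using the randomized rule (7) or its MAP variant."""
    s = z_hat[np.arange(X.shape[1]) * m + (X - 1)].sum(axis=1)
    p1 = np.clip(0.5 + s, 1e-12, None)           # ~ P(Y=1 | X), cf. (11)
    p0 = np.clip(0.5 - s, 1e-12, None)
    if not randomized:
        return (p1 > p0).astype(int)             # the MAP rule
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random(len(s)) < p1**2 / (p0**2 + p1**2)).astype(int)
```

Note that for the randomized rule it suffices to work with the conditional probabilities, since, as stated after Theorem 3, the ratio of squared joints equals the ratio of squared conditionals.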
## 6 Robust Rényi Feature Selection

The task of feature selection for classification purposes is to preselect a subset of features for use in model fitting and prediction. Shannon mutual information, a measure of dependence between two random variables, is used in many recent works as an objective for feature selection [19, 20]. In these works, the idea is to select a small subset of features with maximum dependence with the target variable $Y$. In other words, the task is to find a subset of variables $S \subseteq \{1, \dots, d\}$ with $|S| \leq k$ based on the following optimization problem:

$$S^{\mathrm{MI}} \in \arg\max_{S \subseteq \{1, \dots, d\},\ |S| \leq k} \; I(X_S;\, Y), \tag{12}$$

where $X_S \triangleq (X_i)_{i \in S}$ and $I(X_S; Y)$ denotes the mutual information between the random variables $X_S$ and $Y$. Almost all of the existing approaches for solving (12) are heuristic and greedy in nature, aiming to find a sub-optimal solution of (12). Here, we suggest replacing mutual information with the maximal correlation. Furthermore, since estimating the joint distribution of $X$ and $Y$ is computationally and statistically impossible for a large number of features $d$, we suggest estimating some low order marginals of the ground-truth distribution $P(X, Y)$ and then solving the following robust Rényi feature selection problem:

$$S^{\mathrm{RFS}} \in \arg\max_{S \subseteq \{1, \dots, d\},\ |S| \leq k} \; \min_{P \in \mathcal{C}} \; \rho(X_S, Y; P). \tag{13}$$

When only pairwise marginals are estimated from the training data, i.e., $\mathcal{C} = \mathcal{C}_{\text{pairwise}}$, maximizing the lower bound $\sqrt{1 - \frac{\gamma^*}{q(1-q)}}$ instead of (13) leads to the following optimization problem:

$$\widehat{S}^{\mathrm{RFS}} \in \arg\max_{|S| \leq k} \; \sqrt{1 - \frac{1}{q(1 - q)} \left( \min_{\mathbf{z} \in \mathcal{Z}_S} \mathbf{z}^T \mathbf{Q} \mathbf{z} - \mathbf{d}^T \mathbf{z} + \frac{1}{4} \right)},$$

or equivalently,

$$\widehat{S}^{\mathrm{RFS}} \in \arg\min_{|S| \leq k} \; \min_{\mathbf{z} \in \mathcal{Z}_S} \; \mathbf{z}^T \mathbf{Q} \mathbf{z} - \mathbf{d}^T \mathbf{z},$$

where $\mathcal{Z}_S \triangleq \{\mathbf{z} \in \mathbb{R}^{md} \mid \sum_{k=1}^m |z_{mi-m+k}| = 0,\ \forall i \notin S\}$. This problem is of a combinatorial nature. However, using the standard group Lasso regularizer leads to the feature selection procedure in Algorithm 1.

**Algorithm 1: Robust Rényi Feature Selection**
1. Choose a regularization parameter $\lambda > 0$ and define $h(\mathbf{z}) \triangleq \sum_{i=1}^d \max\{z_{mi-m+1}, \dots, z_{mi}\}$.
2. Let $\widehat{\mathbf{z}}^{\mathrm{RFS}} \in \arg\min_{\mathbf{z}} \; \mathbf{z}^T \mathbf{Q} \mathbf{z} - \mathbf{d}^T \mathbf{z} + \lambda\, h(|\mathbf{z}|)$.
3. Set $S = \{i \mid \sum_{k=1}^m |\widehat{z}^{\mathrm{RFS}}_{mi-m+k}| > 0\}$.

Notice that, when the pairwise marginals are estimated from a set of training data points, the above feature selection procedure is equivalent to applying the group Lasso regularizer to the standard linear regression problem over the domain of indicator variables. Our framework provides a justification for this approach based on the robust maximal correlation feature selection problem (13).

**Remark 1** Another natural approach to feature selection is to select a subset of features $S$ by minimizing the worst case classification error, i.e., solving the following optimization problem:

$$\min_{|S| \leq k} \; \min_{\delta \in \mathcal{D}_S} \; \max_{P \in \mathcal{C}} \; P(\delta(X) \neq Y), \tag{14}$$

where $\mathcal{D}_S$ is the set of randomized decision rules which only use the feature variables in $S$. Define $F(S) \triangleq \min_{\delta \in \mathcal{D}_S} \max_{P \in \mathcal{C}} P(\delta(X) \neq Y)$. It can be shown that $\min_{|S| \leq k} F(S) \leq \min_{|S| \leq k} \min_{\mathbf{z} \in \mathcal{Z}_S} \mathbf{z}^T \mathbf{Q} \mathbf{z} - \mathbf{d}^T \mathbf{z} + \frac{1}{4}$. Therefore, another justification for Algorithm 1 is that it minimizes an upper bound of $F(S)$ instead of $F(S)$ itself.

**Remark 2** The Alternating Direction Method of Multipliers (ADMM) algorithm [21] can be used for solving the optimization problem in Algorithm 1; see the supplementary material for more details. A sketch of a simpler proximal-gradient alternative is given below.
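Remark 2 points to ADMM for the exact problem; as a lightweight illustration, the following sketch solves a closely related group-lasso problem by proximal gradient descent. For simplicity it swaps the paper's per-block $\max|z|$ penalty for the more common $\ell_2$ block norm, which also zeroes out whole blocks; the function name, tolerances, and defaults are ours:

```python
import numpy as np

def group_lasso_feature_selection(Q, dvec, m, lam=0.1, n_iters=500):
    """Proximal-gradient sketch in the spirit of Algorithm 1.

    Minimizes z^T Q z - d^T z + lam * sum_i ||z_i||_2, where z_i is the
    i-th block of m coordinates. Note: this uses the l2 block norm in
    place of the paper's per-block max |z|; the paper itself suggests
    ADMM for the original formulation (Remark 2).
    Returns the selected feature set S and the solution z.
    """
    dm = Q.shape[0]
    z = np.zeros(dm)
    L = 2.0 * np.linalg.eigvalsh(Q)[-1] + 1e-12   # Lipschitz const. of grad
    step = 1.0 / L
    for _ in range(n_iters):
        z = z - step * (2.0 * Q @ z - dvec)       # gradient step
        blocks = z.reshape(-1, m)                 # one block per feature
        norms = np.linalg.norm(blocks, axis=1, keepdims=True)
        shrink = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
        z = (blocks * shrink).ravel()             # block soft-thresholding
    S = {i for i in range(dm // m)
         if np.abs(z[i * m:(i + 1) * m]).sum() > 1e-8}
    return S, z
```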
## 7 Numerical Results

We evaluated the performance of the Rényi classifiers $\widetilde{\delta}$ and $\widetilde{\delta}^{\mathrm{map}}$ on five different binary classification datasets from the UCI machine learning data repository. The results are compared with five different benchmarks used in [2]: the Discrete Chebyshev Classifier [2], greedy DCC [2], Tree Augmented Naive Bayes [3], the Minimax Probabilistic Machine [4], and support vector machines (SVM). In addition to the classifiers $\widetilde{\delta}$ and $\widetilde{\delta}^{\mathrm{map}}$, which only use pairwise marginals, we also use higher order marginals in $\widetilde{\delta}_2$ and $\widetilde{\delta}^{\mathrm{map}}_2$. These classifiers are obtained by defining the new feature variables $\{\widetilde{X}_{ij} = (X_i, X_j)\}$ as discussed in Section 5. Since in this scenario the number of features is large, we combine our Rényi classifier with the proposed group lasso feature selection. In other words, we first select a subset of $\{\widetilde{X}_{ij}\}$ and then find the maximum correlation classifier for the selected features. The values of $\lambda_{\text{ridge}}$ and $\lambda$ are determined through cross validation. The results are averaged over 100 Monte Carlo runs, each using 70% of the data for training and the rest for testing. The results are summarized in the table below, where each number shows the percentage error of the corresponding method. The boldface numbers denote the best performance on each dataset. As can be seen in this table, in four of the tested datasets, at least one of the proposed methods outperforms the other benchmarks. Furthermore, it can be seen that the classifier $\widetilde{\delta}^{\mathrm{map}}$ on average performs better than $\widetilde{\delta}$. This fact could be due to the specific properties of the underlying probability distribution in each dataset.

| Dataset | $\widetilde{\delta}^{\mathrm{map}}$ | $\widetilde{\delta}$ | $\widetilde{\delta}^{\mathrm{map}}_2$ | $\widetilde{\delta}_2$ | $\widetilde{\delta}^{\mathrm{map}}_{\mathrm{FS},2}$ | $\widetilde{\delta}_{\mathrm{FS},2}$ | DCC | gDCC | MPM | TAN | SVM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| adult | 17 | 21 | **16** | 20 | **16** | 20 | 18 | 18 | 22 | 18 | 22 |
| credit | **13** | 16 | 16 | 17 | 16 | 17 | 14 | **13** | **13** | 17 | 16 |
| kr-vs-kp | 5 | 10 | 5 | 14 | 5 | 14 | 10 | 10 | 5 | 7 | **3** |
| promoters | 6 | 16 | **3** | 4 | **3** | 4 | 5 | **3** | 6 | 44 | 9 |
| votes | 3 | 4 | 3 | 4 | **2** | 4 | 3 | 3 | 4 | 8 | 5 |

In order to evaluate the computational efficiency of the Rényi classifier, we compare its running time with that of SVM on a synthetic data set with $d = 10{,}000$ features and $n = 200$ data points. Each feature $X_i$ is generated by an i.i.d. Bernoulli distribution with $P(X_i = 1) = 0.7$. The target variable $y$ is generated by $y = \operatorname{sign}(\alpha^T X + n)$ with $n \sim \mathcal{N}(0, 1)$; and $\alpha \in \mathbb{R}^d$ is generated with 30% nonzero elements, each drawn from the standard Gaussian distribution $\mathcal{N}(0, 1)$. The results are averaged over 1000 Monte Carlo runs, each generating a fresh data set and using 85% of the data points for training and 15% for testing. The Rényi classifier is obtained by the gradient descent method with regularizer $\lambda_{\text{ridge}} = 10^4$. The numerical experiment shows a 19.7% average misclassification rate for SVM and 19.9% for the Rényi classifier. However, the average training time of the Rényi classifier is 0.2 seconds, while the training time of SVM (with the Matlab SVM command) is 1.25 seconds.
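For reproducibility, the synthetic data set described above can be generated as follows; this is our own reading of the stated setup (parameter defaults mirror the text, and labels are mapped to {0, 1}):

```python
import numpy as np

def make_synthetic(n=200, d=10_000, p=0.7, sparsity=0.3, rng=None):
    """Synthetic benchmark of Section 7 (our reading of the setup).

    Features are i.i.d. Bernoulli(p) in {0, 1}; y = sign(alpha^T X + noise)
    with 30% of alpha's entries nonzero, each drawn from N(0, 1).
    Returns X in {0, 1} and labels y in {0, 1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = (rng.random((n, d)) < p).astype(int)
    alpha = np.zeros(d)
    nz = rng.choice(d, size=int(sparsity * d), replace=False)
    alpha[nz] = rng.standard_normal(len(nz))
    y = (X @ alpha + rng.standard_normal(n) > 0).astype(int)
    return X, y
```

To feed this data to the earlier `fit_renyi_classifier` sketch, add 1 to `X` so the binary features live on the alphabet {1, 2}.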
**Acknowledgments:** The authors are grateful to Stanford University for supporting a Stanford Graduate Fellowship, and to the Center for Science of Information (CSoI), an NSF Science and Technology Center under grant agreement CCF-0939370, for the support during this research.

## References

[1] F. Farnia, M. Razaviyayn, S. Kannan, and D. Tse. Minimum HGR correlation principle: From marginals to joint distribution. arXiv preprint arXiv:1504.06010, 2015.
[2] E. Eban, E. Mezuman, and A. Globerson. Discrete Chebyshev classifiers. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1233-1241, 2014.
[3] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2-3):131-163, 1997.
[4] G. R. G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. I. Jordan. A robust minimax approach to classification. The Journal of Machine Learning Research, 3:555-582, 2003.
[5] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.
[6] T. Roughgarden and M. Kearns. Marginals-to-models reducibility. In Advances in Neural Information Processing Systems, pages 1043-1051, 2013.
[7] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.
[8] A. Globerson and N. Tishby. The minimum information principle for discriminative learning. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 193-200. AUAI Press, 2004.
[9] M. Sion. On general minimax theorems. Pacific J. Math., 8(1):171-176, 1958.
[10] J. De Loera and S. Onn. The complexity of three-way statistical tables. SIAM Journal on Computing, 33(4):819-836, 2004.
[11] D. Bertsimas and J. Sethuraman. Moment problems and semidefinite optimization. In Handbook of Semidefinite Programming, pages 469-509. Springer, 2000.
[12] H. O. Hirschfeld. A connection between correlation and contingency. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 31, pages 520-524. Cambridge Univ. Press, 1935.
[13] H. Gebelein. Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung. ZAMM-Zeitschrift für Angewandte Mathematik und Mechanik, 21(6):364-379, 1941.
[14] A. Rényi. On measures of dependence. Acta Mathematica Hungarica, 10(3):441-451, 1959.
[15] V. Anantharam, A. Gohari, S. Kamath, and C. Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv preprint arXiv:1304.6133, 2013.
[16] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory, volume 16. SIAM, 2014.
[17] A. Shapiro. Monte Carlo sampling methods. Handbooks in Operations Research and Management Science, 10:353-425, 2003.
[18] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793-800, 2009.
[19] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.
[20] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537-550, 1994.
[21] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.