# Multimodal Linear Discriminant Analysis via Structural Sparsity

Yu Zhang^1 and Yuan Jiang^2
^1 Department of Computer Science and Engineering, Hong Kong University of Science and Technology
^2 National Key Laboratory for Novel Software Technology, Nanjing University
zhangyu@cse.ust.hk, jiangy@lamda.nju.edu.cn

Linear discriminant analysis (LDA) is a widely used supervised dimensionality reduction technique. Even though the LDA method has many real-world applications, it has some limitations, such as the single-modal assumption that each class follows a normal distribution. To address this problem, we propose a method called multimodal linear discriminant analysis (MLDA). By generalizing the between-class and within-class scatter matrices, the MLDA model allows each data point to have its own class mean, which is called the instance-specific class mean. Within each class, data points that share the same or similar instance-specific class means are considered to form one cluster or modal. To learn the instance-specific class means, we use as a criterion the ratio of the proposed generalized between-class scatter measure over the proposed generalized within-class scatter measure, which encourages class separability. The observation that each class has a limited number of clusters inspires us to use a structurally sparse regularizer to control the number of unique instance-specific class means in each class. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLDA method.

1 Introduction

In the LDA model, each class is modeled by a Gaussian distribution, which is not capable of handling the multimodal structure contained in the data. Mixture discriminant analysis (MDA) [Hastie and Tibshirani, 1996] is one of the earliest multimodal extensions of the LDA model; it describes the data in each class with a Gaussian mixture model whose number of components is pre-defined and shared among different classes. Different from the MDA model, the LLDA method proposed in [Kim and Kittler, 2005] first groups the whole dataset using a Gaussian mixture model or the k-means clustering algorithm and then learns an LDA model on each cluster. The subclass discriminant analysis (SDA) method proposed in [Zhu and Martínez, 2006] aims at learning the number of clusters in each class based on the leave-one-out-test criterion or a stability criterion proposed in [Martínez and Zhu, 2005]. The local Fisher discriminant analysis (LFDA) proposed in [Sugiyama, 2006; Sugiyama, 2007], which has a similar idea to nonparametric discriminant analysis [Kuo and Landgrebe, 2004; Li et al., 2009], conquers the multimodal problem by incorporating the local structure into the definitions of the within-class and between-class scatter matrices. One by-product of those multimodal LDA models is that the dimension of the embedding space is no longer limited to at most the number of classes minus one, since the rank of the between-class scatter matrix becomes larger by exploiting the multimodal structure.

All the existing multimodal extensions of the LDA model have some limitations. For example, the MDA and LLDA models require that the number of components be predefined, which is not an easy model selection problem.
The assumptions of the SDA method that different classes have the same number of clusters and that the clusters in each class have comparable numbers of data points seem rather restrictive. Moreover, the SDA method consists of two stages, with the first stage learning the cluster structure and the second learning the transformation, which leads to a suboptimal learner. For the LFDA method, it is not easy to infer the cluster structure from data.

In this paper, we propose a method called multimodal linear discriminant analysis (MLDA) to alleviate those limitations of existing multimodal extensions of the LDA model. The MLDA model generalizes the definitions of the between-class and within-class scatter matrices by introducing instance-specific class means. That is, different from the traditional LDA model, which has only one class mean per class, in the MLDA method each data point has its own class mean, and data points in a class that share the same or similar class means are considered to belong to one cluster or modal. We reveal that the conventional scatter matrices are special cases of the generalized ones. Since the instance-specific class means are unknown for real-world problems, we utilize the ratio of the proposed generalized between-class scatter measure over the proposed generalized within-class scatter measure as an objective function to learn them. Usually each class has a limited number of clusters, implying that the number of unique instance-specific class means in each class is not very large, and this observation inspires us to use a structurally sparse regularizer (i.e., the ℓ1,p norm) to enforce some pairs of instance-specific class means to be identical and, in the meanwhile, to control the number of unique instance-specific class means in each class. One advantage of using the structurally sparse regularizer is that we need neither to specify the number of clusters nor to impose constraints on the cluster structure. We devise a proximal average method based on the GIST algorithm [Gong et al., 2013] to solve the resulting objective function, with each subproblem having an analytical solution. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of our proposed method.

2 The MLDA Model

In this section, we first review the conventional LDA model and then present the generalized scatter matrices, which set the stage for the introduction of the proposed MLDA model, as well as their properties.

2.1 LDA Revisited

The LDA model is a supervised dimensionality reduction method. Suppose that a training dataset consists of $n$ pairs of training samples $\{x_i, y_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^D$ denotes the $i$th data point and its class label is denoted by $y_i \in \{1, \ldots, c\}$, making the learning problem a multi-class classification problem with $c$ classes. Let $n_i$ denote the number of data points in the $i$th class, so that $n = \sum_{i=1}^{c} n_i$. Moreover, we define the overall mean $m$ as the mean of all data points, i.e., $m = \frac{1}{n}\sum_{i=1}^{n} x_i$, and the class mean of the $i$th class as $m_i = \frac{1}{n_i}\sum_{y_j=i} x_j$. Then the between-class and within-class scatter matrices are defined as

$$Q_b = \frac{1}{n}\sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T, \quad (1)$$

$$Q_w = \frac{1}{n}\sum_{i=1}^{n} (x_i - m_{y_i})(x_i - m_{y_i})^T. \quad (2)$$
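To make these definitions concrete, the following NumPy sketch computes the class means and the scatter matrices of Eqs. (1) and (2) on a toy dataset. It is only an illustration of the formulas above under our reading of the (extraction-damaged) normalization by $1/n$; the function and variable names are ours, not the authors' implementation.

```python
import numpy as np

def lda_scatter_matrices(X, y):
    """Conventional between-class (Q_b) and within-class (Q_w) scatter
    matrices of Eqs. (1) and (2); X is n x D, y holds integer labels."""
    n, D = X.shape
    m = X.mean(axis=0)                        # overall mean
    Q_b = np.zeros((D, D))
    Q_w = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                  # class mean m_c
        diff = (mc - m)[:, None]
        Q_b += Xc.shape[0] * diff @ diff.T    # n_c (m_c - m)(m_c - m)^T
        Q_w += (Xc - mc).T @ (Xc - mc)        # sum_i (x_i - m_c)(x_i - m_c)^T
    return Q_b / n, Q_w / n

# toy usage with two well-separated classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([5, 5], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
Q_b, Q_w = lda_scatter_matrices(X, y)
```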
Then the LDA model seeks the optimal transformation matrix $W$ by solving the following problem:

$$\max_{W\in\mathbb{R}^{D\times d}} \ \mathrm{tr}\left( \left(W^T (Q_w + \alpha I_D) W\right)^{-1} W^T Q_b W \right), \quad (3)$$

where $d$ is the dimension of the reduced space, $I_a$ denotes the $a\times a$ identity matrix, the superscript $-1$ denotes the matrix inverse, and $\alpha$ is a regularization parameter that guarantees non-singularity.

2.2 Generalized Scatter Matrices

According to problem (3) and the definition of the within-class scatter matrix in Eq. (2), the LDA model enforces data points in one class to be close to the class mean, making it applicable to single-modal data. However, the data used in many applications exhibit multimodal structure, leading to unsatisfactory performance of the LDA model. Here we propose the generalized between-class and within-class scatter matrices as

$$S_b = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \delta(y_i \neq y_j)\,(m_i - m_j)(m_i - m_j)^T, \quad (4)$$

$$S_w = \frac{1}{n}\sum_{i=1}^{n} (x_i - m_i)(x_i - m_i)^T, \quad (5)$$

where $\delta(z)$ returns 1 when $z$ holds and 0 otherwise, and $m_i$ represents the instance-specific class mean for $x_i$. Suppose that the $i$th class has $l_i$ unique instance-specific class means, i.e., the set $\{m_j \mid y_j = i\}$ has $l_i$ distinct elements $\{m_k^i\}$ for $k = 1, \ldots, l_i$. Then we can rewrite the generalized within-class scatter matrix defined in Eq. (5) as

$$S_w = \frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{l_i}\sum_{k\in C_{i,j}} (x_k - m_j^i)(x_k - m_j^i)^T,$$

where $C_{i,j} = \{k \mid m_k = m_j^i\}$ denotes the set of indices of the data points belonging to the $j$th modal (or cluster) of the $i$th class. By using the generalized scatter matrices, we can easily model the multimodal structure contained in the data.

2.3 Properties

The newly defined scatter matrices in Eqs. (4) and (5) are called the generalized between-class and within-class scatter matrices since the conventional scatter matrices are special cases of them; the relation is revealed in the following theorem.

Theorem 1 By setting $m_i$ to be $m_j$, the class mean of the $j$th class, whenever $x_i$ belongs to the $j$th class, we have $S_b = Q_b$ and $S_w = Q_w$.

Recall that the conventional between-class and within-class scatter matrices satisfy $Q_b + Q_w = Q_t$, where $Q_t = \frac{1}{n}\sum_{i=1}^{n}(x_i - m)(x_i - m)^T$ is the total scatter matrix. The following theorem shows that the generalized scatter matrices have a similar property.

Theorem 2 For $S_b$ and $S_w$, we have

$$S_b + S_w = \frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{l_i}\sum_{k\in C_{i,j}} \Big[ (x_k - \bar{m})(x_k - \bar{m})^T - (x_k - m_j^i)(m_j^i - \bar{m})^T - (m_j^i - \bar{m})(x_k - m_j^i)^T \Big] - \sum_{i=1}^{c}\frac{n_i}{n^2}\sum_{j=1}^{l_i} n_{i,j}\,(m_j^i - \bar{m}_i)(m_j^i - \bar{m}_i)^T,$$

where $n_{i,j}$ is the cardinality of $C_{i,j}$, implying that the $j$th cluster of the $i$th class has $n_{i,j}$ data points, $\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i$, and $\bar{m}_i = \frac{1}{n_i}\sum_{y_j=i} m_j$. Moreover, when $m_j^i$ is set to be the average of all the data points in the $j$th cluster of the $i$th class, we have

$$S_b + S_w = Q_t - \sum_{i=1}^{c}\frac{n_i}{n^2}\sum_{j=1}^{l_i} n_{i,j}\,(m_j^i - \bar{m}_i)(m_j^i - \bar{m}_i)^T. \quad (6)$$

According to Theorem 2, we can see that when $\{m_i\}$ take appropriate values, the sum of the generalized between-class and within-class scatter matrices equals the total scatter matrix minus a matrix called the average between-cluster scatter matrix, whose formulation is the second term on the right-hand side of Eq. (6). The average between-cluster scatter matrix measures the separability between different clusters in the same class and is not useful for discriminating different classes.

2.4 The Model

When $M = (m_1, \ldots, m_n)$ is given, we formulate the objective function of the MLDA model in a similar way to the conventional LDA model as

$$\max_{W\in\mathbb{R}^{D\times d}} \ \mathrm{tr}\left( \left(W^T (S_w + \alpha I_D) W\right)^{-1} W^T S_b W \right). \quad (7)$$
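As a numerical sanity check on the generalized definitions and on Theorem 1, the sketch below builds $S_b$ and $S_w$ from a matrix of instance-specific class means and verifies that they reduce to $Q_b$ and $Q_w$ when every $m_i$ is set to its class mean. This is our own illustration: the $1/(2n^2)$ factor reflects our reading of Eq. (4), and the check reuses `lda_scatter_matrices` from the previous sketch.

```python
import numpy as np

def generalized_scatter_matrices(X, y, M):
    """Generalized scatters of Eqs. (4) and (5).
    X: n x D data, y: labels, M: n x D instance-specific class means."""
    n, D = X.shape
    S_w = (X - M).T @ (X - M) / n                       # Eq. (5)
    S_b = np.zeros((D, D))
    for i in range(n):                                  # pairwise sum of Eq. (4)
        for j in range(n):
            if y[i] != y[j]:
                d = (M[i] - M[j])[:, None]
                S_b += d @ d.T
    return S_b / (2 * n ** 2), S_w

# Theorem 1: with every m_i set to its class mean, (S_b, S_w) reduce to (Q_b, Q_w)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (40, 2)), rng.normal([4, 4], 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
M = np.vstack([X[y == c].mean(axis=0) for c in y])      # class mean for each sample
S_b, S_w = generalized_scatter_matrices(X, y, M)
Q_b, Q_w = lda_scatter_matrices(X, y)                   # helper from the previous sketch
assert np.allclose(S_b, Q_b) and np.allclose(S_w, Q_w)
```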
A by-product of this formulation is that if the unique instance-specific class means are linearly independent, the rank of $S_b$ is $\sum_{i=1}^{c} l_i - 1$, which is larger than $c-1$, and so the dimension of the reduced space $d$ can be larger than $c-1$; this overcomes a limitation of the conventional LDA model that $d$ is at most $c-1$. The solution of problem (7) can be obtained by solving the generalized eigen-decomposition problem

$$S_b W = (S_w + \alpha I_D) W \Lambda,$$

where $\Lambda$ is a diagonal matrix containing the largest $d$ eigenvalues.

2.5 Learning M

When $M$ is given, we can compute the transformation matrix $W$. However, in most applications $M$ is unknown and we need to learn it from data automatically. We propose to use the objective function of the MLDA model, which encourages class separability, to learn $M$, that is, to maximize $\mathrm{tr}\big((W^T(S_w + \alpha I_D)W)^{-1} W^T S_b W\big)$ with respect to both $W$ and $M$. However, this problem contains two variables, making it not very easy to solve. In the following theorem, we introduce an upper bound which can be used as a surrogate function.

Theorem 3

$$\max_{W\in\mathbb{R}^{D\times d}} \ \mathrm{tr}\left( \left(W^T (S_w + \alpha I_D) W\right)^{-1} W^T S_b W \right) \le \mathrm{tr}\left( (S_w + \alpha I_D)^{-1} S_b \right).$$

By using Theorem 3, we can use the upper bound as a surrogate function to learn $M$ as

$$\max_{M} \ \mathrm{tr}\left( (S_w + \alpha I_D)^{-1} S_b \right). \quad (8)$$

If we solve this problem directly, we may get a trivial solution such as $m_i = x_i$. In order to avoid this situation, we can utilize the structural sparsity among all the $m_i$'s, namely that each class has a limited number of clusters, which corresponds to the situation that the number of unique instance-specific class means in each class is not very large, or equivalently that many elements of $\{m_i - m_j \mid y_i = y_j\}$ are zero vectors. In the next section, we discuss how to learn $M$ via structurally sparse regularization without knowing the number of clusters in each class.

3 Learning M via Structural Sparsity

When there is no information available about $M$, we need to learn $M$ from data automatically. In this case, we utilize structurally sparse regularization [Hocking et al., 2011; Bach et al., 2012b; Bach et al., 2012a] to enforce sparsity in the set $\{m_i - m_j \mid y_i = y_j\}$. The objective function to learn $M$ is formulated as

$$\min_{M} \ \beta \sum_{i<j} \gamma_{ij}\,\|m_i - m_j\|_p - \mathrm{tr}\left( (S_w + \alpha I_D)^{-1} S_b \right), \quad (9)$$

where $\beta$ is a regularization parameter and the $\gamma_{ij} \ge 0$ are pairwise weights that are positive only for pairs of data points from the same class (their specific choice is given in the experiments). Obviously problem (9) is non-convex due to the non-convexity of the second term in problem (9). To solve problem (9), we use a proximal method for non-convex optimization problems, the GIST algorithm [Gong et al., 2013]. To apply the GIST algorithm to problem (9), we define $f(M)$ and $g(M)$ as $f(M) = -\mathrm{tr}\big((S_w + \alpha I_D)^{-1} S_b\big)$ and $g(M) = \beta \sum_{i<j} \gamma_{ij}\|m_i - m_j\|_p$, so that problem (9) becomes $\min_M f(M) + g(M)$. In each iteration of the GIST algorithm, with $t$ denoting the step parameter and $A$ the gradient-step estimate $M - \frac{1}{t}\nabla f(M)$, we need to solve the proximal problem

$$\min_{Z} \ \frac{t}{2}\|Z - A\|_F^2 + \beta \sum_{i<j} \gamma_{ij}\|z_i - z_j\|_p, \quad (10)$$

which, after normalizing the weights (i.e., setting $\theta_{ij} \propto \beta\gamma_{ij}$ and rescaling $t$ by the same factor), can be rewritten as

$$\min_{Z} \ \frac{t}{2}\|Z - A\|_F^2 + \sum_{i<j} \theta_{ij}\|z_i - z_j\|_p, \quad (11)$$

where $z_i$ is the $i$th column of $Z$. Note that problems (11) and (10) are equivalent. Here we investigate the case $p = 2$ and leave other cases for future study. Problem (11) is a convex proximal problem, but due to the complex regularizer on $Z$ it has no closed-form solution. Here we use the proximal average technique [Yu, 2013] to solve it. The proximal average method approximates the solution of a proximal problem whose regularizer is a convex combination of multiple regularizers by the same convex combination of the solutions of the proximal problems with each regularizer taken individually. One advantage of the proximal average method is that each simple proximal problem can have an analytical solution, and hence so does the approximate solution, leading to a low complexity for solving problem (11). For problem (11), without loss of generality, we assume that $\sum_{i<j} \theta_{ij} = 1$. Then the regularizer in problem (11) is a convex combination of the simple regularizers $\|z_i - z_j\|_p$.
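Before turning to the single-pair subproblems, a compact sketch of the outer GIST-style iteration described above may help. It is a simplified monotone variant (the original GIST of Gong et al. uses a non-monotone acceptance rule and a Barzilai-Borwein initial step), the function and parameter names are ours, and the toy example at the end uses an ℓ1 regularizer purely as a stand-in proximal map, not the pairwise penalty of problem (9).

```python
import numpy as np

def gist(f, grad_f, g, prox, M0, t0=1.0, eta=2.0, sigma=0.2, max_iter=200, tol=1e-10):
    """Simplified monotone GIST iteration for min_M f(M) + g(M):
    a gradient step on the smooth (possibly non-convex) f followed by the
    proximal map of g, increasing the step parameter t until a
    sufficient-decrease condition holds."""
    M = M0.copy()
    obj = f(M) + g(M)
    for _ in range(max_iter):
        G = grad_f(M)
        t = t0
        for _ in range(50):                       # line search on t
            M_new = prox(M - G / t, t)            # solves min_Z t/2 ||Z - A||_F^2 + g(Z)
            obj_new = f(M_new) + g(M_new)
            if obj_new <= obj - 0.5 * sigma * t * np.sum((M_new - M) ** 2):
                break
            t *= eta                              # shrink the step size 1/t and retry
        if np.sum((M_new - M) ** 2) <= tol:
            return M_new
        M, obj = M_new, obj_new
    return M

# toy check with f(M) = 0.5*||M - B||_F^2 and an l1 stand-in for g (soft-threshold prox)
B = np.arange(6.0).reshape(2, 3) - 2.0
soft = lambda A, t: np.sign(A) * np.maximum(np.abs(A) - 1.0 / t, 0.0)
M_hat = gist(lambda M: 0.5 * np.sum((M - B) ** 2), lambda M: M - B,
             lambda M: np.abs(M).sum(), soft, np.zeros((2, 3)))
```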
Suppose $Z^{(i,j)}$ is the solution of the following problem:

$$\min_{Z} \ \frac{t}{2}\|Z - A\|_F^2 + \|z_i - z_j\|_p. \quad (12)$$

Then the proximal average method approximates the solution of problem (11) by $\sum_{j>i} \theta_{ij} Z^{(i,j)}$. Problem (12) has an analytical solution, as shown in the following theorem.

Theorem 4 When $p = 2$, problem (12) has the analytical solution

$$z_k = a_k \ \text{ for } k \neq i, j,$$

$$z_i = \begin{cases} \frac{1}{2}(a_i + a_j) & \text{if } \|a_i - a_j\|_2 \le 2/t \\ a_i - \dfrac{a_i - a_j}{t\,\|a_i - a_j\|_2} & \text{otherwise} \end{cases}, \qquad z_j = \begin{cases} \frac{1}{2}(a_i + a_j) & \text{if } \|a_i - a_j\|_2 \le 2/t \\ a_j - \dfrac{a_j - a_i}{t\,\|a_i - a_j\|_2} & \text{otherwise} \end{cases},$$

where $a_i$ is the $i$th column of $A$.

Moreover, we can divide the set of index pairs $\{(i,j) \mid i < j\}$ into several subsets $S_1, \ldots, S_t$ so that within each subset $S_l$ the indices of different pairs are non-overlapping, i.e., for any two different pairs $(i_1, j_1)$ and $(i_2, j_2)$ in $S_l$, the indices $i_1, j_1, i_2, j_2$ are all distinct. Suppose $Z^{(l)}$ is the solution of the following problem:

$$\min_{Z} \ \frac{t}{2}\|Z - A\|_F^2 + \frac{1}{\theta^{(l)}}\sum_{(i,j)\in S_l} \theta_{ij}\|z_i - z_j\|_p, \quad (13)$$

where $\theta^{(l)} = \sum_{(i,j)\in S_l} \theta_{ij}$. Then the proximal average method approximates the solution of problem (11) by $\sum_{l} \theta^{(l)} Z^{(l)}$. Similar to problem (12), problem (13) has an analytical solution obtained by a similar analysis, and we omit the details. For the construction of the subsets, we can adopt a greedy approach in which $S_1$ collects pairs of successive indices, $S_2$ collects pairs of indices at distance 2, and so on. It is easy to show that the number of subsets is $O(m)$. For the two proximal average variants, we can compute the solutions of all the subproblems (i.e., problems (12) and (13)) in parallel, which accelerates the training process. It is easy to show that problem (12) needs to be solved $O(m^2)$ times, while problem (13) only needs to be solved $O(m)$ times. So, for complexity reasons, we prefer the second approach, even though the analysis of the first approach is the cornerstone of the second. For the whole GIST algorithm in which the proximal subproblem is solved approximately by the proximal average method, by following [Zhong and Kwok, 2014] we can prove that the algorithm converges to a critical point, which can be a local optimum of problem (9).

3.2 Discussion

To see which condition leads different instance-specific class means to take the same value, we investigate a special case of problem (9) in which $p$ equals 2 and $\gamma_{ij}$ is equal to a constant $\gamma$ for all pairs $(i, j)$. Suppose that in one class there exists a cluster consisting of only two data points $x_i$ and $x_j$, with their instance-specific class means denoted by $m_i$ and $m_j$. By setting the derivative of the objective function in problem (9) with respect to $m_i$ and $m_j$ to zero, respectively, we obtain the stationarity conditions

$$\frac{\partial f(M)}{\partial m_i} + \beta\gamma \sum_{k\neq i,j:\ y_k=y_i} \frac{\partial \|m_i - m_k\|_2}{\partial m_i} + \beta\gamma\,\frac{\partial \|m_i - m_j\|_2}{\partial m_i} = 0, \quad (14)$$

$$\frac{\partial f(M)}{\partial m_j} + \beta\gamma \sum_{k\neq i,j:\ y_k=y_j} \frac{\partial \|m_j - m_k\|_2}{\partial m_j} + \beta\gamma\,\frac{\partial \|m_i - m_j\|_2}{\partial m_j} = 0, \quad (15)$$

where $f(M) = -\mathrm{tr}\big((S_w + \alpha I_D)^{-1} S_b\big)$ and $\frac{\partial g(x)}{\partial x}$ denotes the (sub)gradient of a function $g(\cdot)$ with respect to a variable $x$. Since $x_i$ and $x_j$ compose a cluster, $m_i$ is equal to $m_j$ and $m_k$ is not equal to $m_i$ for all other $k$'s in the same class; hence the middle terms in Eqs. (14) and (15) are equal to each other, and based on the chain rule the last terms in those two equations are $\beta\gamma$ times subgradients of the $\ell_2$ norm at the origin, whose $\ell_2$ norms are no larger than 1. By computing the difference between Eqs. (14) and (15), we obtain

$$\Big\|\frac{\partial f(M)}{\partial m_i} - \frac{\partial f(M)}{\partial m_j}\Big\|_2 \le 2\beta\gamma,$$

which implies that data points with similar gradients of the function $f(\cdot)$ are more likely to share the same instance-specific class mean. For nearby data points in one class, the difference between their gradients can be small, and so they tend to lie in one cluster.
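To make the proximal average step of Section 3 concrete, the following sketch implements the closed-form single-pair solution of Theorem 4 (for $p = 2$) and the weighted combination that approximates problem (11). The `theta` dictionary of normalized pair weights, the function names, and the toy values are our own illustration; the blocked variant of problem (13) is not shown.

```python
import numpy as np

def prox_single_pair(A, i, j, t):
    """Closed-form solution of problem (12) for p = 2 (Theorem 4):
    min_Z t/2 ||Z - A||_F^2 + ||z_i - z_j||_2, with z_i, z_j columns of Z."""
    Z = A.copy()
    d = A[:, i] - A[:, j]
    nd = np.linalg.norm(d)
    if nd <= 2.0 / t:                       # the two columns are fused
        Z[:, i] = Z[:, j] = 0.5 * (A[:, i] + A[:, j])
    else:                                   # the two columns move toward each other by 1/t
        Z[:, i] = A[:, i] - d / (t * nd)
        Z[:, j] = A[:, j] + d / (t * nd)
    return Z

def prox_average(A, theta, t):
    """Proximal-average approximation to problem (11): a theta-weighted
    combination of the single-pair solutions; sum of theta assumed to be 1.
    theta: dict mapping a pair (i, j) with i < j to its weight theta_ij."""
    return sum(w * prox_single_pair(A, i, j, t) for (i, j), w in theta.items())

# toy usage: three 2-D columns, pairs (0, 1) and (0, 2) weighted equally
A = np.array([[0.0, 0.1, 3.0],
              [0.0, -0.1, 3.0]])
theta = {(0, 1): 0.5, (0, 2): 0.5}
Z_approx = prox_average(A, theta, t=1.0)
```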
4 Experiments

In this section, we empirically test the performance of the MLDA model.

The methods in comparison include the LDA method [Belhumeur et al., 1997], the LLDA method [Kim and Kittler, 2005], the SDA method [Zhu and Martínez, 2006], and the LFDA method [Sugiyama, 2007]. As discussed in the introduction, the LLDA method is a two-stage method which first clusters the whole dataset via the k-means clustering algorithm and then learns an LDA model on each cluster, the SDA method can learn the number of clusters, and the LFDA method utilizes the local density to model the multimodal structure. Recall that the proposed objective function for learning $M$ in problem (9) is non-convex, making it sensitive to the initial value of $M$. In the following experiments, the initial value of $m_i$ is set to $x_i$. The parameters (i.e., $\eta$, $\sigma$, and $t_0$) of the GIST algorithm are set to 2, 0.2, and 1, respectively. The regularization parameters $\alpha$ and $\beta$ are selected via 10-fold cross validation from the candidate set $\{0.001, 0.01, 0.1, 1\}$, and the hyperparameters of the baselines are also selected via 10-fold cross validation. For a fair comparison with the conventional LDA method, we set the reduced dimension of all the methods in comparison to $c - 1$, where $c$ is the number of classes. After learning the transformations of all the methods, we use the nearest neighbor classifier to make predictions.

4.1 Experiments on Synthetic Data

The first synthetic dataset is generated as follows. There are two classes in this dataset. The first class contains two clusters, with the first cluster generated from a Gaussian distribution $\mathcal{N}((1, 0)^T, I_2)$ and the second one following $\mathcal{N}((20, 0)^T, I_2)$. The second class has three clusters, sampled from $\mathcal{N}((10, 10)^T, I_2)$, $\mathcal{N}((10, 2)^T, I_2)$, and $\mathcal{N}((10, 15)^T, I_2)$, respectively. For each cluster in those two classes, we randomly generate 100 data points to form the training set, and the dataset is plotted in Figure 1(a). The settings of the second synthetic dataset are almost the same as those of the first one, with the difference lying in the covariance matrices of the clusters. Here different covariance matrices are used for the two classes, i.e., the two clusters in the first class use $\mathrm{diag}(1.5, 1)$ as the covariance matrix and all the covariance matrices of the three clusters in the second class equal $\mathrm{diag}(1, 1.5)$.

The one-dimensional transformations learned by the LDA, SDA, LFDA, and MLDA methods on the two synthetic datasets are plotted in Figure 1. The LLDA method is not included here since it cannot discover the clusters within each class. Based on the data distributions of the two synthetic datasets, the ideal transformation is the horizontal direction, against which we can compare the transformations produced by the methods in comparison. From the results, we can see that the discriminative ability of the transformation produced by the SDA model is the worst, since some clusters from different classes overlap after the dimensionality reduction. One reason is that the data distribution violates an assumption of the SDA model that the numbers of clusters in different classes are equal. The LDA model is better than the SDA model, but the data points in different classes still overlap somewhat after the projection, leading to possible classification errors at test time.
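For reference, the two synthetic datasets described above can be generated as follows. This is a sketch that only uses the cluster means, covariances, and the 100 points per cluster stated in the text; the function name and random seed are our own.

```python
import numpy as np

def make_synthetic(version=1, n_per_cluster=100, seed=0):
    """Two-class multimodal toy data: class 1 has clusters at (1,0) and (20,0),
    class 2 has clusters at (10,10), (10,2) and (10,15)."""
    rng = np.random.default_rng(seed)
    means1 = [np.array([1.0, 0.0]), np.array([20.0, 0.0])]
    means2 = [np.array([10.0, 10.0]), np.array([10.0, 2.0]), np.array([10.0, 15.0])]
    if version == 1:
        cov1 = cov2 = np.eye(2)             # dataset I: identity covariances
    else:
        cov1 = np.diag([1.5, 1.0])          # dataset II: class-specific covariances
        cov2 = np.diag([1.0, 1.5])
    X1 = np.vstack([rng.multivariate_normal(m, cov1, n_per_cluster) for m in means1])
    X2 = np.vstack([rng.multivariate_normal(m, cov2, n_per_cluster) for m in means2])
    X = np.vstack([X1, X2])
    y = np.array([0] * len(X1) + [1] * len(X2))
    return X, y

X, y = make_synthetic(version=2)
```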
Figure 1: Experiments on the two synthetic datasets: (a) Synthetic Data I, (b) Synthetic Data II. Blue and black points are from the first and second classes.

The LFDA and MLDA models give the best results, since their learned transformations are almost horizontal. Moreover, one advantage of the proposed MLDA model over the LFDA model is that the MLDA model can identify the clusters in each class via the learned $M$. According to $M$, the clusters found by the MLDA model are exactly the same as the ground truth on those two synthetic datasets, which demonstrates the cluster-finding ability of the MLDA model. The five unique instance-specific class means corresponding to the five clusters in the first dataset are $(0.8864, 0.0386)^T$, $(20.3167, 0.0110)^T$, $(9.6911, 10.0980)^T$, $(10.0881, 1.8842)^T$, and $(9.9684, 15.2609)^T$, and those in the second dataset are $(0.6381, 0.0925)^T$, $(20.3047, 0.0310)^T$, $(10.0538, 9.8551)^T$, $(9.9588, 2.2973)^T$, and $(9.8401, 15.4004)^T$. We can see that the unique instance-specific class means are very close to the means of the underlying Gaussian distributions, which demonstrates the effectiveness of the MLDA model on these two datasets.

4.2 Experiments on Real-World Datasets

Six real-world datasets are used in our experiments, including the ETH-80, COIL-20, COIL-100, AR, UMIST, and MNIST databases. The ETH-80 dataset [Leibe and Schiele, 2003] contains images of the following categories: apples, pears, cars, cows, horses, dogs, tomatoes, and cups; each category includes the images of 10 objects taken at 41 orientations, giving a total of 410 images per category. The COIL-20 dataset contains 1,440 gray-scale images of 20 objects, and each object has 72 images of size 16×16, with successive viewpoints differing by 5 degrees. The images in the COIL-100 dataset are taken in a similar way to the COIL-20 dataset, and it contains 7,200 gray-scale images of 100 objects. The AR dataset [Martínez and Benavente, 1998] contains frontal face images of 100 persons (50 men and 50 women) with different expressions, illuminations, and occlusions; there are 26 images for each person, taken in two sessions of 13 images each. The UMIST dataset [Graham and Allinson, 1998] is a multi-view dataset consisting of 575 gray-scale images of 20 people (subjects), each covering a wide range of poses from profile to frontal views. The MNIST dataset [Le Cun and Cortes, 1998] contains 70,000 handwritten digits ranging from 0 to 9, with about 7,000 samples per digit, where each image is normalized to a gray-level image of size 14×14.

In the proposed MLDA method, $\gamma_{ij}$ is set to $\exp\{-\frac{1}{\sigma_{y_i}^2}\|x_i - x_j\|_2^2\}$ when $y_i = y_j$ and 0 otherwise, where $\sigma_i$ denotes the average pairwise Euclidean distance in the $i$th class. In order to investigate the effect of varying the size of the training set, we randomly sample 30%, 50%, and 70% of the total data as the training set and use the rest as the test set. Each configuration is repeated 10 times, and the average results as well as the standard deviations are plotted in Figure 2.

Figure 2: Test errors on real-world datasets when varying the size of the training set.
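The pairwise weights $\gamma_{ij}$ defined above can be computed as in the following sketch (our own illustration; the matrix is built densely for clarity, and only within-class entries are non-zero).

```python
import numpy as np
from scipy.spatial.distance import cdist

def pair_weights(X, y):
    """gamma_ij = exp(-||x_i - x_j||_2^2 / sigma_{y_i}^2) for same-class pairs,
    0 otherwise; sigma_c is the average pairwise distance within class c."""
    n = X.shape[0]
    D = cdist(X, X)                                          # Euclidean distances
    gamma = np.zeros((n, n))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        if len(idx) < 2:
            continue
        Dc = D[np.ix_(idx, idx)]
        sigma = Dc[np.triu_indices(len(idx), k=1)].mean()    # average within-class distance
        gamma[np.ix_(idx, idx)] = np.exp(-(Dc ** 2) / sigma ** 2)
    np.fill_diagonal(gamma, 0.0)
    return gamma
```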
From the results, we can see that the performance of the LLDA, SDA, and LFDA models is sometimes better than that of the LDA model, for example, on the COIL-20 and COIL-100 datasets, but in other cases they perform only comparably to the LDA model. The proposed MLDA method has the best performance in all settings. On the ETH-80 and AR datasets, the performance of the proposed MLDA method using 30% of the data for training is comparable to that of the SDA or LFDA method with a training percentage of 70%, which further shows the effectiveness of the MLDA method.

5 Conclusion

In this paper, we propose a multimodal extension of the LDA model by generalizing the between-class and within-class scatter matrices based on the instance-specific class means assigned to each data point. The learning of the instance-specific class means is accomplished by maximizing a surrogate function regularized by structural sparsity, and the resulting objective function is solved by a proximal algorithm. The objective function in problem (8) can be viewed as a loss function for $M$ that encourages class separability, but it is not the only choice. We will try other criteria, such as the stability criterion proposed in [Martínez and Zhu, 2005]; the implementation can be reused since we only need to modify the definition of the function $f(\cdot)$ and the calculation of its gradient. Moreover, the proposed MLDA model is a linear model, and we are interested in investigating its nonlinear extension.

Acknowledgments

This work is supported by the NSFC (61473087, 61673201, 61673202), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Natural Science Foundation of Jiangsu Province (BK20141340).

References

[Bach et al., 2012a] Francis R. Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

[Bach et al., 2012b] Francis R. Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity through convex optimization. Statistical Science, 27(4):450–468, 2012.

[Belhumeur et al., 1997] Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.

[Gong et al., 2013] Pinghua Gong, Changshui Zhang, Zhaosong Lu, Jianhua Huang, and Jieping Ye. A general iterative shrinkage and thresholding algorithm for nonconvex regularized optimization problems. In Proceedings of the 30th International Conference on Machine Learning, pages 37–45, 2013.

[Graham and Allinson, 1998] Daniel B. Graham and Nigel Allinson. Characterizing virtual eigensignatures for general purpose face recognition. In Face Recognition: From Theory to Applications, volume 163, pages 446–456. Springer, 1998.

[Hastie and Tibshirani, 1996] Trevor Hastie and Robert Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B, 58(1):155–176, 1996.

[Hocking et al., 2011] Toby Hocking, Jean-Philippe Vert, Francis R. Bach, and Armand Joulin. Clusterpath: An algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, pages 745–752, 2011.

[Kim and Kittler, 2005] Tae-Kyun Kim and Josef Kittler. Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):318–327, 2005.
[Kuo and Landgrebe, 2004] B.-C. Kuo and D. A. Landgrebe. Nonparametric weighted feature extraction for classification. IEEE Transactions on Geoscience and Remote Sensing, 42(5):1096–1105, 2004.

[Le Cun and Cortes, 1998] Yann Le Cun and Corinna Cortes. The MNIST database of handwritten digits, 1998.

[Leibe and Schiele, 2003] Bastian Leibe and Bernt Schiele. Analyzing appearance and contour based methods for object categorization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 409–415, 2003.

[Li et al., 2009] Zhifeng Li, Dahua Lin, and Xiaoou Tang. Nonparametric discriminant analysis for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):755–761, 2009.

[Martínez and Benavente, 1998] Aleix M. Martínez and Robert Benavente. The AR-face database. Technical Report 24, CVC, 1998.

[Martínez and Zhu, 2005] Aleix M. Martínez and Manli Zhu. Where are linear feature extraction methods applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1934–1944, 2005.

[Sugiyama, 2006] Masashi Sugiyama. Local Fisher discriminant analysis for supervised dimensionality reduction. In Proceedings of the 23rd International Conference on Machine Learning, pages 905–912, 2006.

[Sugiyama, 2007] Masashi Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027–1061, 2007.

[Yu, 2013] Yaoliang Yu. Better approximation and faster algorithm using the proximal average. In Advances in Neural Information Processing Systems 26, pages 458–466, 2013.

[Zhong and Kwok, 2014] Wenliang Zhong and James T. Kwok. Gradient descent with proximal average for nonconvex and composite regularization. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 2206–2212, 2014.

[Zhu and Martínez, 2006] Manli Zhu and Aleix M. Martínez. Subclass discriminant analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1274–1286, 2006.