# Multimodal Linear Discriminant Analysis via Structural Sparsity

Yu Zhang^1 and Yuan Jiang^2
^1 Department of Computer Science and Engineering, Hong Kong University of Science and Technology
^2 National Key Laboratory for Novel Software Technology, Nanjing University
zhangyu@cse.ust.hk, jiangy@lamda.nju.edu.cn

Linear discriminant analysis (LDA) is a widely used supervised dimensionality reduction technique. Even though the LDA method has many real-world applications, it has some limitations, such as the single-modal assumption that each class follows a normal distribution. To address this problem, we propose a method called multimodal linear discriminant analysis (MLDA). By generalizing the between-class and within-class scatter matrices, the MLDA model allows each data point to have its own class mean, which is called the instance-specific class mean. Within each class, data points that share the same or similar instance-specific class means are considered to form one cluster or modal. To learn the instance-specific class means, we use as a criterion the ratio of the proposed generalized between-class scatter measure over the proposed generalized within-class scatter measure, which encourages class separability. The observation that each class has a limited number of clusters inspires us to use a structurally sparse regularizer to control the number of unique instance-specific class means in each class. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLDA method.

1 Introduction

In the LDA model, each class is modeled by a Gaussian distribution, which is not capable of handling the multimodal structure contained in the data. Mixture discriminant analysis (MDA) [Hastie and Tibshirani, 1996] is one of the earliest multimodal extensions of the LDA model; it describes the data in each class with a Gaussian mixture model whose number of components is pre-defined and shared among different classes. Different from the MDA model, the LLDA method proposed in [Kim and Kittler, 2005] first groups the whole dataset using a Gaussian mixture model or the k-means clustering algorithm and then learns an LDA model on each cluster. The subclass discriminant analysis (SDA) method proposed in [Zhu and Martínez, 2006] aims at learning the number of clusters in each class based on the leave-one-out-test criterion or a stability criterion proposed in [Martínez and Zhu, 2005]. The local Fisher discriminant analysis (LFDA) proposed in [Sugiyama, 2006; Sugiyama, 2007], which has a similar idea to nonparametric discriminant analysis [Kuo and Landgrebe, 2004; Li et al., 2009], conquers the multimodal problem by incorporating the local structure into the definitions of the within-class and between-class scatter matrices. One by-product of those multimodal LDA models is that the dimension of the embedding space is no longer limited to at most the number of classes minus one, since the rank of the between-class scatter matrix becomes larger by exploiting the multimodal structure.

All the existing multimodal extensions of the LDA model have some limitations. For example, the MDA and LLDA models require that the number of components be predefined, which is not an easy model selection problem.
The assumptions of the SDA method that different classes have the same number of clusters and that the clusters in each class have comparable numbers of data points seem rather restrictive. Moreover, the SDA method consists of two stages, with the first stage learning the cluster structure and the second learning the transformation, which leads to a suboptimal learner. For the LFDA method, it is not easy to infer the cluster structure from data.

In this paper, we propose a method called multimodal linear discriminant analysis (MLDA) to alleviate those limitations of existing multimodal extensions of the LDA model. The MLDA model generalizes the definitions of the between-class and within-class scatter matrices by introducing instance-specific class means. That is, different from the traditional LDA model, which has only one class mean per class, in the MLDA method each data point has its own class mean, and data points in a class that share the same or similar class means are considered to belong to one cluster or modal. We reveal that the conventional scatter matrices are special cases of the generalized ones. Since the instance-specific class means are unknown for real-world problems, we utilize the ratio of the proposed generalized between-class scatter measure over the proposed generalized within-class scatter measure as an objective function to learn them. Usually each class has a limited number of clusters, implying that the number of unique instance-specific class means in each class is not very large, and this observation inspires us to use a structurally sparse regularizer (i.e., the ℓ1,p norm) to enforce some pairs of instance-specific class means to be identical and, in the meanwhile, to control the number of unique instance-specific class means in each class. One advantage of using the structurally sparse regularizer is that we need neither to specify the number of clusters nor to impose constraints on the cluster structure. We devise a proximal average method based on the GIST algorithm [Gong et al., 2013] to solve the resulting objective function, with each subproblem having an analytical solution. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of our proposed method.

2 The MLDA Model

In this section, we first review the conventional LDA model and then present the generalized scatter matrices, which set the stage for the introduction of the proposed MLDA model, as well as their properties.

2.1 LDA Revisited

The LDA model is a supervised dimensionality reduction method. Suppose that a training dataset consists of $n$ pairs of training samples $\{x_i, y_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^D$ denotes the $i$th data point and its class label is denoted by $y_i \in \{1, \ldots, c\}$, making the learning problem a multi-class classification problem with $c$ classes. Let $n_i$ denote the number of data points in the $i$th class, so that $n = \sum_{i=1}^{c} n_i$. Moreover, we define the overall mean $m$ as the mean of all data points, i.e., $m = \frac{1}{n}\sum_{i=1}^{n} x_i$, and the class mean of the $i$th class as $m_i = \frac{1}{n_i}\sum_{y_j=i} x_j$. Then the between-class and within-class scatter matrices are defined as

$$Q_b = \frac{1}{n}\sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T, \quad (1)$$

$$Q_w = \frac{1}{n}\sum_{i=1}^{n} (x_i - m_{y_i})(x_i - m_{y_i})^T. \quad (2)$$
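To make these definitions concrete, the following NumPy sketch computes the class means and the scatter matrices of Eqs. (1) and (2) on a toy dataset. It is only an illustration of the formulas above under our reading of the (extraction-damaged) normalization by $1/n$; the function and variable names are ours, not the authors' implementation.

```python
import numpy as np

def lda_scatter_matrices(X, y):
    """Conventional between-class (Q_b) and within-class (Q_w) scatter
    matrices of Eqs. (1) and (2); X is n x D, y holds integer labels."""
    n, D = X.shape
    m = X.mean(axis=0)                        # overall mean
    Q_b = np.zeros((D, D))
    Q_w = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                  # class mean m_c
        diff = (mc - m)[:, None]
        Q_b += Xc.shape[0] * diff @ diff.T    # n_c (m_c - m)(m_c - m)^T
        Q_w += (Xc - mc).T @ (Xc - mc)        # sum_i (x_i - m_c)(x_i - m_c)^T
    return Q_b / n, Q_w / n

# toy usage with two well-separated classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([5, 5], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
Q_b, Q_w = lda_scatter_matrices(X, y)
```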
Then the LDA model seeks the optimal transformation matrix $W$ by solving the following problem:

$$\max_{W\in\mathbb{R}^{D\times d}} \ \mathrm{tr}\left( \left(W^T (Q_w + \alpha I_D) W\right)^{-1} W^T Q_b W \right), \quad (3)$$

where $d$ is the dimension of the reduced space, $I_a$ denotes the $a\times a$ identity matrix, the superscript $-1$ denotes the matrix inverse, and $\alpha$ is a regularization parameter that guarantees non-singularity.

2.2 Generalized Scatter Matrices

According to problem (3) and the definition of the within-class scatter matrix in Eq. (2), the LDA model enforces data points in one class to be close to the class mean, making it applicable to single-modal data. However, the data used in many applications exhibit multimodal structure, leading to unsatisfactory performance of the LDA model. Here we propose the generalized between-class and within-class scatter matrices as

$$S_b = \frac{1}{2n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \delta(y_i \neq y_j)\,(m_i - m_j)(m_i - m_j)^T, \quad (4)$$

$$S_w = \frac{1}{n}\sum_{i=1}^{n} (x_i - m_i)(x_i - m_i)^T, \quad (5)$$

where $\delta(z)$ returns 1 when $z$ holds and 0 otherwise, and $m_i$ represents the instance-specific class mean for $x_i$. Suppose that the $i$th class has $l_i$ unique instance-specific class means, i.e., the set $\{m_j \mid y_j = i\}$ has $l_i$ distinct elements $\{m_k^i\}$ for $k = 1, \ldots, l_i$. Then we can rewrite the generalized within-class scatter matrix defined in Eq. (5) as

$$S_w = \frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{l_i}\sum_{k\in C_{i,j}} (x_k - m_j^i)(x_k - m_j^i)^T,$$

where $C_{i,j} = \{k \mid m_k = m_j^i\}$ denotes the set of indices of the data points belonging to the $j$th modal (or cluster) of the $i$th class. By using the generalized scatter matrices, we can easily model the multimodal structure contained in the data.

2.3 Properties

The newly defined scatter matrices in Eqs. (4) and (5) are called the generalized between-class and within-class scatter matrices since the conventional scatter matrices are special cases of them; the relation is revealed in the following theorem.

Theorem 1 By setting $m_i$ to be $m_j$, the class mean of the $j$th class, whenever $x_i$ belongs to the $j$th class, we have $S_b = Q_b$ and $S_w = Q_w$.

Recall that the conventional between-class and within-class scatter matrices satisfy $Q_b + Q_w = Q_t$, where $Q_t = \frac{1}{n}\sum_{i=1}^{n}(x_i - m)(x_i - m)^T$ is the total scatter matrix. The following theorem shows that the generalized scatter matrices have a similar property.

Theorem 2 For $S_b$ and $S_w$, we have

$$S_b + S_w = \frac{1}{n}\sum_{i=1}^{c}\sum_{j=1}^{l_i}\sum_{k\in C_{i,j}} \Big[ (x_k - \bar{m})(x_k - \bar{m})^T - (x_k - m_j^i)(m_j^i - \bar{m})^T - (m_j^i - \bar{m})(x_k - m_j^i)^T \Big] - \sum_{i=1}^{c}\frac{n_i}{n^2}\sum_{j=1}^{l_i} n_{i,j}\,(m_j^i - \bar{m}_i)(m_j^i - \bar{m}_i)^T,$$

where $n_{i,j}$ is the cardinality of $C_{i,j}$, implying that the $j$th cluster of the $i$th class has $n_{i,j}$ data points, $\bar{m} = \frac{1}{n}\sum_{i=1}^{n} m_i$, and $\bar{m}_i = \frac{1}{n_i}\sum_{y_j=i} m_j$. Moreover, when $m_j^i$ is set to be the average of all the data points in the $j$th cluster of the $i$th class, we have

$$S_b + S_w = Q_t - \sum_{i=1}^{c}\frac{n_i}{n^2}\sum_{j=1}^{l_i} n_{i,j}\,(m_j^i - \bar{m}_i)(m_j^i - \bar{m}_i)^T. \quad (6)$$

According to Theorem 2, we can see that when $\{m_i\}$ take appropriate values, the sum of the generalized between-class and within-class scatter matrices equals the total scatter matrix minus a matrix called the average between-cluster scatter matrix, whose formulation is the second term on the right-hand side of Eq. (6). The average between-cluster scatter matrix measures the separability between different clusters in the same class and is not useful for discriminating different classes.

2.4 The Model

When $M = (m_1, \ldots, m_n)$ is given, we formulate the objective function of the MLDA model in a similar way to the conventional LDA model as

$$\max_{W\in\mathbb{R}^{D\times d}} \ \mathrm{tr}\left( \left(W^T (S_w + \alpha I_D) W\right)^{-1} W^T S_b W \right). \quad (7)$$
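As a numerical sanity check on the generalized definitions and on Theorem 1, the sketch below builds $S_b$ and $S_w$ from a matrix of instance-specific class means and verifies that they reduce to $Q_b$ and $Q_w$ when every $m_i$ is set to its class mean. This is our own illustration: the $1/(2n^2)$ factor reflects our reading of Eq. (4), and the check reuses `lda_scatter_matrices` from the previous sketch.

```python
import numpy as np

def generalized_scatter_matrices(X, y, M):
    """Generalized scatters of Eqs. (4) and (5).
    X: n x D data, y: labels, M: n x D instance-specific class means."""
    n, D = X.shape
    S_w = (X - M).T @ (X - M) / n                       # Eq. (5)
    S_b = np.zeros((D, D))
    for i in range(n):                                  # pairwise sum of Eq. (4)
        for j in range(n):
            if y[i] != y[j]:
                d = (M[i] - M[j])[:, None]
                S_b += d @ d.T
    return S_b / (2 * n ** 2), S_w

# Theorem 1: with every m_i set to its class mean, (S_b, S_w) reduce to (Q_b, Q_w)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (40, 2)), rng.normal([4, 4], 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
M = np.vstack([X[y == c].mean(axis=0) for c in y])      # class mean for each sample
S_b, S_w = generalized_scatter_matrices(X, y, M)
Q_b, Q_w = lda_scatter_matrices(X, y)                   # helper from the previous sketch
assert np.allclose(S_b, Q_b) and np.allclose(S_w, Q_w)
```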
A by-product of this formulation is that if the unique instance-specific class means are linearly independent, the rank of $S_b$ is $\sum_{i=1}^{c} l_i - 1$, which is larger than $c-1$, and so the dimension of the reduced space $d$ can be larger than $c-1$; this overcomes a limitation of the conventional LDA model that $d$ is at most $c-1$. The solution of problem (7) can be obtained by solving the generalized eigen-decomposition problem

$$S_b W = (S_w + \alpha I_D) W \Lambda,$$

where $\Lambda$ is a diagonal matrix containing the largest $d$ eigenvalues.

2.5 Learning M

When $M$ is given, we can compute the transformation matrix $W$. However, in most applications $M$ is unknown and we need to learn it from data automatically. We propose to use the objective function of the MLDA model, which encourages class separability, to learn $M$, that is, to maximize $\mathrm{tr}\big((W^T(S_w + \alpha I_D)W)^{-1} W^T S_b W\big)$ with respect to both $W$ and $M$. However, this problem contains two variables, making it not very easy to solve. In the following theorem, we introduce an upper bound which can be used as a surrogate function.

Theorem 3

$$\max_{W\in\mathbb{R}^{D\times d}} \ \mathrm{tr}\left( \left(W^T (S_w + \alpha I_D) W\right)^{-1} W^T S_b W \right) \le \mathrm{tr}\left( (S_w + \alpha I_D)^{-1} S_b \right).$$

By using Theorem 3, we can use the upper bound as a surrogate function to learn $M$ as

$$\max_{M} \ \mathrm{tr}\left( (S_w + \alpha I_D)^{-1} S_b \right). \quad (8)$$

If we solve this problem directly, we may get a trivial solution such as $m_i = x_i$. In order to avoid this situation, we can utilize the structural sparsity among all the $m_i$'s, namely that each class has a limited number of clusters, which corresponds to the situation that the number of unique instance-specific class means in each class is not very large, or equivalently that many elements of $\{m_i - m_j \mid y_i = y_j\}$ are zero vectors. In the next section, we discuss how to learn $M$ via structurally sparse regularization without knowing the number of clusters in each class.

3 Learning M via Structural Sparsity

When there is no information available about $M$, we need to learn $M$ from data automatically. In this case, we utilize structurally sparse regularization [Hocking et al., 2011; Bach et al., 2012b; Bach et al., 2012a] to enforce sparsity in the set $\{m_i - m_j \mid y_i = y_j\}$. The objective function to learn $M$ is formulated as

$$\min_{M} \ \beta \sum_{i<j} \gamma_{ij}\,\|m_i - m_j\|_p - \mathrm{tr}\left( (S_w + \alpha I_D)^{-1} S_b \right), \quad (9)$$

where $\beta$ is a regularization parameter and the $\gamma_{ij} \ge 0$ are pairwise weights that are positive only for pairs of data points from the same class (their specific choice is given in the experiments). Obviously problem (9) is non-convex due to the non-convexity of the second term in problem (9). To solve problem (9), we use a proximal method for non-convex optimization problems, the GIST algorithm [Gong et al., 2013]. To apply the GIST algorithm to problem (9), we define $f(M)$ and $g(M)$ as $f(M) = -\mathrm{tr}\big((S_w + \alpha I_D)^{-1} S_b\big)$ and $g(M) = \beta \sum_{i<j} \gamma_{ij}\|m_i - m_j\|_p$, so that problem (9) becomes $\min_M f(M) + g(M)$. In each iteration of the GIST algorithm, with $t$ denoting the step parameter and $A$ the gradient-step estimate $M - \frac{1}{t}\nabla f(M)$, we need to solve the proximal problem

$$\min_{Z} \ \frac{t}{2}\|Z - A\|_F^2 + \beta \sum_{i<j} \gamma_{ij}\|z_i - z_j\|_p, \quad (10)$$

which, after normalizing the weights (i.e., setting $\theta_{ij} \propto \beta\gamma_{ij}$ and rescaling $t$ by the same factor), can be rewritten as

$$\min_{Z} \ \frac{t}{2}\|Z - A\|_F^2 + \sum_{i<j} \theta_{ij}\|z_i - z_j\|_p, \quad (11)$$

where $z_i$ is the $i$th column of $Z$. Note that problems (11) and (10) are equivalent. Here we investigate the case $p = 2$ and leave other cases for future study. Problem (11) is a convex proximal problem, but due to the complex regularizer on $Z$ it has no closed-form solution. Here we use the proximal average technique [Yu, 2013] to solve it. The proximal average method approximates the solution of a proximal problem whose regularizer is a convex combination of multiple regularizers by the same convex combination of the solutions of the proximal problems with each regularizer taken individually. One advantage of the proximal average method is that each simple proximal problem can have an analytical solution, and hence so does the approximate solution, leading to a low complexity for solving problem (11). For problem (11), without loss of generality, we assume that $\sum_{i<j} \theta_{ij} = 1$. Then the regularizer in problem (11) is a convex combination of the simple regularizers $\|z_i - z_j\|_p$.
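Before turning to the single-pair subproblems, a compact sketch of the outer GIST-style iteration described above may help. It is a simplified monotone variant (the original GIST of Gong et al. uses a non-monotone acceptance rule and a Barzilai-Borwein initial step), the function and parameter names are ours, and the toy example at the end uses an ℓ1 regularizer purely as a stand-in proximal map, not the pairwise penalty of problem (9).

```python
import numpy as np

def gist(f, grad_f, g, prox, M0, t0=1.0, eta=2.0, sigma=0.2, max_iter=200, tol=1e-10):
    """Simplified monotone GIST iteration for min_M f(M) + g(M):
    a gradient step on the smooth (possibly non-convex) f followed by the
    proximal map of g, increasing the step parameter t until a
    sufficient-decrease condition holds."""
    M = M0.copy()
    obj = f(M) + g(M)
    for _ in range(max_iter):
        G = grad_f(M)
        t = t0
        for _ in range(50):                       # line search on t
            M_new = prox(M - G / t, t)            # solves min_Z t/2 ||Z - A||_F^2 + g(Z)
            obj_new = f(M_new) + g(M_new)
            if obj_new <= obj - 0.5 * sigma * t * np.sum((M_new - M) ** 2):
                break
            t *= eta                              # shrink the step size 1/t and retry
        if np.sum((M_new - M) ** 2) <= tol:
            return M_new
        M, obj = M_new, obj_new
    return M

# toy check with f(M) = 0.5*||M - B||_F^2 and an l1 stand-in for g (soft-threshold prox)
B = np.arange(6.0).reshape(2, 3) - 2.0
soft = lambda A, t: np.sign(A) * np.maximum(np.abs(A) - 1.0 / t, 0.0)
M_hat = gist(lambda M: 0.5 * np.sum((M - B) ** 2), lambda M: M - B,
             lambda M: np.abs(M).sum(), soft, np.zeros((2, 3)))
```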
Suppose $Z^{(i,j)}$ is the solution of the following problem:

$$\min_{Z} \ \frac{t}{2}\|Z - A\|_F^2 + \|z_i - z_j\|_p. \quad (12)$$

Then the proximal average method approximates the solution of problem (11) by $\sum_{j>i} \theta_{ij} Z^{(i,j)}$. Problem (12) has an analytical solution, as shown in the following theorem.

Theorem 4 When $p = 2$, problem (12) has the analytical solution

$$z_k = a_k \ \text{ for } k \neq i, j,$$

$$z_i = \begin{cases} \frac{1}{2}(a_i + a_j) & \text{if } \|a_i - a_j\|_2 \le 2/t \\ a_i - \dfrac{a_i - a_j}{t\,\|a_i - a_j\|_2} & \text{otherwise} \end{cases}, \qquad z_j = \begin{cases} \frac{1}{2}(a_i + a_j) & \text{if } \|a_i - a_j\|_2 \le 2/t \\ a_j - \dfrac{a_j - a_i}{t\,\|a_i - a_j\|_2} & \text{otherwise} \end{cases},$$

where $a_i$ is the $i$th column of $A$.

Moreover, we can divide the set of index pairs $\{(i,j) \mid i < j\}$ into several subsets $S_1, \ldots, S_t$ so that within each subset $S_l$ the indices of different pairs are non-overlapping, i.e., for any two different pairs $(i_1, j_1)$ and $(i_2, j_2)$ in $S_l$, the indices $i_1, j_1, i_2, j_2$ are all distinct. Suppose $Z^{(l)}$ is the solution of the following problem:

$$\min_{Z} \ \frac{t}{2}\|Z - A\|_F^2 + \frac{1}{\theta^{(l)}}\sum_{(i,j)\in S_l} \theta_{ij}\|z_i - z_j\|_p, \quad (13)$$

where $\theta^{(l)} = \sum_{(i,j)\in S_l} \theta_{ij}$. Then the proximal average method approximates the solution of problem (11) by $\sum_{l} \theta^{(l)} Z^{(l)}$. Similar to problem (12), problem (13) has an analytical solution obtained by a similar analysis, and we omit the details. For the construction of the subsets, we can adopt a greedy approach in which $S_1$ collects pairs of successive indices, $S_2$ collects pairs of indices at distance 2, and so on. It is easy to show that the number of subsets is $O(m)$. For the two proximal average variants, we can compute the solutions of all the subproblems (i.e., problems (12) and (13)) in parallel, which accelerates the training process. It is easy to show that problem (12) needs to be solved $O(m^2)$ times, while problem (13) only needs to be solved $O(m)$ times. So, for complexity reasons, we prefer the second approach, even though the analysis of the first approach is the cornerstone of the second. For the whole GIST algorithm in which the proximal subproblem is solved approximately by the proximal average method, by following [Zhong and Kwok, 2014] we can prove that the algorithm converges to a critical point, which can be a local optimum of problem (9).

3.2 Discussion

To see which condition leads different instance-specific class means to take the same value, we investigate a special case of problem (9) in which $p$ equals 2 and $\gamma_{ij}$ is equal to a constant $\gamma$ for all pairs $(i, j)$. Suppose that in one class there exists a cluster consisting of only two data points $x_i$ and $x_j$, with their instance-specific class means denoted by $m_i$ and $m_j$. By setting the derivative of the objective function in problem (9) with respect to $m_i$ and $m_j$ to zero, respectively, we obtain the stationarity conditions

$$\frac{\partial f(M)}{\partial m_i} + \beta\gamma \sum_{k\neq i,j:\ y_k=y_i} \frac{\partial \|m_i - m_k\|_2}{\partial m_i} + \beta\gamma\,\frac{\partial \|m_i - m_j\|_2}{\partial m_i} = 0, \quad (14)$$

$$\frac{\partial f(M)}{\partial m_j} + \beta\gamma \sum_{k\neq i,j:\ y_k=y_j} \frac{\partial \|m_j - m_k\|_2}{\partial m_j} + \beta\gamma\,\frac{\partial \|m_i - m_j\|_2}{\partial m_j} = 0, \quad (15)$$

where $f(M) = -\mathrm{tr}\big((S_w + \alpha I_D)^{-1} S_b\big)$ and $\frac{\partial g(x)}{\partial x}$ denotes the (sub)gradient of a function $g(\cdot)$ with respect to a variable $x$. Since $x_i$ and $x_j$ compose a cluster, $m_i$ is equal to $m_j$ and $m_k$ is not equal to $m_i$ for all other $k$'s in the same class; hence the middle terms in Eqs. (14) and (15) are equal to each other, and based on the chain rule the last terms in those two equations are $\beta\gamma$ times subgradients of the $\ell_2$ norm at the origin, whose $\ell_2$ norms are no larger than 1. By computing the difference between Eqs. (14) and (15), we obtain

$$\Big\|\frac{\partial f(M)}{\partial m_i} - \frac{\partial f(M)}{\partial m_j}\Big\|_2 \le 2\beta\gamma,$$

which implies that data points with similar gradients of the function $f(\cdot)$ are more likely to share the same instance-specific class mean. For nearby data points in one class, the difference between their gradients can be small, and so they tend to lie in one cluster.
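To make the proximal average step of Section 3 concrete, the following sketch implements the closed-form single-pair solution of Theorem 4 (for $p = 2$) and the weighted combination that approximates problem (11). The `theta` dictionary of normalized pair weights, the function names, and the toy values are our own illustration; the blocked variant of problem (13) is not shown.

```python
import numpy as np

def prox_single_pair(A, i, j, t):
    """Closed-form solution of problem (12) for p = 2 (Theorem 4):
    min_Z t/2 ||Z - A||_F^2 + ||z_i - z_j||_2, with z_i, z_j columns of Z."""
    Z = A.copy()
    d = A[:, i] - A[:, j]
    nd = np.linalg.norm(d)
    if nd <= 2.0 / t:                       # the two columns are fused
        Z[:, i] = Z[:, j] = 0.5 * (A[:, i] + A[:, j])
    else:                                   # the two columns move toward each other by 1/t
        Z[:, i] = A[:, i] - d / (t * nd)
        Z[:, j] = A[:, j] + d / (t * nd)
    return Z

def prox_average(A, theta, t):
    """Proximal-average approximation to problem (11): a theta-weighted
    combination of the single-pair solutions; sum of theta assumed to be 1.
    theta: dict mapping a pair (i, j) with i < j to its weight theta_ij."""
    return sum(w * prox_single_pair(A, i, j, t) for (i, j), w in theta.items())

# toy usage: three 2-D columns, pairs (0, 1) and (0, 2) weighted equally
A = np.array([[0.0, 0.1, 3.0],
              [0.0, -0.1, 3.0]])
theta = {(0, 1): 0.5, (0, 2): 0.5}
Z_approx = prox_average(A, theta, t=1.0)
```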
4 Experiments

In this section, we empirically test the performance of the MLDA model.

The methods in comparison include the LDA method [Belhumeur et al., 1997], the LLDA method [Kim and Kittler, 2005], the SDA method [Zhu and Martínez, 2006], and the LFDA method [Sugiyama, 2007]. As discussed in the introduction, the LLDA method is a two-stage method which first clusters the whole dataset via the k-means clustering algorithm and then learns an LDA model on each cluster, the SDA method can learn the number of clusters, and the LFDA method utilizes the local density to model the multimodal structure. Recall that the proposed objective function for learning $M$ in problem (9) is non-convex, making it sensitive to the initial value of $M$. In the following experiments, the initial value of $m_i$ is set to $x_i$. The parameters (i.e., $\eta$, $\sigma$, and $t_0$) of the GIST algorithm are set to 2, 0.2, and 1, respectively. The regularization parameters $\alpha$ and $\beta$ are selected via 10-fold cross validation from the candidate set $\{0.001, 0.01, 0.1, 1\}$, and the hyperparameters of the baselines are also selected via 10-fold cross validation. For a fair comparison with the conventional LDA method, we set the reduced dimension of all the methods in comparison to $c - 1$, where $c$ is the number of classes. After learning the transformations of all the methods, we use the nearest neighbor classifier to make predictions.

4.1 Experiments on Synthetic Data

The first synthetic dataset is generated as follows. There are two classes in this dataset. The first class contains two clusters, with the first cluster generated from a Gaussian distribution $\mathcal{N}((1, 0)^T, I_2)$ and the second one following $\mathcal{N}((20, 0)^T, I_2)$. The second class has three clusters, sampled from $\mathcal{N}((10, 10)^T, I_2)$, $\mathcal{N}((10, 2)^T, I_2)$, and $\mathcal{N}((10, 15)^T, I_2)$, respectively. For each cluster in those two classes, we randomly generate 100 data points to form the training set, and the dataset is plotted in Figure 1(a). The settings of the second synthetic dataset are almost the same as those of the first one, with the difference lying in the covariance matrices of the clusters. Here different covariance matrices are used for the two classes, i.e., the two clusters in the first class use $\mathrm{diag}(1.5, 1)$ as the covariance matrix and all the covariance matrices of the three clusters in the second class equal $\mathrm{diag}(1, 1.5)$.

The one-dimensional transformations learned by the LDA, SDA, LFDA, and MLDA methods on the two synthetic datasets are plotted in Figure 1. The LLDA method is not included here since it cannot discover the clusters within each class. Based on the data distributions of the two synthetic datasets, the ideal transformation is the horizontal direction, against which we can compare the transformations produced by the methods in comparison. From the results, we can see that the discriminative ability of the transformation produced by the SDA model is the worst, since some clusters from different classes overlap after the dimensionality reduction. One reason is that the data distribution violates an assumption of the SDA model that the numbers of clusters in different classes are equal. The LDA model is better than the SDA model, but the data points in different classes still overlap somewhat after the projection, leading to possible classification errors at test time.
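For reference, the two synthetic datasets described above can be generated as follows. This is a sketch that only uses the cluster means, covariances, and the 100 points per cluster stated in the text; the function name and random seed are our own.

```python
import numpy as np

def make_synthetic(version=1, n_per_cluster=100, seed=0):
    """Two-class multimodal toy data: class 1 has clusters at (1,0) and (20,0),
    class 2 has clusters at (10,10), (10,2) and (10,15)."""
    rng = np.random.default_rng(seed)
    means1 = [np.array([1.0, 0.0]), np.array([20.0, 0.0])]
    means2 = [np.array([10.0, 10.0]), np.array([10.0, 2.0]), np.array([10.0, 15.0])]
    if version == 1:
        cov1 = cov2 = np.eye(2)             # dataset I: identity covariances
    else:
        cov1 = np.diag([1.5, 1.0])          # dataset II: class-specific covariances
        cov2 = np.diag([1.0, 1.5])
    X1 = np.vstack([rng.multivariate_normal(m, cov1, n_per_cluster) for m in means1])
    X2 = np.vstack([rng.multivariate_normal(m, cov2, n_per_cluster) for m in means2])
    X = np.vstack([X1, X2])
    y = np.array([0] * len(X1) + [1] * len(X2))
    return X, y

X, y = make_synthetic(version=2)
```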
Figure 1: Experiments on the two synthetic datasets: (a) Synthetic Data I, (b) Synthetic Data II. Blue and black points are from the first and second classes.

The LFDA and MLDA models give the best results, since their learned transformations are almost horizontal. Moreover, one advantage of the proposed MLDA model over the LFDA model is that the MLDA model can identify the clusters in each class via the learned $M$. According to $M$, the clusters found by the MLDA model are exactly the same as the ground truth on those two synthetic datasets, which demonstrates the cluster-finding ability of the MLDA model. The five unique instance-specific class means corresponding to the five clusters in the first dataset are $(0.8864, 0.0386)^T$, $(20.3167, 0.0110)^T$, $(9.6911, 10.0980)^T$, $(10.0881, 1.8842)^T$, and $(9.9684, 15.2609)^T$, and those in the second dataset are $(0.6381, 0.0925)^T$, $(20.3047, 0.0310)^T$, $(10.0538, 9.8551)^T$, $(9.9588, 2.2973)^T$, and $(9.8401, 15.4004)^T$. We can see that the unique instance-specific class means are very close to the means of the underlying Gaussian distributions, which demonstrates the effectiveness of the MLDA model on these two datasets.

4.2 Experiments on Real-World Datasets

Six real-world datasets are used in our experiments, including the ETH-80, COIL-20, COIL-100, AR, UMIST, and MNIST databases. The ETH-80 dataset [Leibe and Schiele, 2003] contains images of the following categories: apples, pears, cars, cows, horses, dogs, tomatoes, and cups; each category includes the images of 10 objects taken at 41 orientations, giving a total of 410 images per category. The COIL-20 dataset contains 1,440 gray-scale images of 20 objects, and each object has 72 images of size 16×16, with successive viewpoints differing by 5 degrees. The images in the COIL-100 dataset are taken in a similar way to the COIL-20 dataset, and it contains 7,200 gray-scale images of 100 objects. The AR dataset [Martínez and Benavente, 1998] contains frontal face images of 100 persons (50 men and 50 women) with different expressions, illuminations, and occlusions; there are 26 images for each person, taken in two sessions of 13 images each. The UMIST dataset [Graham and Allinson, 1998] is a multi-view dataset consisting of 575 gray-scale images of 20 people (subjects), each covering a wide range of poses from profile to frontal views. The MNIST dataset [Le Cun and Cortes, 1998] contains 70,000 handwritten digits ranging from 0 to 9, with about 7,000 samples per digit, where each image is normalized to a gray-level image of size 14×14.

In the proposed MLDA method, $\gamma_{ij}$ is set to $\exp\{-\frac{1}{\sigma_{y_i}^2}\|x_i - x_j\|_2^2\}$ when $y_i = y_j$ and 0 otherwise, where $\sigma_i$ denotes the average pairwise Euclidean distance in the $i$th class. In order to investigate the effect of varying the size of the training set, we randomly sample 30%, 50%, and 70% of the total data as the training set and use the rest as the test set. Each configuration is repeated 10 times, and the average results as well as the standard deviations are plotted in Figure 2.

Figure 2: Test errors on real-world datasets when varying the size of the training set.
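The pairwise weights $\gamma_{ij}$ defined above can be computed as in the following sketch (our own illustration; the matrix is built densely for clarity, and only within-class entries are non-zero).

```python
import numpy as np
from scipy.spatial.distance import cdist

def pair_weights(X, y):
    """gamma_ij = exp(-||x_i - x_j||_2^2 / sigma_{y_i}^2) for same-class pairs,
    0 otherwise; sigma_c is the average pairwise distance within class c."""
    n = X.shape[0]
    D = cdist(X, X)                                          # Euclidean distances
    gamma = np.zeros((n, n))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        if len(idx) < 2:
            continue
        Dc = D[np.ix_(idx, idx)]
        sigma = Dc[np.triu_indices(len(idx), k=1)].mean()    # average within-class distance
        gamma[np.ix_(idx, idx)] = np.exp(-(Dc ** 2) / sigma ** 2)
    np.fill_diagonal(gamma, 0.0)
    return gamma
```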
From the results, we can see that the performance of the LLDA, SDA, and LFDA models is sometimes better than that of the LDA model, for example, on the COIL-20 and COIL-100 datasets, but in other cases they perform only comparably to the LDA model. The proposed MLDA method has the best performance in all settings. On the ETH-80 and AR datasets, the performance of the proposed MLDA method using 30% of the data for training is comparable to that of the SDA or LFDA method with a training percentage of 70%, which further shows the effectiveness of the MLDA method.

5 Conclusion

In this paper, we propose a multimodal extension of the LDA model by generalizing the between-class and within-class scatter matrices based on the instance-specific class means assigned to each data point. The learning of the instance-specific class means is accomplished by maximizing a surrogate function regularized by structural sparsity, and the resulting objective function is solved by a proximal algorithm. The objective function in problem (8) can be viewed as a loss function for $M$ that encourages class separability, but it is not the only choice. We will try other criteria, such as the stability criterion proposed in [Martínez and Zhu, 2005]; the implementation can be reused since we only need to modify the definition of the function $f(\cdot)$ and the calculation of its gradient. Moreover, the proposed MLDA model is a linear model, and we are interested in investigating its nonlinear extension.

Acknowledgments

This work is supported by the NSFC (61473087, 61673201, 61673202), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Natural Science Foundation of Jiangsu Province (BK20141340).

References

[Bach et al., 2012a] Francis R. Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

[Bach et al., 2012b] Francis R. Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Structured sparsity through convex optimization. Statistical Science, 27(4):450–468, 2012.

[Belhumeur et al., 1997] Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711–720, 1997.

[Gong et al., 2013] Pinghua Gong, Changshui Zhang, Zhaosong Lu, Jianhua Huang, and Jieping Ye. A general iterative shrinkage and thresholding algorithm for nonconvex regularized optimization problems. In Proceedings of the 30th International Conference on Machine Learning, pages 37–45, 2013.

[Graham and Allinson, 1998] Daniel B. Graham and Nigel Allinson. Characterizing virtual eigensignatures for general purpose face recognition. In Face Recognition: From Theory to Applications, volume 163, pages 446–456. Springer, 1998.

[Hastie and Tibshirani, 1996] Trevor Hastie and Robert Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B, 58(1):155–176, 1996.

[Hocking et al., 2011] Toby Hocking, Jean-Philippe Vert, Francis R. Bach, and Armand Joulin. Clusterpath: An algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, pages 745–752, 2011.

[Kim and Kittler, 2005] Tae-Kyun Kim and Josef Kittler. Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):318–327, 2005.
[Kuo and Landgrebe, 2004] B.-C. Kuo and D. A. Landgrebe. Nonparametric weighted feature extraction for classification. IEEE Transactions on Geoscience and Remote Sensing, 42(5):1096–1105, 2004.

[Le Cun and Cortes, 1998] Yann Le Cun and Corinna Cortes. The MNIST database of handwritten digits, 1998.

[Leibe and Schiele, 2003] Bastian Leibe and Bernt Schiele. Analyzing appearance and contour based methods for object categorization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 409–415, 2003.

[Li et al., 2009] Zhifeng Li, Dahua Lin, and Xiaoou Tang. Nonparametric discriminant analysis for face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):755–761, 2009.

[Martínez and Benavente, 1998] Aleix M. Martínez and Robert Benavente. The AR-face database. Technical Report 24, CVC, 1998.

[Martínez and Zhu, 2005] Aleix M. Martínez and Manli Zhu. Where are linear feature extraction methods applicable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1934–1944, 2005.

[Sugiyama, 2006] Masashi Sugiyama. Local Fisher discriminant analysis for supervised dimensionality reduction. In Proceedings of the 23rd International Conference on Machine Learning, pages 905–912, 2006.

[Sugiyama, 2007] Masashi Sugiyama. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027–1061, 2007.

[Yu, 2013] Yaoliang Yu. Better approximation and faster algorithm using the proximal average. In Advances in Neural Information Processing Systems 26, pages 458–466, 2013.

[Zhong and Kwok, 2014] Wenliang Zhong and James T. Kwok. Gradient descent with proximal average for nonconvex and composite regularization. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 2206–2212, 2014.

[Zhu and Martínez, 2006] Manli Zhu and Aleix M. Martínez. Subclass discriminant analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1274–1286, 2006.