DEEP LINEAR DISCRIMINANT ANALYSIS

Matthias Dorfer, Rainer Kelz & Gerhard Widmer
Department of Computational Perception
Johannes Kepler University Linz
Linz, 4040, AUT
{matthias.dorfer, rainer.kelz, gerhard.widmer}@jku.at

ABSTRACT

We introduce Deep Linear Discriminant Analysis (DeepLDA), which learns linearly separable latent representations in an end-to-end fashion. Classic LDA extracts features which preserve class separability and is used for dimensionality reduction in many classification problems. The central idea of this paper is to put LDA on top of a deep neural network. This can be seen as a non-linear extension of classic LDA. Instead of maximizing the likelihood of target labels for individual samples, we propose an objective function that pushes the network to produce feature distributions which (a) have low variance within the same class and (b) have high variance between different classes. Our objective is derived from the general LDA eigenvalue problem and still allows training with stochastic gradient descent and back-propagation. For evaluation we test our approach on three different benchmark datasets (MNIST, CIFAR-10 and STL-10). DeepLDA produces competitive results on MNIST and CIFAR-10 and outperforms a network trained with categorical cross entropy (having the same architecture) in a supervised setting of STL-10.

1 INTRODUCTION

Linear Discriminant Analysis (LDA) is a method from multivariate statistics which seeks to find a linear projection of high-dimensional observations into a lower-dimensional space (Fisher, 1936). When its preconditions are fulfilled, LDA yields optimal linear decision boundaries in the resulting latent space. The aim of this paper is to exploit the beneficial properties of classic LDA (low intra-class variability, high inter-class variability, optimal decision boundaries) by reformulating its objective to learn linearly separable representations with a deep neural network (DNN).

Recently, methods related to LDA have achieved great success in combination with deep neural networks. Andrew et al. published a deep version of Canonical Correlation Analysis (DCCA) (Andrew et al., 2013). In their evaluations, DCCA is used to produce correlated representations of multi-modal input data of simultaneously recorded acoustic and articulatory speech data. Clevert et al. propose Rectified Factor Networks (RFNs), a neural network interpretation of classic factor analysis (Clevert et al., 2015). RFNs are used for unsupervised pre-training and help to improve classification performance on four different benchmark datasets. A similar method called PCANet, as well as an LDA-based variation, was proposed by Chan et al. (2015). PCANet can be seen as a simple unsupervised convolutional deep learning approach. The method proceeds with cascaded Principal Component Analysis (PCA), binary hashing and block histogram computations. However, one crucial bottleneck of their approach is its limitation to very shallow architectures (two stages) (Chan et al., 2015). Stuhlsatz et al. already picked up the idea of combining LDA with neural networks and proposed a generalized version of LDA (Stuhlsatz et al., 2012). Their approach starts with pre-training a stack of restricted Boltzmann machines. In a second step, the pre-trained model is fine-tuned with respect to a linear discriminant criterion.
LDA has the disadvantage that it overemphasises large distances at the cost of confusing neighbouring classes. In (Stuhlsatz et al., 2012) this problem is tackled by a heuristic weighting scheme for computing the within-class scatter matrix required for LDA optimization.

1.1 MAIN IDEA OF THIS PAPER

The approaches mentioned so far all have in common that they are based on well-established methods from multivariate statistics. Inspired by their work, we propose an end-to-end DNN version of LDA, namely Deep Linear Discriminant Analysis (DeepLDA). Deep learning has become the state of the art in automatic feature learning and has replaced approaches based on hand-engineered features in many fields such as object recognition (Krizhevsky et al., 2012). DeepLDA is motivated by the fact that when the preconditions of LDA are met, it is capable of finding linear combinations of the input features which allow for optimal linear decision boundaries. In general, LDA takes features as input. The intuition of our method is to use LDA as an objective on top of a powerful feature learning algorithm. Instead of maximizing the likelihood of target labels for individual samples, we propose an LDA eigenvalue-based objective function that pushes the network to produce discriminative feature distributions. The parameters are optimized by back-propagating the error of an LDA-based objective through the entire network. We tackle the feature learning problem by focusing on the directions in the latent space with the smallest discriminative power. This replaces the weighting scheme of (Stuhlsatz et al., 2012) and allows us to operate on the original formulation of LDA. We expect that DeepLDA will produce linearly separable hidden representations with similar discriminative power in all directions of the latent space. Such representations should also translate into a high classification potential of the respective networks. The experimental classification results reported below confirm this positive effect on classification accuracy, and two additional experiments (Section 5) give some first qualitative confirmation that the learned representations show the expected properties.

The remainder of the paper is structured as follows. In Section 2 we provide a general formulation of a DNN. Based on this formulation we introduce DeepLDA, a non-linear extension of classic LDA, in Section 3. In Section 4 we experimentally evaluate our approach on three benchmark datasets. Section 5 provides a deeper insight into the structure of DeepLDA's internal representations. In Section 6 we conclude the paper.

2 DEEP NEURAL NETWORKS

As the proposed model is built on top of a DNN, we briefly describe the training paradigm of a network used for classification problems such as object recognition. A neural network with $P$ hidden layers is represented as a non-linear function $f(\Theta)$ with model parameters $\Theta = \{\Theta_1, \ldots, \Theta_P\}$. In the supervised setting we are additionally given a set of $N$ train samples $x_1, \ldots, x_N$ along with corresponding classification targets $t_1, \ldots, t_N \in \{1, \ldots, C\}$. We further assume that the network output $p_i = (p_{i,1}, \ldots, p_{i,C}) = f(x_i, \Theta)$ is normalized by the softmax function to obtain class (pseudo-)probabilities. The network is then optimized using Stochastic Gradient Descent (SGD) with the goal of finding an optimal model parametrization $\Theta^{*}$ with respect to a certain loss function $l_i(\Theta) = l(f(x_i, \Theta), t_i)$:

$$\Theta^{*} = \underset{\Theta}{\arg\min} \; \sum_{i=1}^{N} l_i(\Theta) \qquad (1)$$

For multi-class classification problems, Categorical Cross Entropy (CCE) is a commonly used optimization target, formulated for observation $x_i$ and target label $t_i$ as

$$l_i(\Theta) = -\sum_{j=1}^{C} y_{i,j} \log(p_{i,j}) \qquad (2)$$

where $y_{i,j}$ is 1 if observation $x_i$ belongs to class $t_i$ ($j = t_i$) and 0 otherwise. In particular, the CCE tries to maximize the likelihood of the target class $t_i$ for each of the individual training examples $x_i$ under the model with parameters $\Theta$. Figure 1a shows a sketch of this general network architecture. We would like to emphasize that objectives such as CCE do not impose any direct constraints such as linear separability on the latent space representation.
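For concreteness, a minimal NumPy sketch of the softmax normalization and the CCE loss of Equations (1) and (2); function and variable names are ours for illustration and do not come from the paper's code:

```python
# Minimal sketch of softmax + categorical cross entropy (Eq. (1)-(2)); illustration only,
# not the authors' implementation. Targets are assumed to be 0-based integer labels.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cce_loss(net_out, targets, C):
    """Mean CCE over a batch: -sum_j y_ij * log(p_ij), with y the one-hot targets."""
    p = softmax(net_out)                          # (N, C) class (pseudo-)probabilities p_i
    y = np.eye(C)[targets]                        # one-hot encoding of the labels t_i
    return -np.mean(np.sum(y * np.log(p + 1e-12), axis=1))
```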
Figure 1: Schematic sketch of a DNN and DeepLDA. (a) The output of the network is normalized by a softmax layer to form valid probabilities; the CCE objective maximizes the likelihood of the target class under the model. (b) On the topmost hidden layer we compute an LDA which produces corresponding eigenvalues; the optimization target is to maximize those eigenvalues. For both architectures the input data is first propagated through the layers of the DNN; however, the final layer and the optimization target are different.

3 DEEP LINEAR DISCRIMINANT ANALYSIS (DEEPLDA)

In this section we first provide a general introduction to LDA. Based on this introduction we propose DeepLDA, which optimizes an LDA-based target in an end-to-end DNN fashion. Finally, we describe how DeepLDA is used to predict class probabilities for unseen test samples.

3.1 LINEAR DISCRIMINANT ANALYSIS

Let $X = \{x_1, \ldots, x_N\} \in \mathbb{R}^{N \times d}$ denote a set of $N$ samples belonging to $C$ different classes $c \in \{1, \ldots, C\}$. The input representation $X$ can either be hand-engineered features or hidden-space representations $H$ produced by a DNN (Andrew et al., 2013). LDA seeks a linear projection $A \in \mathbb{R}^{l \times d}$ into a lower $l$-dimensional subspace $L$, where $l = C - 1$. The resulting linear combinations of features $x_i A^T$ are maximally separated in this space (Fisher, 1936). The LDA objective to find the projection matrix $A$ is formulated as

$$\underset{A}{\arg\max} \; \frac{|A S_b A^T|}{|A S_w A^T|} \qquad (3)$$

where $S_b$ is the between-class scatter matrix, defined via the total scatter matrix $S_t$ and the within-class scatter matrix $S_w$ as $S_b = S_t - S_w$. $S_w$ is defined as the mean of the $C$ individual class covariance matrices $S_c$ (Equations (4) and (5)). $\bar{X}_c = X_c - m_c$ are the mean-centered observations of class $c$ with per-class mean vector $m_c$ ($\bar{X}$ is defined analogously for the entire population $X$). The total scatter matrix $S_t$ is the covariance matrix over the entire population of observations $X$:

$$S_c = \frac{1}{N_c - 1} \bar{X}_c^T \bar{X}_c \qquad (4)$$

$$S_w = \frac{1}{C} \sum_{c=1}^{C} S_c \qquad (5)$$

$$S_t = \frac{1}{N - 1} \bar{X}^T \bar{X} \qquad (6)$$

The linear combinations that maximize the objective in Equation (3) maximize the ratio of between- and within-class scatter, also referred to as separation. This means in particular that a set of projected observations of the same class shows low variance, whereas the projections of observations of different classes have high variance in the resulting space $L$. To find the optimum solution of Equation (3) one has to solve the general eigenvalue problem $S_b e = v S_w e$. The projection matrix $A$ is the set of eigenvectors $e$ associated with this problem. In the following sections we cast LDA as an objective function for a DNN.
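As a concrete reference point, the classic LDA computation of Equations (4)-(6) and the associated generalized eigenvalue problem can be sketched in a few lines of NumPy/SciPy. This is a sketch with our own function names, not the implementation used in the paper, and it assumes $S_w$ is non-singular:

```python
# Sketch of classic LDA (Section 3.1): scatter matrices and the generalized
# eigenvalue problem S_b e = v S_w e. Not the authors' code.
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y):
    N, d = X.shape
    classes = np.unique(y)
    C = len(classes)
    # within-class scatter: mean of the per-class covariance matrices (Eq. (4)-(5))
    Sw = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c] - X[y == c].mean(axis=0)
        Sw += Xc.T @ Xc / (len(Xc) - 1)
    Sw /= C
    # total scatter (Eq. (6)) and between-class scatter S_b = S_t - S_w
    Xbar = X - X.mean(axis=0)
    St = Xbar.T @ Xbar / (N - 1)
    Sb = St - Sw
    # generalized eigenvalue problem; keep the C-1 leading eigenvectors as rows of A
    vals, vecs = eigh(Sb, Sw)                 # eigenvalues in ascending order
    A = vecs[:, -(C - 1):].T                  # projection matrix A (l x d, l = C-1)
    return A, vals[-(C - 1):]
```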
3.2 DEEPLDA MODEL CONFIGURATION

Figure 1b shows a schematic sketch of DeepLDA. Instead of sample-wise optimization of the CCE loss on the predicted class probabilities (see Section 2), we put an LDA layer on top of the DNN. This means in particular that we do not penalize the misclassification of individual samples. Instead, we try to produce features that show low intra-class and high inter-class variability. We address this maximization problem with a modified version of the general LDA eigenvalue problem, proposed in the following section. In contrast to CCE, DeepLDA optimization operates on the properties of the distribution parameters of the hidden representation produced by the neural net. As eigenvalue optimization is tied to its corresponding eigenvectors (a linear projection matrix), DeepLDA can also be seen as a special case of a dense layer.

3.3 MODIFIED DEEPLDA OPTIMIZATION TARGET

Based on Section 3.1 we reformulate the LDA objective to be suitable for a combination with deep learning. As already discussed by Stuhlsatz et al. (2012) and Lu et al. (2005), the estimation of $S_w$ overemphasises high eigenvalues whereas small eigenvalues are estimated as too low. To weaken this effect, Friedman (1989) proposed to regularize the within-class scatter matrix by adding a multiple of the identity matrix, $S_w + \lambda I$. Adding the identity matrix has the second advantage of stabilizing small eigenvalues. The resulting eigenvalue problem is then formulated as

$$S_b e_i = v_i (S_w + \lambda I) e_i \qquad (7)$$

where $e = e_1, \ldots, e_{C-1}$ are the resulting eigenvectors and $v = v_1, \ldots, v_{C-1}$ the corresponding eigenvalues. Once the problem is solved, each eigenvalue $v_i$ quantifies the amount of discriminative variance (separation) in the direction of the corresponding eigenvector $e_i$. If one would like to combine this objective with a DNN, the optimization target would be the maximization of the individual eigenvalues. In particular, we expect that maximizing the individual eigenvalues, which reflect the separation in the respective eigenvector directions, leads to a maximization of the discriminative power of the neural net. In our initial experiments we started to formulate the objective as:

$$\underset{\Theta}{\arg\max} \; \frac{1}{C-1} \sum_{i=1}^{C-1} v_i \qquad (8)$$

One problem we discovered with the objective in Equation (8) is that the net favours trivial solutions, e.g. maximizing only the largest eigenvalue, as this produces the highest reward. In terms of classification this means that it maximizes the distance of classes that are already separated at the expense of potentially non-separated neighbouring classes. This was already discussed by Stuhlsatz et al. (2012) and tackled by a weighted computation of the between-class scatter matrix $S_b$. We propose a different solution to this problem and address it by focusing our optimization on the smallest of all $C - 1$ available eigenvalues. In particular, we consider only the $k$ eigenvalues that do not exceed a certain threshold for variance maximization:

$$\underset{\Theta}{\arg\max} \; \frac{1}{k} \sum_{i=1}^{k} v_i \quad \text{with} \quad \{v_1, \ldots, v_k\} = \{v_j \mid v_j < \min\{v_1, \ldots, v_{C-1}\} + \epsilon\} \qquad (9)$$

The intuition behind this formulation is to learn a net parametrization that pushes as much discriminative variance as possible into all of the $C - 1$ available feature dimensions.
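A NumPy/SciPy sketch of the forward computation of this objective (Equations (7) and (9)) on a mini-batch of topmost hidden features H; helper and parameter names are ours, and in the actual training the gradient of this quantity is back-propagated through the network (see Appendix A):

```python
# Sketch of the modified DeepLDA objective (Eq. (7) and (9)); forward pass only,
# not the authors' Theano implementation.
import numpy as np
from scipy.linalg import eigh

def deeplda_objective(H, y, lam=1e-3, eps=1.0):
    """Mean of the thresholded eigenvalues of Eq. (7) on hidden features H (N x dim)."""
    N, dim = H.shape
    classes = np.unique(y)
    C = len(classes)
    Sw = np.zeros((dim, dim))
    for c in classes:                                 # within-class scatter (Eq. (4)-(5))
        Hc = H[y == c] - H[y == c].mean(axis=0)
        Sw += Hc.T @ Hc / (len(Hc) - 1)
    Sw /= C
    Hbar = H - H.mean(axis=0)
    St = Hbar.T @ Hbar / (N - 1)                      # total scatter (Eq. (6))
    Sb = St - Sw                                      # between-class scatter
    # Eq. (7): regularized eigenvalue problem; keep the C-1 non-trivial eigenvalues
    v = eigh(Sb, Sw + lam * np.eye(dim), eigvals_only=True)[-(C - 1):]
    # Eq. (9): only eigenvalues within eps of the smallest one contribute
    v_k = v[v < v.min() + eps]
    return v_k.mean()       # to be maximized (i.e. minimize its negative with SGD)
```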
We would like to underline that this formulation allows training DeepLDA networks with back-propagation in an end-to-end fashion (see Appendix A for a derivation of the loss function's gradient). Our models are optimized with the Nesterov momentum version of mini-batch SGD. Related methods have already shown that mini-batch learning on distribution parameters (in this case covariance matrices) is feasible if the batch size is sufficiently large to be representative of the entire population (Wang et al., 2015a;b).

3.4 CLASSIFICATION BY DEEPLDA

This section describes how the most likely class label is assigned to an unseen test sample $x_t$ once the network is trained and parametrized. In a first step we compute the topmost hidden representation $H$ on the entire training set $X$. On this hidden representation we compute the LDA as described in Sections 3.1 and 3.3, producing the corresponding eigenvectors $e = \{e_i\}_{i=1}^{C-1}$ which form the LDA projection matrix $A$. We would like to emphasize that, since the parameters of the network are fixed at this stage, we make use of the entire training set to provide a stable estimate of the LDA projection. Based on $A$ and the per-class mean hidden representations $\bar{H}_c = (\bar{h}_1^T, \ldots, \bar{h}_C^T)$, the distances of sample $h_t$ to the linear decision hyperplanes (Friedman et al., 2001) are defined as

$$d = h_t^T T^T - \frac{1}{2} \operatorname{diag}\left(\bar{H}_c T^T\right) \quad \text{with} \quad T = \bar{H}_c A A^T \qquad (10)$$

where $T$ are the decision hyperplane normal vectors. The subtracted term is the bias of the decision functions, placing the decision boundaries in between the means of the respective class hidden representations (no class priors included). The vector of class probabilities for test sample $x_t$ is then computed by applying the logistic function $p'_c = 1/(1 + e^{-d_c})$ and further normalized by $p_c = p'_c / \sum_i p'_i$ to sum to one. Finally, we assign the class with the highest probability, $\arg\max_i p_i$, to the unseen test sample $x_t$.
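A sketch of this decision rule, reusing the lda_projection helper sketched in Section 3.1 (hypothetical names, NumPy only; not the paper's code):

```python
# Sketch of DeepLDA classification (Eq. (10)): project class means onto the LDA
# subspace and back, compute hyperplane distances, then logistic pseudo-probabilities.
import numpy as np

def deeplda_predict(h_test, H_train, y_train):
    """h_test: (M, dim) hidden reps of test samples; H_train, y_train: training set."""
    A, _ = lda_projection(H_train, y_train)                # LDA on the full training set
    classes = np.unique(y_train)
    Hc = np.stack([H_train[y_train == c].mean(axis=0) for c in classes])  # class means
    T = Hc @ A.T @ A                                       # decision hyperplane normals
    d = h_test @ T.T - 0.5 * np.diag(Hc @ T.T)             # distances incl. bias term
    p = 1.0 / (1.0 + np.exp(-d))                           # logistic function
    p = p / p.sum(axis=1, keepdims=True)                   # normalize to sum to one
    return classes[np.argmax(p, axis=1)], p
```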
4 EXPERIMENTS

In this section we present an experimental evaluation of DeepLDA on three benchmark datasets, namely MNIST, CIFAR-10 and STL-10 (see Figure 2 for some sample images). We compare the results of DeepLDA with the CCE-based optimization target as well as the present state of the art on the respective datasets. In addition, we provide details on the network architectures, hyper-parameters and respective training/optimization approaches used in our experiments.

Figure 2: Example images of evaluation data sets: (a)(b) MNIST, (c)(d) CIFAR-10, (e)(f) STL-10. The relative size differences between images from the three data sets are kept in this visualization.

4.1 EXPERIMENTAL SETUP

The general structure of the networks is similar for all three datasets and identical for CIFAR-10 and STL-10. The architecture follows the VGG model with sequences of 3x3 convolutions (Simonyan & Zisserman, 2014). Instead of a dense classification layer we use global average pooling on the feature maps of the last convolution layer (Lin et al., 2013). We picked this architecture as it leads to well-posed problems for covariance estimation: many samples vs. a low feature space dimension. We further apply batch normalization (Ioffe & Szegedy, 2015) after each convolutional layer, which (1) helped to increase convergence speed and (2) improved the performance of all our models. Batch normalization has a positive effect on both CCE- as well as DeepLDA-based optimization. In Table 1 we outline the structure of our models in detail.

All networks are trained using SGD with Nesterov momentum. The initial learning rate is set to 0.1 and the momentum is fixed at 0.9 for all our models. The learning rate is then halved every 25 epochs for CIFAR-10 and STL-10 and every 10 epochs for MNIST. For further regularization we add weight decay with a weighting of 0.0001 on all trainable parameters of the models. The within-class covariance matrix regularization weight lambda (see Section 3.3) is set to 0.001 and the epsilon-offset for DeepLDA to 1. One hyper-parameter that varies between the datasets is the batch size used for training DeepLDA. Although a large batch size is desirable to get stable covariance estimates, it is limited by the amount of memory available on the GPU. The mini-batch sizes for DeepLDA were 1000 for MNIST, 1000 for CIFAR-10 and 200 for STL-10. For CCE training, a batch size of 128 is used for all datasets. The models are trained on an NVIDIA Tesla K40 with 12GB of GPU memory.

Table 1: Model specifications. BN: Batch Normalization, ReLu: Rectified Linear Activation Function, CCE: Categorical Cross Entropy. The mini-batch sizes of DeepLDA are: MNIST (1000), CIFAR-10 (1000), STL-10 (200). For CCE training a constant batch size of 128 is used.

CIFAR-10 and STL-10:
Input 3x32x32 (96x96)
3x3 Conv(pad-1)-64-BN-ReLu
3x3 Conv(pad-1)-64-BN-ReLu
2x2 Max-Pooling + Drop-Out(0.25)
3x3 Conv(pad-1)-128-BN-ReLu
3x3 Conv(pad-1)-128-BN-ReLu
2x2 Max-Pooling + Drop-Out(0.25)
3x3 Conv(pad-1)-256-BN-ReLu
3x3 Conv(pad-1)-256-BN-ReLu
3x3 Conv(pad-1)-256-BN-ReLu
3x3 Conv(pad-1)-256-BN-ReLu
2x2 Max-Pooling + Drop-Out(0.25)
3x3 Conv(pad-0)-1024-BN-ReLu
Drop-Out(0.5)
1x1 Conv(pad-0)-1024-BN-ReLu
Drop-Out(0.5)
1x1 Conv(pad-0)-10-BN-ReLu
2x2 (10x10) Global-Average-Pooling
Soft-Max with CCE or LDA-Layer

MNIST:
Input 1x28x28
3x3 Conv(pad-1)-64-BN-ReLu
3x3 Conv(pad-1)-64-BN-ReLu
2x2 Max-Pooling + Drop-Out(0.25)
3x3 Conv(pad-1)-96-BN-ReLu
3x3 Conv(pad-1)-96-BN-ReLu
2x2 Max-Pooling + Drop-Out(0.25)
3x3 Conv(pad-0)-256-BN-ReLu
Drop-Out(0.5)
1x1 Conv(pad-0)-256-BN-ReLu
Drop-Out(0.5)
1x1 Conv(pad-0)-10-BN-ReLu
5x5 Global-Average-Pooling
Soft-Max with CCE or LDA-Layer
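The CIFAR-10/STL-10 column of Table 1 translates into a standard convolutional stack; the following PyTorch sketch is an illustrative re-implementation under that reading of the table (the original models were built with Theano/Lasagne, see the acknowledgments), not the authors' code:

```python
# Sketch of the CIFAR-10 / STL-10 column of Table 1 in PyTorch (illustrative only).
import torch.nn as nn

def conv_bn_relu(c_in, c_out, kernel, pad):
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel, padding=pad),
                         nn.BatchNorm2d(c_out), nn.ReLU())

cifar_stl_net = nn.Sequential(
    conv_bn_relu(3, 64, 3, 1), conv_bn_relu(64, 64, 3, 1),
    nn.MaxPool2d(2), nn.Dropout(0.25),
    conv_bn_relu(64, 128, 3, 1), conv_bn_relu(128, 128, 3, 1),
    nn.MaxPool2d(2), nn.Dropout(0.25),
    conv_bn_relu(128, 256, 3, 1), conv_bn_relu(256, 256, 3, 1),
    conv_bn_relu(256, 256, 3, 1), conv_bn_relu(256, 256, 3, 1),
    nn.MaxPool2d(2), nn.Dropout(0.25),
    conv_bn_relu(256, 1024, 3, 0), nn.Dropout(0.5),
    conv_bn_relu(1024, 1024, 1, 0), nn.Dropout(0.5),
    conv_bn_relu(1024, 10, 1, 0),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),     # global average pooling -> 10-dim output
)
# The 10-dimensional output is fed either to a softmax/CCE layer or to the LDA objective.
```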
4.2 EXPERIMENTAL RESULTS

We describe the benchmark datasets as well as the pre-processing and data augmentation used for training. We present our results and relate them to the present state of the art for the respective dataset. As DeepLDA is supposed to produce a linearly separable feature space, we also report the results of a linear Support Vector Machine trained on the latent space of DeepLDA (tagged with LinSVM). The results of our network architecture trained with CCE are marked as Our Net CCE. To provide a complete picture of our experimental evaluation we also show classification results of an LDA on the topmost hidden representation of the networks trained with CCE (tagged with Our Net CCE(LDA)).

4.2.1 MNIST

The MNIST dataset consists of 28x28 gray-scale images of handwritten digits ranging from 0 to 9. The dataset is structured into 50000 train samples, 10000 validation samples and 10000 test samples. For training we did not apply any pre-processing or data augmentation. We present results for two different scenarios. In scenario MNIST-50k we train on the 50000 train samples and use the validation set to pick the best-performing parametrization. In scenario MNIST-60k we train the model for the same number of epochs as in MNIST-50k but also use the validation set for training. Finally, we report the accuracy of the model on the test set after the last training epoch. This approach was also applied in (Lin et al., 2013), which produces state-of-the-art results on the dataset. Table 2 summarizes all results on the MNIST dataset. DeepLDA produces competitive results with a test set error of 0.29% although no data augmentation is used. In the approach described in (Graham, 2014) the train set is extended with translations of up to two pixels. We also observe that a linear SVM trained on the learned representation produces comparable results on the test set. It is also interesting that early stopping with best-model selection (MNIST-50k) performs better than training on MNIST-60k even though 10000 more training examples are available.

Table 2: Comparison of test errors on MNIST.
NIN + Dropout (Lin et al., 2013): 0.47%
Maxout (Goodfellow et al., 2013): 0.45%
DeepCNet(5,60) (Graham, 2014): 0.31% (train set translation)
Our Net CCE(LDA)-50k: 0.39%
Our Net CCE-50k: 0.37%
Our Net CCE-60k: 0.34%
DeepLDA-60k: 0.32%
Our Net CCE(LDA)-60k: 0.30%
DeepLDA-50k: 0.29%
DeepLDA-50k (LinSVM): 0.29%

4.2.2 CIFAR-10

The CIFAR-10 dataset consists of tiny 32x32 natural RGB images containing samples of 10 different classes. The dataset is structured into 50000 train samples and 10000 test samples. We pre-processed the dataset using global contrast normalization and ZCA whitening as proposed by Goodfellow et al. (2013). During training we only apply random left-right flips on the images; no additional data augmentation is used. In training, we follow the same procedure as described for the MNIST dataset above to make use of the entire 50000 train images. Table 3 summarizes our results and relates them to the present state of the art. Both Our Net CCE and DeepLDA produce state-of-the-art results on the dataset when no data augmentation is used. Although DeepLDA performs slightly worse than CCE, it is capable of producing competitive results on CIFAR-10.

Table 3: Comparison of test errors on CIFAR-10.
NIN + Dropout (Lin et al., 2013): 10.41%
Maxout (Graham, 2014): 9.38%
NIN + Dropout (Lin et al., 2013): 8.81% (data augmentation)
DeepCNiN(5,300) (Graham, 2014): 6.28% (data augmentation)
DeepLDA (LinSVM): 7.58%
DeepLDA: 7.29%
Our Net CCE(LDA): 7.19%
Our Net CCE: 7.10%

4.2.3 STL-10

Like CIFAR-10, the STL-10 dataset contains natural RGB images of 10 different object categories. However, with 96x96 pixels the images are larger, and the training set is considerably smaller, with only 5000 images. The test set consists of 8000 images. In addition, STL-10 contains 100000 unlabelled images, but we do not make use of this additional data at this point as our approach is fully supervised. For that reason we first perform an experiment (Method-4k) where we do not follow the evaluation strategy described in (Coates et al., 2011), where models are trained on 1000 labeled and 100000 unlabeled images. Instead, we directly compare CCE and DeepLDA in a fully supervised setting. As with MNIST-50k, we train our models on 4000 of the train images and use the remaining 1000 images as a validation set to pick the best-performing parametrization. The results for the Method-4k setting of STL-10 are presented in the top part of Table 4. Our model trained with CCE achieves an accuracy of 78.39%. The same architecture trained with DeepLDA improves the test set accuracy by more than 3 percentage points and achieves 81.46%. In our second experiment (Method-1k) we follow the evaluation strategy described in (Coates et al., 2011) but without using the unlabelled data. We train our models on the 10 pre-defined folds (each fold contains 1000 train images) and report the average accuracy on the test set.
Table 4: Comparison of test set accuracy in a purely supervised setting of STL-10 (Method-4k: 4000 train images, Method-1k: 1000 train images).
Method-4k:
Our Net CCE(LDA)-4k: 78.50%
Our Net CCE-4k: 78.84%
DeepLDA-4k: 81.16%
DeepLDA (LinSVM)-4k: 81.40%
Method-1k:
SWWAE (Zhao et al., 2015): 57.45%
SWWAE (Zhao et al., 2015): 74.33% (semi-supervised)
DeepLDA (LinSVM)-1k: 55.92%
Our Net CCE-1k: 57.44%
Our Net CCE(LDA)-1k: 59.48%
DeepLDA-1k: 66.97%

The model optimized with CCE (Our Net CCE-1k) achieves 57.44% accuracy on the test set, which is in line with the supervised results reported in (Zhao et al., 2015). Our model trained with DeepLDA achieves 66.97% average test set accuracy. This is a performance gain of 9.53 percentage points over CCE, and it shows that the advantage of DeepLDA compared to CCE becomes even more apparent when the amount of labeled data is low. When comparing DeepLDA-1k with an LDA applied to the features computed by a network trained with CCE (Our Net CCE(LDA)-1k, 59.48%), we find that the end-to-end trained LDA features outperform the standard CCE approach. A direct comparison with state-of-the-art results as reported in (Zhao et al., 2015; Swersky et al., 2013; Dosovitskiy et al., 2014) is not possible because these models are trained under semi-supervised conditions using both unlabelled and labelled data. However, the results suggest that a combination of DeepLDA with methods such as the one proposed by Zhao et al. (2015) is a very promising future direction.

5 INVESTIGATIONS ON DEEPLDA AND DISCUSSION

In this section we provide deeper insights into the representations learned by DeepLDA. We experimentally investigate the eigenvalue structure of representations learned by DeepLDA as well as its relation to the classification potential of the respective networks.

5.1 DOES IMAGE SIZE AFFECT DEEPLDA?

DeepLDA shows its best performance on the STL-10 dataset (Method-4k), where it outperforms CCE by 3 percentage points. The major difference between STL-10 and CIFAR-10, apart from the number of train images, is the size of the contained images (see Figure 2 to get an impression of the size relations). To get a deeper insight into the influence of this parameter we run the following additional experiment: (1) we create a downscaled version of the STL-10 dataset with the same image dimensions as CIFAR-10 (32x32); (2) we repeat the experiment (Method-4k) described in Section 4.2.3 on the downscaled 32x32 dataset. The results are presented in Figure 3, as curves showing the evolution of train and validation accuracy during training.

Figure 3: Comparison of the learning curves of DeepLDA on the original STL-10 dataset (Method-4k) with image size 96x96 (a) and its downscaled 32x32 version (b).

As expected, downscaling reduces the performance of both CCE and DeepLDA. We further observe that DeepLDA performs best when trained on larger images and has a disadvantage on the small images. However, a closer look at the results on CIFAR-10 (CCE: 7.10% error, DeepLDA: 7.29% error, see Table 3) suggests that this effect is compensated when the training set size is sufficiently large. As a reminder: CIFAR-10 contains 50000 train images, in contrast to STL-10 with only 4000 samples.

5.2 EIGENVALUE STRUCTURE OF DEEPLDA REPRESENTATIONS

DeepLDA optimization does not focus on maximizing the target class likelihood of individual samples.
As proposed in Section 3, we encourage the net to learn feature representations with discriminative distribution parameters (within- and between-class scatter). We achieve this by exploiting the eigenvalue structure of the general LDA eigenvalue problem and using it as a deep learning objective. Figure 4a shows the evolution of train and test set accuracy on STL-10 along with the mean value of all eigenvalues in the respective training epoch. We observe the expected natural correlation between the magnitude of explained discriminative variance (separation) and the classification potential of the resulting representation. In Figure 4b we show how the individual eigenvalues increase during training. Note that in epoch 0 almost all eigenvalues (1-7) start at a value of 0. This emphasizes the importance of the design of our objective function (compare Equation (9)), which pushes discriminability into the lower dimensions of the eigen-space. In Figure 4c we additionally compare the eigenvalue structure of the latent representation produced by DeepLDA with CCE-based training. Again the results show that DeepLDA helps to distribute the discriminative variance more equally over the available dimensions. To give the reader an additional intuition on the learned representations, we visualize the latent space of STL-10 in our supplemental materials on the final page of this paper.

Figure 4: Eigenvalue structure of the general LDA eigenvalue problem during training of a DeepLDA network on STL-10 (Method-4k). (a) Evolution of classification accuracy along with the magnitude of explained discriminative variance (separation) in the latent representation of the network (eigenvalues vs. accuracy). (b) Evolution of the individual eigenvalues during training. (c) Comparison of the eigenvalue structure of a net trained with CCE and with DeepLDA (for better comparability the maximum eigenvalue is normalized to one).

6 CONCLUSION

We have presented DeepLDA, a deep neural network interpretation of linear discriminant analysis. DeepLDA learns linearly separable latent representations in an end-to-end fashion by maximizing the eigenvalues of the general LDA eigenvalue problem. Our modified version of the LDA optimization target pushes the network to distribute discriminative variance across all dimensions of the latent feature space. Experimental results show that representations learned with DeepLDA are discriminative and have a positive effect on classification accuracy. Our DeepLDA models achieve competitive results on MNIST and CIFAR-10 and outperform CCE in a fully supervised setting of STL-10 by more than 9 percentage points of test set accuracy.
The results and further investigations suggest that DeepLDA performs best when applied to reasonably sized images (in the present case 96x96 pixels). Finally, we see DeepLDA as a specific instance of a general fruitful strategy: exploit well-understood machine learning or classification models such as LDA with certain desirable properties, and use deep networks to learn representations that provide optimal conditions for these models.

ACKNOWLEDGMENTS

We would like to thank Sepp Hochreiter for helpful discussions, and the three anonymous reviewers for extremely helpful (and partly challenging) remarks. We would also like to thank all developers of Theano (Bergstra et al., 2010) and Lasagne (Dieleman et al., 2015) for providing such great deep learning frameworks. The research reported in this paper has been supported by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry of Science, Research and Economy, and the Province of Upper Austria in the frame of the COMET center SCCH. The Tesla K40 used for this research was donated by the NVIDIA Corporation.

REFERENCES

Andrew, Galen, Arora, Raman, Bilmes, Jeff, and Livescu, Karen. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning, pp. 1247-1255, 2013.

Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.

Chan, Tsung-Han, Jia, Kui, Gao, Shenghua, Lu, Jiwen, Zeng, Zinan, and Ma, Yi. PCANet: A simple deep learning baseline for image classification? IEEE Transactions on Image Processing, 24(12):5017-5032, 2015. doi: 10.1109/TIP.2015.2475625. URL http://dx.doi.org/10.1109/TIP.2015.2475625.

Clevert, Djork-Arné, Unterthiner, Thomas, Mayr, Andreas, Ramsauer, Hubert, and Hochreiter, Sepp. Rectified factor networks. In Advances in Neural Information Processing Systems, 2015.

Coates, Adam, Ng, Andrew Y, and Lee, Honglak. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pp. 215-223, 2011.

de Leeuw, Jan. Derivatives of generalized eigen systems with applications. 2007.

Dieleman, Sander, Schlueter, Jan, Raffel, Colin, Olson, Eben, Sønderby, Søren Kaae, Nouri, Daniel, Maturana, Daniel, Thoma, Martin, Battenberg, Eric, Kelly, Jack, De Fauw, Jeffrey, Heilman, Michael, diogo149, McFee, Brian, Weideman, Hendrik, takacsg84, peterderivaz, Jon, instagibbs, Rasul, Kashif, Cong Liu, Britefury, and Degrave, Jonas. Lasagne: First release, August 2015. URL http://dx.doi.org/10.5281/zenodo.27878.

Dosovitskiy, Alexey, Springenberg, Jost Tobias, Riedmiller, Martin, and Brox, Thomas. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 766-774, 2014.

Fisher, Ronald A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, Springer, Berlin, 2001.

Friedman, Jerome H. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165-175, 1989.
Goodfellow, Ian J, Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.

Graham, Benjamin. Spatially-sparse convolutional neural networks. CoRR, abs/1409.6070, 2014. URL http://arxiv.org/abs/1409.6070.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. CoRR, abs/1312.4400, 2013. URL http://arxiv.org/abs/1312.4400.

Lu, Juwei, Plataniotis, Konstantinos N, and Venetsanopoulos, Anastasios N. Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition. Pattern Recognition Letters, 26(2):181-191, 2005.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Stuhlsatz, Andre, Lippel, Jens, and Zielke, Thomas. Feature extraction with deep neural networks by a generalized discriminant analysis. IEEE Transactions on Neural Networks and Learning Systems, 23(4):596-608, 2012.

Swersky, Kevin, Snoek, Jasper, and Adams, Ryan P. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems, pp. 2004-2012, 2013.

Wang, Weiran, Arora, Raman, Livescu, Karen, and Bilmes, Jeff. On deep multi-view representation learning. In ICML, 2015a.

Wang, Weiran, Arora, Raman, Livescu, Karen, and Bilmes, Jeff A. Unsupervised learning of acoustic features via deep canonical correlation analysis. In Proceedings of ICASSP, 2015b.

Zhao, Junbo, Mathieu, Michael, Goroshin, Ross, and Lecun, Yann. Stacked what-where auto-encoders. arXiv preprint arXiv:1506.02351, 2015.

APPENDIX A: GRADIENT OF DEEPLDA-LOSS

To train with back-propagation we provide the partial derivatives of the optimization target $l(H)$ proposed in Equation (9) with respect to the topmost hidden representation $H$ (which contains samples as rows and features as columns). As a reminder, the DeepLDA objective focuses on maximizing the $k$ smallest eigenvalues $v_i$ of the generalized LDA eigenvalue problem. In particular, we consider only the $k$ eigenvalues that do not exceed a certain threshold for optimization:

$$l(H) = \frac{1}{k} \sum_{i=1}^{k} v_i \quad \text{with} \quad \{v_1, \ldots, v_k\} = \{v_j \mid v_j < \min\{v_1, \ldots, v_{C-1}\} + \epsilon\} \qquad (11)$$

For convenience, we change the subscripts of the scatter matrices to superscripts in this section (e.g. $S_t \rightarrow S^t$). $S^t_{ij}$ addresses the element in row $i$ and column $j$ of matrix $S^t$.
Starting from the formulation of the generalized LDA eigenvalue problem

$$S^b e_i = v_i S^w e_i \qquad (12)$$

the derivative of eigenvalue $v_i$ with respect to the hidden representation $H$ is given in (de Leeuw, 2007) as

$$\frac{\partial v_i}{\partial H} = e_i^T \left( \frac{\partial S^b}{\partial H} - v_i \frac{\partial S^w}{\partial H} \right) e_i \qquad (13)$$

Recalling the definitions of the LDA scatter matrices from Section 3.1,

$$S^c = \frac{1}{N_c - 1} \bar{X}_c^T \bar{X}_c \qquad S^w = \frac{1}{C} \sum_{c=1}^{C} S^c \qquad (14)$$

$$S^t = \frac{1}{N - 1} \bar{X}^T \bar{X} \qquad S^b = S^t - S^w \qquad (15)$$

we can write the partial derivative of the total scatter matrix $S^t$ with respect to the hidden representation $H$ (Andrew et al., 2013; Stuhlsatz et al., 2012) as:

$$\frac{\partial S^t_{ab}}{\partial H_{ij}} =
\begin{cases}
\frac{2}{N-1}\left(H_{ij} - \frac{1}{N}\sum_{n} H_{nj}\right) & \text{if } a = j,\; b = j \\
\frac{1}{N-1}\left(H_{ib} - \frac{1}{N}\sum_{n} H_{nb}\right) & \text{if } a = j,\; b \neq j \\
\frac{1}{N-1}\left(H_{ia} - \frac{1}{N}\sum_{n} H_{na}\right) & \text{if } a \neq j,\; b = j \\
0 & \text{if } a \neq j,\; b \neq j
\end{cases} \qquad (16)$$

The derivatives for the individual class covariance matrices $S^c$ are defined analogously to Equation (16) for the $C$ classes, and we can write the partial derivatives of $S^w$ and $S^b$ with respect to the latent representation $H$ as:

$$\frac{\partial S^w_{ab}}{\partial H_{ij}} = \frac{1}{C} \sum_{c=1}^{C} \frac{\partial S^c_{ab}}{\partial H_{ij}} \qquad \text{and} \qquad \frac{\partial S^b_{ab}}{\partial H_{ij}} = \frac{\partial S^t_{ab}}{\partial H_{ij}} - \frac{\partial S^w_{ab}}{\partial H_{ij}} \qquad (17)$$

The partial derivative of the loss function introduced in Section 3.3 with respect to the hidden state $H$ is then given by:

$$\frac{\partial l(H)}{\partial H} = \frac{1}{k} \sum_{i=1}^{k} \frac{\partial v_i}{\partial H} \qquad (18)$$

APPENDIX B: DEEPLDA LATENT REPRESENTATION

Figure 5 shows the latent space representations of the STL-10 dataset (Method-4k) as n-to-n scatter plots of the latent features of the first 1000 test set samples. We plot the test set samples after projection into the $C - 1$ dimensional DeepLDA feature space. The plots suggest that DeepLDA makes use of all available feature dimensions. An interesting observation is that many of the internal representations are orthogonal to each other (which is an implication of LDA). This of course favours linear decision boundaries.

Figure 5: STL-10 latent representation produced by DeepLDA (n-to-n scatter plots of the latent features of the first 1000 test set samples; e.g. top left plot: latent feature 1 vs. latent feature 2).
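As a closing numerical sanity check for Appendix A, the analytic derivative of a single eigenvalue with respect to one entry of H, assembled from Equations (13), (16) and (17), can be compared against a central finite difference. The sketch below assumes NumPy/SciPy and our own helper names; it is independent of the authors' Theano implementation:

```python
# Numerical check of the eigenvalue derivative (Eq. (13) together with Eq. (16)-(17)).
import numpy as np
from scipy.linalg import eigh

def scatter_matrices(H, y, lam=1e-3):
    """Return S_b and the regularized S_w + lam*I for hidden features H with labels y."""
    N, d = H.shape
    classes = np.unique(y)
    Sw = np.zeros((d, d))
    for c in classes:
        Xc = H[y == c] - H[y == c].mean(axis=0)
        Sw += Xc.T @ Xc / (len(Xc) - 1)
    Sw /= len(classes)
    Xt = H - H.mean(axis=0)
    St = Xt.T @ Xt / (N - 1)
    return St - Sw, Sw + lam * np.eye(d)

def d_scatter(X, p, q):
    """d S / d X[p, q] for a scatter matrix S = Xc^T Xc / (n - 1), with Xc = X - mean(X)."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    D = np.zeros((d, d))
    D[q, :] += Xc[p] / (n - 1)              # rows with a = q (cases of Eq. (16))
    D[:, q] += Xc[p] / (n - 1)              # columns with b = q (doubles the (q, q) entry)
    return D

rng = np.random.default_rng(0)
N, d, C, lam = 200, 5, 4, 1e-3
H = rng.normal(size=(N, d))
y = rng.integers(0, C, size=N)

Sb, Swr = scatter_matrices(H, y, lam)
vals, vecs = eigh(Sb, Swr)                  # ascending; vecs.T @ Swr @ vecs = identity
i, p, q = -1, 3, 2                          # largest eigenvalue, one entry H[p, q]

# Analytic gradient: S_t depends on all samples, S_w only through the class of sample p.
dSt = d_scatter(H, p, q)
members = np.flatnonzero(y == y[p])
dSw = d_scatter(H[members], members.tolist().index(p), q) / C
dSb = dSt - dSw
grad = vecs[:, i] @ (dSb - vals[i] * dSw) @ vecs[:, i]      # Eq. (13)

# Central finite difference through the whole pipeline.
eps = 1e-5
Hp, Hm = H.copy(), H.copy()
Hp[p, q] += eps
Hm[p, q] -= eps
vp = eigh(*scatter_matrices(Hp, y, lam), eigvals_only=True)[i]
vm = eigh(*scatter_matrices(Hm, y, lam), eigvals_only=True)[i]
print(grad, (vp - vm) / (2 * eps))          # the two values should closely agree
```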