# Discriminative Analysis Dictionary Learning

Jun Guo1, Yanqing Guo1, Xiangwei Kong1, Man Zhang2, and Ran He2,3

1 School of Information and Communication Engineering, Dalian University of Technology, Dalian 116024, China
2 The Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
3 CAS Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Beijing 100190, China

guojun@mail.dlut.edu.cn, {guoyq,kongxw}@dlut.edu.cn, {zhangman,rhe}@nlpr.ia.ac.cn

Abstract

Dictionary learning (DL) has been successfully applied to various pattern classification tasks in recent years. However, analysis dictionary learning (ADL), as a major branch of DL, has not yet been fully exploited in classification due to its poor discriminability. This paper presents a novel DL method, namely Discriminative Analysis Dictionary Learning (DADL), to improve the classification performance of ADL. First, a code consistent term is integrated into the basic analysis model to improve discriminability. Second, a triplet-constraint-based local topology preserving loss function is introduced to capture the discriminative geometrical structures embedded in data. Third, the correntropy induced metric is employed as a robust measure to better control outliers for classification. Then, half-quadratic minimization and an alternate search strategy are used to speed up the optimization process, so that closed-form solutions exist at each alternating minimization stage. Experiments on several commonly used databases show that our proposed method not only significantly improves the discriminative ability of ADL, but also outperforms state-of-the-art synthesis DL methods.

1 Introduction

The success of sparse representation (SR) has pushed forward the research of dictionary learning (DL). In SR, a dictionary learned from data often outperforms a set of predefined bases in pattern classification tasks. One popular line of research in DL aims to learn a synthesis dictionary with specific promotion functions. Synthesis dictionary learning is widely used but time-consuming. Hence, as its dual model, the analysis model has drawn much attention recently (Rubinstein, Bruckstein, and Elad 2010). Analysis dictionary learning (ADL) aims to learn a transformation, instead of utilizing off-the-shelf transformations like the FFT or DCT, in such a way that the resulting representation of a signal is sparse (Shekhar, Patel, and Chellappa 2014).

In recent years, several ADL methods have been developed. Ravishankar and Bresler (Ravishankar and Bresler 2013) proposed well-conditioned square transformations for image denoising, while Shekhar et al. (Shekhar, Patel, and Chellappa 2014) enhanced this method by imposing a full-rank constraint on the analysis dictionary. Rubinstein and Elad (Rubinstein and Elad 2014) imposed a hard thresholding operator on the analysis codes, leading to sparse representations for synthesis reconstruction. Gu et al. (Gu et al. 2014) combined an analysis class-specific dictionary and a synthesis class-specific dictionary for classification. All of these methods benefit from ADL's simpler training optimization and faster testing.
However, to the best of our knowledge, few works address the discriminability of ADL, possibly because of its poor discriminability in pattern classification tasks. Inspired by the successful attempts in synthesis dictionary learning, we further integrate structure-preserving and discriminative characteristics into the basic analysis model. We simply utilize the classical k-Nearest-Neighbor (kNN) classifier, which performs well without any training effort. Nevertheless, it critically relies on the local topological structures (known as the topology property in (Luo et al. 2011)) as well as the discriminability of data. Hence, it is extremely important to simultaneously exploit the underlying geometrical structures and the discriminability of data when applying the kNN classifier to DL-based classification tasks.

The main contributions of this paper are as follows:
- We explicitly introduce discriminative information into the analysis dictionary learning framework via a code consistent term. The learned analysis dictionary can then exploit the discriminability of data instead of merely representing data well.
- We employ triplet constraints to capture the underlying discriminative local structures of data, resulting in a novel local topology preserving loss function. This loss function preserves the relative neighborhood proximities in a supervised manner.
- We utilize the correntropy induced metric (CIM) as a robust measure to handle outliers and noise. Consequently, we develop an alternating optimization algorithm based on the alternate search strategy and half-quadratic (HQ) minimization.
- Extensive comparison experiments validate the encouraging gain in pattern classification from our method, and demonstrate that analysis models can outperform synthesis models in pattern classification tasks if discriminative information is well treated.

2 Preliminaries

2.1 Notation Summary

Bold uppercase letters ($\mathbf{U}$, $\mathbf{V}$, ...) stand for matrices. Bold lowercase letters ($\mathbf{u}$, $\mathbf{v}$, ...) are vectors, while lowercase letters ($u$, $v$, ...) are scalars. $\mathrm{Tr}(\mathbf{U})$, $\mathbf{U}^{-1}$ and $\mathbf{U}^T$ denote the trace, inverse and transpose of $\mathbf{U}$, respectively. $\mathbf{U}_i$ is the $i$-th row of $\mathbf{U}$, while $\mathbf{U}^j$ denotes the $j$-th column of $\mathbf{U}$. $U_{ij}$ is the $j$-th element in the $i$-th row of $\mathbf{U}$. $\|\mathbf{U}\|_F$ and $\|\mathbf{U}\|_0$ denote the Frobenius norm ($\sqrt{\sum_{i,j} U_{ij}^2}$) and the $\ell_0$ norm (number of nonzero entries), respectively. $\mathbf{U} \odot \mathbf{V}$ is the Hadamard product (element-wise multiplication) of two matrices of identical size. Moreover, $\mathbf{0}$, $\mathbf{1}$ and $\mathbf{I}$ denote the all-zeros matrix, the all-ones matrix and the identity matrix of appropriate sizes, respectively.

2.2 Dictionary Learning (DL)

$\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n] \in \mathbb{R}^{m_1 \times n}$ denotes the original data matrix. The core idea of DL is to learn an optimized dictionary which can effectively represent each sample $\mathbf{y}_i \in \mathbb{R}^{m_1}$. Let $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{m_2 \times n}$ be the coding coefficients of $\mathbf{Y}$ over the learned dictionary.

Synthesis dictionary learning: Based on the classical synthesis sparse model, most existing DL methods aim to learn a synthesis dictionary $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_{m_2}] \in \mathbb{R}^{m_1 \times m_2}$ by solving

$$\min_{\mathbf{D}, \mathbf{X}} \|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2 \quad \text{s.t. } \mathbf{D} \in \mathcal{D},\ \|\mathbf{x}_i\|_0 \le T_0,\ i = 1, 2, \ldots, n \qquad (1)$$

where $\|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2$ stands for the reconstruction error, $T_0$ is a positive integer that controls the sparsity level, and $\mathcal{D}$ is a set of constraints on $\mathbf{D}$ for a well-regularized solution.
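As a toy companion to the synthesis model (1), the following sketch (illustrative sizes and our own helper names, not the paper's training algorithm) evaluates the reconstruction error for a fixed dictionary and enforces the $\ell_0$ constraint by keeping only the $T_0$ largest-magnitude entries of each code.

```python
import numpy as np

def synthesis_objective(Y, D, X):
    """Reconstruction error ||Y - D X||_F^2 of the synthesis model (1)."""
    return np.linalg.norm(Y - D @ X, 'fro') ** 2

def project_sparsity(X, T0):
    """Keep the T0 largest-magnitude entries of each code x_i, so ||x_i||_0 <= T0."""
    Xs = np.zeros_like(X)
    idx = np.argsort(-np.abs(X), axis=0)[:T0]   # indices of the top-T0 rows per column
    cols = np.arange(X.shape[1])
    Xs[idx, cols] = X[idx, cols]
    return Xs

# toy sizes: m1-dimensional samples, n of them, m2 atoms
m1, m2, n, T0 = 20, 30, 50, 5
rng = np.random.default_rng(0)
Y = rng.standard_normal((m1, n))
D = rng.standard_normal((m1, m2))
D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms, one common choice for the set D
X = project_sparsity(D.T @ Y, T0)               # crude sparse codes, for illustration only
print(synthesis_objective(Y, D, X))
```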
In pattern classification applications, various task-specific regularizations are added to this objective function, e.g., structured incoherence of the dictionary (Ramirez, Sprechmann, and Sapiro 2010), transform-invariance of the dictionary (Zhang et al. 2014), and joint dictionary learning and subspace clustering (Zhang, He, and Davis 2014).

Analysis dictionary learning: As the dual viewpoint of the commonly used synthesis dictionary learning, analysis dictionary learning (ADL) has an intuitive interpretation as a feature transformation (e.g., the DWT). It aims to learn an analysis dictionary $\mathbf{\Omega} \in \mathbb{R}^{m_2 \times m_1}$ by solving

$$\min_{\mathbf{\Omega}, \mathbf{X}} \|\mathbf{X} - \mathbf{\Omega}\mathbf{Y}\|_F^2 \quad \text{s.t. } \mathbf{\Omega} \in \mathcal{W},\ \|\mathbf{x}_i\|_0 \le T_0,\ i = 1, 2, \ldots, n \qquad (2)$$

where $\mathcal{W}$ is a set of constraints on $\mathbf{\Omega}$ that makes the solution non-trivial. As indicated in (Shekhar, Patel, and Chellappa 2014), $\mathcal{W}$ can be the set of matrices with either a relatively small Frobenius norm or unit row-wise norm. However, this model has poor discriminability for pattern classification. Inspired by the meaningful attempts of conventional synthesis DL methods for classification tasks, we integrate two significant functions with (2) to simultaneously exploit the discriminative geometrical structures of $\mathbf{Y}$ and promote the discriminability of $\mathbf{X}$. We postpone the detailed discussion of our design until the next section.

2.3 Correntropy

Different from the mean square error (MSE), which is a global similarity measure, correntropy is a local measure between two variables and is more robust to outliers. Correntropy is directly related to Renyi's quadratic entropy, in which Parzen windowing (a non-parametric estimation method) is employed to estimate the data's probability distribution.

Non-parametric Renyi's entropy: Renyi's quadratic entropy is often used to measure how regular a data set is. Suppose that all $\mathbf{y}_i$ in the aforementioned data set $\mathbf{Y}$ are independently and identically drawn from the probability density function $p(\mathbf{y})$. A non-parametric estimator of $\mathbf{Y}$'s Renyi's quadratic entropy can be calculated as

$$H_R(\mathbf{Y}) = -\log \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K_\sigma(\mathbf{y}_i, \mathbf{y}_j), \qquad (3)$$

where $K_\sigma(\mathbf{y}_i, \mathbf{y}_j) = \exp\left(-\|\mathbf{y}_i - \mathbf{y}_j\|_2^2 / \sigma^2\right)$ is the Gaussian kernel density estimate in the Parzen windowing method. Following similar arguments, we compute Renyi's cross-entropy between two sets $\mathbf{Y}$ and $\mathbf{Y}'$ (Yuan and Hu 2009):

$$H_R(\mathbf{Y}; \mathbf{Y}') = -\log \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K_\sigma(\mathbf{y}_i, \mathbf{y}'_j). \qquad (4)$$

Correntropy Induced Metric (CIM): Based on the above concepts and a finite number of data $\{(\mathbf{y}_i, \mathbf{y}'_i)\}_{i=1}^{n}$, the correntropy of $(\mathbf{Y}, \mathbf{Y}')$ is defined as

$$\hat{V}_\sigma(\mathbf{Y}, \mathbf{Y}') = \frac{1}{n} \sum_{i=1}^{n} K_\sigma(\mathbf{y}_i, \mathbf{y}'_i). \qquad (5)$$

It is obvious that the value of correntropy is primarily decided by the Gaussian kernel function along the line $\mathbf{Y} = \mathbf{Y}'$. Liu et al. (Liu, Pokharel, and Principe 2007) extended the concept of correntropy to the Correntropy Induced Metric (CIM), a general similarity measure between any two vectors $(\mathbf{u}, \mathbf{v})$ of the same length. It is defined as

$$\mathrm{CIM}(\mathbf{u}, \mathbf{v}) = \left[K_\sigma(0) - K_\sigma(\mathbf{u}, \mathbf{v})\right]^{1/2} = \left[1 - \exp\left(-\|\mathbf{u} - \mathbf{v}\|_2^2 / \sigma^2\right)\right]^{1/2}. \qquad (6)$$

CIM is proved to obey all the properties of a distance metric, such as nonnegativity, identity of indiscernibles, symmetry and the triangle inequality. Hence, the correntropy induced metric can replace other commonly used distance metrics. The robustness and effectiveness of correntropy have been verified in principal component analysis (He et al. 2011), feature selection (He et al. 2012), subspace clustering (Lu et al. 2013), and sparse representation (He et al. 2014). We employ this concept in our paper since it contributes to classification by well controlling outliers. To the best of our knowledge, there is no such work for analysis-dictionary-based pattern classification tasks.
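A minimal numpy sketch of the CIM in (6) follows; the function names and toy vectors are ours, not from the paper. It also illustrates the local (bounded) behavior that makes CIM robust to outliers, in contrast to the unbounded $\ell_2$ distance.

```python
import numpy as np

def gaussian_kernel(u, v, sigma):
    """K_sigma(u, v) = exp(-||u - v||_2^2 / sigma^2)."""
    return np.exp(-np.sum((u - v) ** 2) / sigma ** 2)

def cim(u, v, sigma=1.0):
    """Correntropy induced metric: CIM(u, v) = sqrt(K_sigma(0) - K_sigma(u, v)), with K_sigma(0) = 1."""
    return np.sqrt(1.0 - gaussian_kernel(u, v, sigma))

u = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 2.0, 30.0])             # one grossly corrupted entry
print(cim(u, v), np.linalg.norm(u - v))    # CIM saturates near 1, while the l2 distance blows up
```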
3 The Proposed DADL Method

In this section, we introduce our proposed method, which incorporates the original data's discriminative information and local topological property into a unified analysis dictionary learning framework.

3.1 Code Consistent Term

To introduce discriminative information into analysis dictionary learning, we design a sufficiently sparse matrix $\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_n] \in \mathbb{R}^{m_2 \times n}$ as target codes. Each column of $\mathbf{H}$ is a sparse vector $\mathbf{h}_i$, which acts as the desired form of the sparse code $\mathbf{x}_i$ and also indicates the class label of $\mathbf{y}_i$. Then, a code consistent term $\|\mathbf{X} - \mathbf{H}\|_F^2$ is added to (2), which can fully exploit the discrimination embedded in data. This code consistent term enforces samples from the same category to have similar sparse codes.

There are two main ways to generate target codes, i.e., unsupervised and supervised manners. For the unsupervised design approach, the sparse codes for each class should be unique; random binary codes, Hadamard codes (Yang et al. 2015), and the unsupervised version of Iterative Quantization are good alternatives. For the supervised design approach, the target output of multi-class linear regression is a nice choice, which is also the spectral matrix in linear discriminant analysis (LDA). In the spectral matrix, the code vector of any class has the form $[0, \ldots, 0, 1, 0, \ldots, 0]^T$, whose non-zero position indicates the category; such codes are obviously sparse. However, this kind of target code has a fixed length equal to the total number of classes, whereas for analysis dictionary learning the length of each sparse code is often larger than the number of classes. To solve this problem, we employ the Kronecker product to generate high-dimensional codes: the Kronecker product of the original spectral codes and an all-ones vector of appropriate size is called the Kron-form spectral codes. In this paper, we focus on the discriminability of analysis dictionary learning rather than on how to choose better target codes. Therefore, for simplicity, we generate longer sparse target codes by concatenating the Kron-form spectral codes with the sequency (Walsh-ordered) Hadamard codes (see the illustrative sketch below).

3.2 Local Topology Preserving Loss Function

The local topology property describes the data's local structures. Besides the neighborhood relationships, it emphasizes the ranking information of each data point's neighbors. The relative neighborhood proximities, as well as the discriminability of data, play a vital role in kNN classification. Different from conventional unsupervised similarity preserving loss functions based on pairwise/doublet constraints, we propose a local topology preserving loss function via triplet constraints in a supervised manner.

Definition 1. (Luo et al. 2011) Let $(\mathbf{y}_i, \mathbf{y}_u, \mathbf{y}_v)$ be a triplet comprised of $\mathbf{y}_i$ and its neighbors $\mathbf{y}_u$ and $\mathbf{y}_v$. Their corresponding codes also form a triplet $(\mathbf{x}_i, \mathbf{x}_u, \mathbf{x}_v)$. $\mathrm{dist}(\cdot, \cdot)$ is a function that returns the pairwise distance of its inputs. Then, a coding process is called local topology preserving when the following condition holds: if $\mathrm{dist}(\mathbf{y}_i, \mathbf{y}_u) \le \mathrm{dist}(\mathbf{y}_i, \mathbf{y}_v)$, then $\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_u) \le \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_v)$.

Based on Definition 1, determining appropriate $\{\mathbf{x}_u, \mathbf{x}_v\}$ for $\mathbf{x}_i$ is identical to optimizing

$$\max_{\mathbf{x}_u, \mathbf{x}_v} \mathbf{A}_i(u, v) \left[\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_u) - \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_v)\right], \qquad (7)$$

where $\mathbf{A}_i$ is an antisymmetric matrix whose $(u, v)$-th element equals $\mathrm{dist}(\mathbf{y}_i, \mathbf{y}_u) - \mathrm{dist}(\mathbf{y}_i, \mathbf{y}_v)$. However, (7) is unsupervised. In consideration of each sample's category, we further develop a supervised loss by replacing $\mathbf{A}_i$ with $\mathbf{A}'_i$ in (7).
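As an aside before defining the supervised matrix $\mathbf{A}'_i$, the target codes $\mathbf{H}$ of Section 3.1 can be generated in a few lines. The sketch below is ours (illustrative sizes, hypothetical function name) and uses scipy's natural-ordered `hadamard` as a stand-in for the sequency (Walsh-ordered) Hadamard codes mentioned in the text.

```python
import numpy as np
from scipy.linalg import hadamard

def target_codes(labels, n_classes, repeat, had_dim):
    """Build sparse target codes H, one column per sample (illustrative construction)."""
    E = np.eye(n_classes)                              # one-hot "spectral" codes, one column per class
    kron_codes = np.kron(E, np.ones((repeat, 1)))      # Kron-form spectral codes: each one-hot entry repeated
    # one Hadamard row per class; natural order is used here as a stand-in for the
    # sequency / Walsh-ordered codes described in the paper
    Hd = hadamard(had_dim)[1:n_classes + 1].T          # had_dim x n_classes, skipping the all-ones row
    class_codes = np.vstack([kron_codes, Hd])          # m2 x n_classes
    return class_codes[:, labels]                      # m2 x n

labels = np.array([0, 0, 1, 2, 1])                     # toy labels for 3 classes
H = target_codes(labels, n_classes=3, repeat=4, had_dim=8)
print(H.shape)                                         # (4*3 + 8, 5)
```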
The $(u, v)$-th element of $\mathbf{A}'_i$ is defined as

$$\mathbf{A}'_i(u, v) \triangleq \begin{cases} -\mathbf{A}_i(u, v)\,\mathrm{sign}\left[\mathbf{A}_i(u, v)\right], & \mathbf{h}_i = \mathbf{h}_u \neq \mathbf{h}_v \\ \mathbf{A}_i(u, v)\,\mathrm{sign}\left[\mathbf{A}_i(u, v)\right], & \mathbf{h}_i = \mathbf{h}_v \neq \mathbf{h}_u \\ \mathbf{A}_i(u, v), & \text{otherwise} \end{cases} \qquad (8)$$

where $\mathrm{sign}(a) = -1$ for $a < 0$, $0$ for $a = 0$, and $+1$ for $a > 0$ is the sign function. It is obvious that $\mathbf{A}'_i$ is also an antisymmetric matrix. We obtain the supervised local topology preserving loss function

$$\min_{\mathbf{X}} \; -\sum_{i=1}^{n} \sum_{u=1}^{n} \sum_{v=1}^{n} \mathbf{A}'_i(u, v) \left[\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_u) - \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_v)\right]. \qquad (9)$$

Proposition 1. Let $\mathbf{W} \in \mathbb{R}^{n \times n}$ be a weighting matrix whose $(i, j)$-th element equals $\sum_{u=1}^{n} \mathbf{A}'_i(u, j)$. Objective (9) is equivalent to $\min_{\mathbf{X}} \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij}\,\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j)$.

Proof. Recall that $\mathbf{A}'_i$ is an antisymmetric matrix, so $\mathbf{A}'_i(u, v) = -\mathbf{A}'_i(v, u)$. Then (9) is equivalent to

$$\min_{\mathbf{X}} \sum_{i=1}^{n} \left[\sum_{u=1}^{n} \sum_{v=1}^{n} \mathbf{A}'_i(v, u)\,\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_u) + \sum_{u=1}^{n} \sum_{v=1}^{n} \mathbf{A}'_i(u, v)\,\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_v)\right], \qquad (10)$$

which can also be written as

$$\min_{\mathbf{X}} \sum_{i=1}^{n} \left[\sum_{u=1}^{n} W_{iu}\,\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_u) + \sum_{v=1}^{n} W_{iv}\,\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_v)\right]. \qquad (11)$$

The ultimate form $\min_{\mathbf{X}} \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij}\,\mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j)$ can be easily derived from (11).

To simultaneously preserve the neighborhood ranking information as well as the neighborhood relationships in a supervised manner, we employ (12) to calculate $W_{ij}$:

$$W_{ij} = \begin{cases} \sum_{\mathbf{y}_u \in N_i} \mathbf{A}'_i(u, j), & \mathbf{y}_j \in N_i \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$

where $N_i$ is the set containing the $k$ nearest neighbors of $\mathbf{y}_i$. In practice, the non-zero $W_{ij}$ are normalized to $[0, 1]$ by row. Considering the important role of $\mathbf{\Omega}$ in the coding process, we substitute $\mathbf{x}_i$ with $\mathbf{\Omega}\mathbf{y}_i$ in the loss function (9) so that $\mathbf{\Omega}$ can be learned directly, which is demonstrated to be efficient and effective in our experiments.

3.3 Correntropy Induced Objective Function

As indicated in (Chen and Principe 2012; Chen et al. 2014; 2015), correntropy can contribute to pattern classification by well controlling outliers. Hence, we apply the CIM (6) as a robust metric and obtain the following correntropy induced objective function:

$$\min_{\mathbf{\Omega}, \mathbf{X}} J = J_0 + \lambda_1 J_1 + \lambda_2 J_2 \quad \text{s.t. } \mathbf{\Omega} \in \mathcal{W},\ \|\mathbf{x}_i\|_0 \le T_0,\ \forall i \qquad (13)$$

$$J_0 = \sum_{i=1}^{n} \left[1 - \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{\Omega}\mathbf{y}_i\|_2^2}{\sigma^2}\right)\right],\quad J_1 = \sum_{i=1}^{n} \left[1 - \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{h}_i\|_2^2}{\sigma^2}\right)\right],\quad J_2 = \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} \left[1 - \exp\left(-\frac{\|\mathbf{\Omega}\mathbf{y}_i - \mathbf{\Omega}\mathbf{y}_j\|_2^2}{\sigma^2}\right)\right], \qquad (14)$$

where $\lambda_1$ and $\lambda_2$ are scalar constants which control the relative importance of the corresponding terms.

4 Optimization

4.1 Half-quadratic Technique

Problem (13) is not convex, so it is difficult to optimize directly. Fortunately, the half-quadratic (HQ) technique can be employed to optimize this non-convex function by alternately minimizing its augmented function. According to conjugate function theory and HQ theory (Nikolova and Ng 2005), we have

Lemma 1. Suppose that $f(z)$ is a function satisfying the conditions listed in (Nikolova and Ng 2005). Then, for a fixed $z$, there exists a dual potential function $\varphi(\cdot)$ such that

$$f(z) = \inf_{p \in \mathbb{R}} \left\{p z^2 + \varphi(p)\right\}, \qquad (15)$$

where $p$ is an auxiliary variable determined by the minimizer function $\delta(z)$ w.r.t. $f(z)$.

As indicated in (Zhang et al. 2013; He, Tan, and Wang 2014), $\delta(z) = \exp\left(-z^2/\sigma^2\right)$ when $f(z) = 1 - \exp\left(-z^2/\sigma^2\right)$. That is to say, the infimum of $f(z)$ for a fixed $z$ is reached at $p = \delta(z)$. According to Lemma 1, the augmented function $\hat{J}$ of (13) takes the following form:

$$\min_{\mathbf{\Omega}, \mathbf{X}, \mathbf{P}, \mathbf{Q}, \mathbf{R}} \hat{J} = \hat{J}_0 + \lambda_1 \hat{J}_1 + \lambda_2 \hat{J}_2 \quad \text{s.t. } \mathbf{\Omega} \in \mathcal{W},\ \|\mathbf{x}_i\|_0 \le T_0,\ \forall i \qquad (16)$$

$$\hat{J}_0 = \sum_{i=1}^{n} \left[P_{ii} \frac{\|\mathbf{x}_i - \mathbf{\Omega}\mathbf{y}_i\|_2^2}{\sigma^2} + \phi_i(P_{ii})\right],\quad \hat{J}_1 = \sum_{i=1}^{n} \left[Q_{ii} \frac{\|\mathbf{x}_i - \mathbf{h}_i\|_2^2}{\sigma^2} + \varphi_i(Q_{ii})\right],\quad \hat{J}_2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \left[W_{ij} R_{ij} \frac{\|\mathbf{\Omega}\mathbf{y}_i - \mathbf{\Omega}\mathbf{y}_j\|_2^2}{\sigma^2} + W_{ij} \psi_{ij}(R_{ij})\right] \qquad (17)$$

The $n \times n$ matrices $\mathbf{P}$, $\mathbf{Q}$ and $\mathbf{R}$ store the auxiliary variables introduced by HQ. Note that $\mathbf{P}$ and $\mathbf{Q}$ are diagonal. $\{\phi_i\}_{i=1}^{n}$, $\{\varphi_i\}_{i=1}^{n}$ and $\{\psi_{ij}\}_{i,j=1}^{n}$ are the corresponding conjugate functions.

4.2 Optimization Procedure

Based on the HQ optimization theory, $\hat{J}(\mathbf{\Omega}, \mathbf{X}, \mathbf{P}, \mathbf{Q}, \mathbf{R})$ can be alternately minimized as follows:

1) Update the analysis dictionary and sparse codes:

$$\left(\mathbf{\Omega}^{t+1}, \mathbf{X}^{t+1}\right) = \arg\min_{\mathbf{\Omega}, \mathbf{X}} F = F_0 + \lambda_1 F_1 + \lambda_2 F_2 \quad \text{s.t. } \mathbf{\Omega} \in \mathcal{W},\ \|\mathbf{x}_i\|_0 \le T_0,\ \forall i \qquad (18)$$

where

$$F_0 = \mathrm{Tr}\left[(\mathbf{X} - \mathbf{\Omega}\mathbf{Y})\,\mathbf{P}^t (\mathbf{X} - \mathbf{\Omega}\mathbf{Y})^T\right],\quad F_1 = \mathrm{Tr}\left[(\mathbf{X} - \mathbf{H})\,\mathbf{Q}^t (\mathbf{X} - \mathbf{H})^T\right],\quad F_2 = \mathrm{Tr}\left[\mathbf{\Omega}\mathbf{Y}\mathbf{L}^{t+1}\mathbf{Y}^T\mathbf{\Omega}^T\right]. \qquad (19)$$
$\mathbf{L}^{t+1}$ is the Laplacian matrix (the normalized Laplacian is often used in practice) of the weighting matrix $\mathbf{W}^{t+1}$ in the $(t+1)$-th iteration. The weighting matrix is updated as $\mathbf{W}^{t+1} = \mathbf{W}^t \odot \mathbf{R}^t$. Let $\mathbf{C}^{t+1}$ be a diagonal matrix whose $(i, i)$-th element equals $\sum_{j=1}^{n} \frac{W^{t+1}_{ij} + W^{t+1}_{ji}}{2}$. We define the Laplacian matrix $\mathbf{L}^{t+1} \triangleq \mathbf{C}^{t+1} - \frac{\mathbf{W}^{t+1} + (\mathbf{W}^{t+1})^T}{2}$.

To solve (18) quickly, we prefer the constraint set $\mathcal{W}$ to be matrices with a relatively small Frobenius norm. Then, the alternate search strategy can be employed to alternately minimize (18) with respect to one variable while fixing the other.

To update $\mathbf{\Omega} \in \mathbb{R}^{m_2 \times m_1}$ with $\mathbf{X}$ fixed, we solve

$$\min_{\mathbf{\Omega}} \mathrm{Tr}\left[-2\mathbf{X}\mathbf{P}^t\mathbf{Y}^T\mathbf{\Omega}^T + \lambda_3 \mathbf{\Omega}\mathbf{\Omega}^T + \mathbf{\Omega}\mathbf{Y}\left(\mathbf{P}^t + \lambda_2 \mathbf{L}^{t+1}\right)\mathbf{Y}^T\mathbf{\Omega}^T\right], \qquad (20)$$

where $\lambda_3$ is the Lagrange multiplier for $\|\mathbf{\Omega}\|_F^2$. The analytical solution is computed by setting the first derivative to zero: $\mathbf{\Omega} = \mathbf{X}\mathbf{P}^t\mathbf{Y}^T \left[\mathbf{Y}\left(\mathbf{P}^t + \lambda_2 \mathbf{L}^{t+1}\right)\mathbf{Y}^T + \lambda_3 \mathbf{I}\right]^{-1}$.

To update $\mathbf{X} \in \mathbb{R}^{m_2 \times n}$ with $\mathbf{\Omega}$ fixed, we solve (21) for each column:

$$\min_{\mathbf{x}_i} \left\|\mathbf{x}_i - \frac{P^t_{ii}\mathbf{\Omega}\mathbf{y}_i + \lambda_1 Q^t_{ii}\mathbf{h}_i}{P^t_{ii} + \lambda_1 Q^t_{ii}}\right\|_2^2 \quad \text{s.t. } \|\mathbf{x}_i\|_0 \le T_0. \qquad (21)$$

The analytical solution is obtained by applying a hard thresholding operation: setting the smallest $m_2 - T_0$ elements (in magnitude) of $\frac{P^t_{ii}\mathbf{\Omega}\mathbf{y}_i + \lambda_1 Q^t_{ii}\mathbf{h}_i}{P^t_{ii} + \lambda_1 Q^t_{ii}}$ to $0$. The update for $\mathbf{X}$ can be efficiently implemented in parallel.

2) Update the auxiliary variables:

$$P^{t+1}_{ii} = \exp\left(-\frac{\|\mathbf{x}^{t+1}_i - \mathbf{\Omega}^{t+1}\mathbf{y}_i\|_2^2}{\sigma^2}\right) \qquad (22)$$

$$Q^{t+1}_{ii} = \exp\left(-\frac{\|\mathbf{x}^{t+1}_i - \mathbf{h}_i\|_2^2}{\sigma^2}\right) \qquad (23)$$

$$R^{t+1}_{ij} = \exp\left(-\frac{\|\mathbf{\Omega}^{t+1}\mathbf{y}_i - \mathbf{\Omega}^{t+1}\mathbf{y}_j\|_2^2}{\sigma^2}\right) \qquad (24)$$

As summarized in Algorithm 1, the above update steps are alternated until convergence.

Algorithm 1 Discriminative Analysis Dictionary Learning
Input: Training data Y and corresponding target codes H; number of nearest neighbors k for each sample; regularization parameters λ1, λ2 and λ3; Gaussian kernel parameter σ.
Output: The analysis dictionary Ω.
1: Set X(0) = H, P(0) = I, Q(0) = I, R(0) = 1 for initialization, t = 0;
2: while not converged do
3:   t ← t + 1;
4:   Compute the weighting matrix W(t) and its Laplacian matrix L(t);
5:   Update Ω(t) by solving (20);
6:   Update X(t) by solving (21) in parallel;
7:   Update P(t), Q(t) and R(t) via (22), (23) and (24);
8: end while

4.3 Convergence Analysis

According to Lemma 1, when $\mathbf{\Omega}$ and $\mathbf{X}$ are fixed, the following equation holds:

$$J(\mathbf{\Omega}, \mathbf{X}) = \inf_{\mathbf{P}, \mathbf{Q}, \mathbf{R}} \hat{J}(\mathbf{\Omega}, \mathbf{X}, \mathbf{P}, \mathbf{Q}, \mathbf{R}). \qquad (25)$$

It follows that

$$\min_{\mathbf{\Omega}, \mathbf{X}} J(\mathbf{\Omega}, \mathbf{X}) = \min_{\mathbf{\Omega}, \mathbf{X}, \mathbf{P}, \mathbf{Q}, \mathbf{R}} \hat{J}(\mathbf{\Omega}, \mathbf{X}, \mathbf{P}, \mathbf{Q}, \mathbf{R}). \qquad (26)$$

Therefore, minimizing $J(\mathbf{\Omega}, \mathbf{X})$ is equivalent to minimizing the augmented function $\hat{J}(\mathbf{\Omega}, \mathbf{X}, \mathbf{P}, \mathbf{Q}, \mathbf{R})$ on the enlarged domain. According to the properties of half-quadratic minimization (Nikolova and Ng 2005; Yuan and Hu 2009; He et al. 2014) and the alternate search strategy, we have

$$\hat{J}\left(\mathbf{\Omega}^{t+1}, \mathbf{X}^{t+1}, \mathbf{P}^{t+1}, \mathbf{Q}^{t+1}, \mathbf{R}^{t+1}\right) \le \hat{J}\left(\mathbf{\Omega}^{t+1}, \mathbf{X}^{t+1}, \mathbf{P}^{t}, \mathbf{Q}^{t}, \mathbf{R}^{t}\right) \le \hat{J}\left(\mathbf{\Omega}^{t}, \mathbf{X}^{t}, \mathbf{P}^{t}, \mathbf{Q}^{t}, \mathbf{R}^{t}\right). \qquad (27)$$

The objective function is thus non-increasing at each alternating minimization step. Moreover, according to the properties of correntropy (Liu, Pokharel, and Principe 2007), the objective function $J(\mathbf{\Omega}, \mathbf{X})$ in (13) is bounded below, and thus by (26) we obtain that $\hat{J}(\mathbf{\Omega}, \mathbf{X}, \mathbf{P}, \mathbf{Q}, \mathbf{R})$ is also bounded below. Consequently, we can conclude that $\hat{J}(\mathbf{\Omega}^t, \mathbf{X}^t, \mathbf{P}^t, \mathbf{Q}^t, \mathbf{R}^t)$ decreases step by step until Algorithm 1 converges.

5 Experiments

5.1 Datasets

We demonstrate the performance of our proposed method on five benchmark databases: two face datasets (Extended Yale B and AR), one object categorization dataset (Caltech 101), one scene categorization dataset (Scene 15), and one action recognition dataset (UCF 50). We use the features of these databases provided by Jiang (http://www.umiacs.umd.edu/~zhuolin/projectlcksvd.html) and Corso (http://www.cse.buffalo.edu/~jcorso/r/actionbank).
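Before turning to the individual datasets, the sketch below gives a compact, hedged illustration of one pass of Algorithm 1, i.e., the Section 4.2 updates (20)-(24) together with the W and L updates. All function names are ours, the matrices are dense numpy arrays, and two details of the paper's practice are omitted for brevity: row-normalizing the non-zero entries of W and using the normalized Laplacian.

```python
import numpy as np

def hard_threshold_cols(M, T0):
    """Keep the T0 largest-magnitude entries of each column, zero out the rest."""
    out = np.zeros_like(M)
    idx = np.argsort(-np.abs(M), axis=0)[:T0]
    cols = np.arange(M.shape[1])
    out[idx, cols] = M[idx, cols]
    return out

def dadl_iteration(Y, H, X, W, P, Q, R, lam1, lam2, lam3, sigma, T0):
    """One HQ / alternate-search pass mirroring updates (20)-(24) (toy dense version)."""
    # weighting matrix and Laplacian: W <- W o R, L = C - (W + W^T)/2
    W = W * R
    S = (W + W.T) / 2.0
    L = np.diag(S.sum(axis=1)) - S
    p, q = np.diag(P), np.diag(Q)                          # diagonals of P and Q
    # (20): closed-form update of the analysis dictionary
    G = Y @ (P + lam2 * L) @ Y.T + lam3 * np.eye(Y.shape[0])
    Omega = (X * p) @ Y.T @ np.linalg.inv(G)
    # (21): column-wise code update followed by hard thresholding
    OY = Omega @ Y
    X = hard_threshold_cols((OY * p + lam1 * H * q) / (p + lam1 * q), T0)
    # (22)-(24): half-quadratic auxiliary variables
    P = np.diag(np.exp(-np.sum((X - OY) ** 2, axis=0) / sigma ** 2))
    Q = np.diag(np.exp(-np.sum((X - H) ** 2, axis=0) / sigma ** 2))
    D2 = np.sum((OY[:, :, None] - OY[:, None, :]) ** 2, axis=0)   # pairwise squared code distances
    R = np.exp(-D2 / sigma ** 2)
    return Omega, X, W, P, Q, R
```

As in Algorithm 1, such a pass would be preceded by the initialization X = H, P = Q = I, R = 1 and repeated until convergence.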
The Extended Yale Face Database B (hereafter referred to as Yale B) (Georghiades, Belhumeur, and Kriegman 2001) includes 2,414 face images of 38 persons under 64 illumination conditions, which is challenging due to plentiful expressions and varying illumination conditions. All the original images are cropped to 192 × 168 pixels and then projected onto 504-dimensional vectors with a randomly generated matrix to obtain random-face features. We randomly select 32 images per person for training and use the rest for testing.

The AR face database (Martinez and Benavente 1998) contains a number of color face images from 126 people. Each person has 26 frontal face images taken during two sessions. This database includes frontal views of faces with different facial expressions, lighting conditions, and occlusions (sunglasses and scarves). All the images are cropped and scaled to 165 × 120. Following the standard evaluation protocol, a subset consisting of 2,600 images from 50 males and 50 females is used. The features used here are 540-dimensional random-face features. We randomly select 20 images per subject for training and the other 6 images for testing.

The Caltech 101 database (Li, Fergus, and Perona 2007) is comprised of 9,144 images from 102 classes, where each category has 31 to 800 images. We use the standard Bag-of-Features (BoF) + Spatial Pyramid Matching (SPM) framework (Lazebnik, Schmid, and Ponce 2006) for feature extraction. Dense SIFT descriptors are first extracted from 16 × 16 patches sampled on a grid with a 6-pixel step size. Then, we compute the SPM features with 1 × 1, 2 × 2, and 4 × 4 subregions. A vector-quantization-based coding method is used to extract mid-level features, and high-dimensional features are obtained by max pooling. Finally, the high-dimensional features are reduced to 3,000 dimensions by Principal Component Analysis (PCA). 30 images per category are randomly selected for training and the remaining ones for testing.

The fifteen scene dataset (Scene 15) was first introduced in (Lazebnik, Schmid, and Ponce 2006). The number of samples in each category ranges from 200 to 400, and the average image size is around 250 × 300 pixels. This database contains 15 scenes, such as kitchen, bedroom, and country scenes. Similar to the Caltech 101 features, the image features used here are generated by extracting dense SIFT descriptors in local regions, encoding the local patch features, max pooling over a spatial pyramid, and reducing the result to 3,000 dimensions by PCA. Following the common experimental settings, 100 images per category are randomly chosen as training data, with the rest as testing data.

The UCF 50 database (Reddy and Shah 2013) is an action recognition database with 50 action categories, consisting of 6,680 realistic human action videos taken from YouTube. For all 50 categories, the action videos are divided into 25 groups, where each group contains over 4 action clips. The action clips in the same group share some common features, such as a similar viewpoint, a similar background, the same person, and so on. We utilize the action bank features (Sadanand and Corso 2012) and five-fold data splitting to evaluate our method, where four folds are used for training and the remaining fold for testing. We employ PCA to reduce the features to 5,000 dimensions.

Table 1: Major parameters, determined by cross-validation.

|     | Yale B | AR    | Caltech 101 | Scene 15 | UCF 50 |
|-----|--------|-------|-------------|----------|--------|
| k   | 7      | 7     | 5           | 5        | 7      |
| λ2  | 0.001  | 0.001 | 0.001       | 0.010    | 0.001  |
| λ3  | 0.100  | 0.100 | 4.000       | 1.000    | 0.100  |
Table 2: Classification accuracies (%) on five datasets.

|         | Yale B | AR   | Caltech 101 | Scene 15 | UCF 50 |
|---------|--------|------|-------------|----------|--------|
| ADL+SVM | 95.4   | 96.1 | 64.5        | 90.1     | 72.3   |
| SRC     | 96.5   | 97.5 | 70.7        | 91.8     | 75.0   |
| CRC     | 97.0   | 98.0 | 68.2        | 92.0     | 75.6   |
| DLSI    | 97.0   | 97.5 | 73.1        | 91.7     | 75.4   |
| FDDL    | 96.7   | 97.5 | 73.2        | 92.3     | 76.5   |
| LC-KSVD | 96.7   | 97.8 | 73.6        | 92.9     | 70.1   |
| DPL     | 97.5   | 98.3 | 73.9        | 97.7     | 77.4   |
| DADL    | 97.7   | 98.7 | 74.6        | 98.3     | 78.0   |

5.2 Experiment Setup

We compare our proposed DADL method with the following methods: the baseline Analysis Dictionary Learning + Support Vector Machine (ADL+SVM) (Shekhar, Patel, and Chellappa 2014); the classical Sparse Representation-based Classifier (SRC) (Wright et al. 2009) and Collaborative Representation-based Classifier (CRC) (Zhang, Yang, and Feng 2011); three state-of-the-art dictionary learning methods, namely Dictionary Learning with Structured Incoherence (DLSI) (Ramirez, Sprechmann, and Sapiro 2010), Fisher Discrimination Dictionary Learning (FDDL) (Yang et al. 2011), and Label Consistent K-SVD (LC-KSVD) (Jiang, Lin, and Davis 2013); and the recently proposed projective Dictionary Pair Learning (DPL) (Gu et al. 2014). For a fair comparison, we follow the experimental settings in (Gu et al. 2014) for all the competing methods. We set the Gaussian kernel parameter σ = 10 and the balance weight λ1 = 10 in all our experiments; the experimental results are insensitive to σ ∈ [7, 13] and λ1 ∈ [10, 15]. Since the sparsity level depends on the sparse target codes H, which are determined by the information-theoretic rules described in Section 3.1, it can be handled without extra tuning. The other major parameters (k, λ2, λ3) for each database have been tuned by cross-validation; the best (k, λ2, λ3) for each database are listed in Table 1.

5.3 Results and Analysis

Given training data and the corresponding target codes, an optimized dictionary Ω can be learned by Algorithm 1. Then, we code both training and testing samples via the learned Ω. Finally, we treat these coding vectors as new features and employ the kNN classifier to perform classification. All the experiments are repeated 20 times with different random splits of training and testing images on each dataset. Reliable results of the different methods are reported in Table 2.

DADL achieves obviously higher accuracy than the basic ADL+SVM framework, which indicates that our proposed method significantly improves the discriminative ability of ADL. Compared with synthesis-dictionary-based classification methods (SRC, CRC, DLSI, FDDL, and LC-KSVD), our proposed DADL method achieves the best performance. Besides, our approach also gives a better result in comparison with the recently proposed projective Dictionary Pair Learning (DPL) method, which combines discriminative synthesis dictionary learning and ADL. For some datasets, all the competitors already achieve over 95% accuracy, so our method's improvement there is not visibly large. We can observe that DPL and our DADL obviously outperform the other DL methods, which from another side proves the strong vitality of the analysis dictionary in classification tasks. DPL in (Gu et al. 2014) also markedly outperforms state-of-the-art DL methods in terms of running time. Therefore, we conduct extra experiments to further evaluate the efficiency of our method.
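The evaluation pipeline described at the beginning of this subsection (code both training and testing samples with the learned Ω, then classify the codes with kNN) can be summarized by the following hedged sketch. The function name is ours, and since the paper does not state whether test codes are additionally hard-thresholded, the raw transform Ωy is used as the feature here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_with_dictionary(Omega, Y_train, labels_train, Y_test, k):
    """Code samples with the learned analysis dictionary, then classify the codes with kNN."""
    X_train = (Omega @ Y_train).T        # one coding vector per row, as sklearn expects
    X_test = (Omega @ Y_test).T
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, labels_train)
    return knn.predict(X_test)
```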
Our experiments are run in MATLAB R2013a on a desktop PC with an Intel Core i7-3770 processor at 3.40 GHz and 16.00 GB RAM. As reported in Tables 3 and 4, the lower time consumption shows the superiority of our method. Thus we can safely conclude that our proposed DADL method performs better in classification than state-of-the-art dictionary learning methods.

Table 3: Training time (s) on five datasets.

|      | Yale B | AR    | Caltech 101 | Scene 15 | UCF 50 |
|------|--------|-------|-------------|----------|--------|
| DPL  | 5.92   | 15.21 | 180.54      | 56.84    | 652.03 |
| DADL | 4.23   | 11.16 | 121.47      | 36.52    | 330.23 |

Table 4: Testing time (ms) on five datasets.

|      | Yale B | AR   | Caltech 101 | Scene 15 | UCF 50 |
|------|--------|------|-------------|----------|--------|
| DPL  | 0.19   | 0.42 | 1.45        | 1.36     | 1.62   |
| DADL | 0.16   | 0.39 | 1.39        | 1.31     | 1.48   |

6 Conclusion

In this paper, we propose a Discriminative Analysis Dictionary Learning (DADL) method. To make analysis dictionary learning (ADL) applicable to pattern classification tasks, a code consistent term has been introduced. Meanwhile, based on triplet constraints, a discriminative local topology preserving loss function has been developed, which simultaneously preserves neighborhood relationships as well as relative proximities in a supervised manner. Besides, the correntropy induced metric is utilized as a robust measure to handle outliers. Based on the half-quadratic (HQ) technique and an alternate search strategy, we have developed an iterative method to speed up the ADL process. Experimental results on five benchmark datasets show the effectiveness of our method against state-of-the-art dictionary learning methods.

Acknowledgments

This work is funded by the National Natural Science Foundation of China (Grant No. 61402079, 61473289, 61175003, 61172109), the Foundation for Innovative Research Groups of the NSFC (Grant No. 71421001), the Youth Innovation Promotion Association CAS (Grant No. 2015190), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB02000000), and the Open Project Program of the National Laboratory of Pattern Recognition (NLPR). We thank the anonymous reviewers and Dr. Ming Li from DLUT for their helpful comments.

References

Chen, B., and Principe, J. C. 2012. Maximum correntropy estimation is a smoothed MAP estimation. IEEE SPL 19(8):491-494.
Chen, B.; Xing, L.; Liang, J.; Zheng, N.; and Principe, J. C. 2014. Steady-state mean-square error analysis for adaptive filtering under the maximum correntropy criterion. IEEE SPL 21(7):880-884.
Chen, B.; Wang, J.; Zhao, H.; Zheng, N.; and Principe, J. C. 2015. Convergence of a fixed-point algorithm under maximum correntropy criterion. IEEE SPL 22(10):1723-1727.
Georghiades, A.; Belhumeur, P.; and Kriegman, D. 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE TPAMI 23(6):643-660.
Gu, S.; Zhang, L.; Zuo, W.; and Feng, X. 2014. Projective dictionary pair learning for pattern classification. In Proc. NIPS, 793-801.
He, R.; Hu, B.; Zheng, W.; and Kong, X. 2011. Robust principal component analysis based on maximum correntropy criterion. IEEE TIP 20(6):1485-1494.
He, R.; Tan, T.; Wang, L.; and Zheng, W. 2012. l2,1 regularized correntropy for robust feature selection. In Proc. CVPR, 2504-2511.
He, R.; Zheng, W.; Tan, T.; and Sun, Z. 2014. Half-quadratic-based iterative minimization for robust sparse representation. IEEE TPAMI 36(2):261-275.
He, R.; Tan, T.; and Wang, L. 2014. Recovery of corrupted low-rank matrix by implicit regularizers. IEEE TPAMI 36(4):770-783.
Jiang, Z.; Lin, Z.; and Davis, L. S. 2013. Label consistent K-SVD: Learning a discriminative dictionary for recognition. IEEE TPAMI 35(11):2651-2664.
Lazebnik, S.; Schmid, C.; and Ponce, J. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, volume 2, 2169-2178.
Li, F.; Fergus, R.; and Perona, P. 2007. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. CVIU 106(1):59-70.
Liu, W.; Pokharel, P. P.; and Principe, J. C. 2007. Correntropy: Properties and applications in non-Gaussian signal processing. IEEE TSP 55(11):5286-5298.
Lu, C.; Tang, J.; Lin, M.; Lin, L.; Yan, S.; and Lin, Z. 2013. Correntropy induced L2 graph for robust subspace clustering. In Proc. ICCV, 1801-1808.
Luo, D.; Ding, C.; Nie, F.; and Huang, H. 2011. Cauchy graph embedding. In Proc. ICML, 553-560.
Martinez, A., and Benavente, R. 1998. The AR face database. CVC Tech. Rep. 24.
Nikolova, M., and Ng, M. K. 2005. Analysis of half-quadratic minimization methods for signal and image recovery. SIAM J. Scientific Computing 27(3):937-966.
Ramirez, I.; Sprechmann, P.; and Sapiro, G. 2010. Classification and clustering via dictionary learning with structured incoherence and shared features. In Proc. CVPR, 3501-3508.
Ravishankar, S., and Bresler, Y. 2013. Learning sparsifying transforms. IEEE TSP 61(5):1072-1086.
Reddy, K. K., and Shah, M. 2013. Recognizing 50 human action categories of web videos. Mach. Vision Applicat. 24(5):971-981.
Rubinstein, R., and Elad, M. 2014. Dictionary learning for analysis-synthesis thresholding. IEEE TSP 62(22):5962-5972.
Rubinstein, R.; Bruckstein, A. M.; and Elad, M. 2010. Dictionaries for sparse representation modeling. Proc. IEEE 98(6):1045-1057.
Sadanand, S., and Corso, J. J. 2012. Action bank: A high-level representation of activity in video. In Proc. CVPR, 1234-1241.
Shekhar, S.; Patel, V. M.; and Chellappa, R. 2014. Analysis sparse coding models for image-based classification. In Proc. ICIP, 5207-5211.
Wright, J.; Yang, A. Y.; Ganesh, A.; Sastry, S. S.; and Ma, Y. 2009. Robust face recognition via sparse representation. IEEE TPAMI 31(2):210-227.
Yang, M.; Zhang, L.; Feng, X.; and Zhang, D. 2011. Fisher discrimination dictionary learning for sparse representation. In Proc. ICCV, 543-550.
Yang, S.; Luo, P.; Loy, C. C.; Shum, K. W.; and Tang, X. 2015. Deep representation learning with target coding. In Proc. AAAI.
Yuan, X., and Hu, B. 2009. Robust feature extraction via information theoretic learning. In Proc. ICML, 1193-1200.
Zhang, Y.; Sun, Z.; He, R.; and Tan, T. 2013. Robust subspace clustering via half-quadratic minimization. In Proc. ICCV, 3096-3103.
Zhang, S.; Zhang, M.; He, R.; and Sun, Z. 2014. Transform-invariant dictionary learning for face recognition. In Proc. ICIP, 348-352.
Zhang, G.; He, R.; and Davis, L. S. 2014. Jointly learning dictionaries and subspace structure for video-based face recognition. In Proc. ACCV, 348-352.
Zhang, L.; Yang, M.; and Feng, X. 2011. Sparse representation or collaborative representation: Which helps face recognition? In Proc. ICCV, 471-478.