# Robust Non-Negative Dictionary Learning

Qihe Pan¹, Deguang Kong², Chris Ding² and Bin Luo³
¹Beihang University, China; ²University of Texas, Arlington, U.S.A.; ³Anhui University, China
panqihe2006@gmail.com; doogkong@gmail.com; chqding@uta.edu; luobin@ahu.edu.cn

## Abstract

Dictionary learning plays an important role in machine learning, where data vectors are modeled as sparse linear combinations of basis factors (i.e., a dictionary). However, how to conduct dictionary learning in a noisy environment has not been well studied. Moreover, in practice, the dictionary (i.e., the low-rank approximation of the data matrix) and the sparse representations are often required to be non-negative, as in applications such as image annotation, document summarization, and microarray analysis. In this paper, we propose a new formulation for non-negative dictionary learning in a noisy environment, where structured sparsity is enforced on the sparse representation. The proposed formulation is also robust to data with noise and outliers, owing to the robust loss function used. We derive an efficient multiplicative updating algorithm to solve the optimization problem, in which the dictionary and the sparse representation are updated iteratively, and we rigorously prove the convergence and correctness of the proposed algorithm. We show how the learned dictionary differs at different levels of the sparsity constraint. The proposed algorithm can also be adapted for clustering and semi-supervised learning.

## Introduction

In dictionary learning, a signal is represented as a sparse combination of basis factors (called a dictionary), instead of predefined wavelets (Mallat 1999). Dictionary learning has shown state-of-the-art performance and has many applications, including image denoising (Elad and Aharon 2006), face recognition (Protter and Elad 2009), document clustering, and microarray analysis. Recent studies (Raina et al. 2007; Delgado et al. 2003; Mairal et al. 2009; Olshausen and Fieldt 1997) have shown that sparsity helps to eliminate data redundancy and capture the correlations inherent in the data. Compared with Principal Component Analysis (PCA), dictionary learning does not place a strict constraint (such as orthogonality) on the basis vectors, and thus the dictionary can be learned in a more flexible way.

The key to dictionary learning, in different contexts with different constraints, is to solve the corresponding optimization problem. For example, different objective functions (Aharon, Elad, and Bruckstein 2006; Mairal et al. 2010) have been proposed to meet the requirements of specific applications, e.g., supervised dictionary learning (Mairal et al. 2008), joint learning combining dictionary learning and clustering-based sparse representation (Dong et al. 2011), online dictionary learning (Kasiviswanathan et al. 2011), tensor decomposition for image storage (Zhang and Ding 2013), etc.

In this paper, we focus on a general non-negative dictionary learning problem in a noisy environment, i.e., the data may be noisy and have missing values. To summarize, the main contribution of this paper is three-fold. (1) We formulate the non-negative dictionary learning problem in a noisy environment as the optimization of a non-smooth loss function over the non-negative domain with a LASSO-type regularization term.
(2) It is challenging to solve this problem due to the non-smoothness of both the reconstruction error term and the sparsity regularization term. In contrast to recent second-order iterative algorithms used for dictionary learning (e.g., (Lee et al. 2007; Aharon, Elad, and Bruckstein 2006)), we propose an efficient multiplicative updating algorithm, and the convergence and correctness of the algorithm are rigorously proved. (3) As shown in the experiments, our algorithm converges very fast, and the learned sparse coding Y can be used for clustering and semi-supervised learning.

## Robust Dictionary Learning

### Objective

In standard dictionary learning, given a set of training signals $X = (x_1, \cdots, x_n)$, where $x_i \in \mathbb{R}^p$ is the $i$-th signal, the dictionary $A$ and the sparse codes $y_i$ are learned by solving

$$\min_{A,\{y_i\}} \sum_i \|x_i - A y_i\|_2^2 + \lambda \|y_i\|_1, \quad (1)$$

where $\lambda > 0$ is a parameter.
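As a concrete illustration, the following is a minimal sketch (our own, not from the paper) of how the Eq.(1) objective can be evaluated with numpy; the matrix shapes and variable names are assumptions for the example.

```python
import numpy as np

def dictionary_learning_objective(X, A, Y, lam):
    """Value of Eq.(1): sum_i ||x_i - A y_i||_2^2 + lam * ||y_i||_1.

    X : (p, n) signal matrix, columns are the signals x_i
    A : (p, k) dictionary, columns are the basis factors (atoms)
    Y : (k, n) sparse codes, columns are the y_i
    """
    residual = X - A @ Y
    reconstruction = np.sum(residual ** 2)   # sum of squared l2 reconstruction errors
    sparsity = lam * np.abs(Y).sum()         # l1 penalty on the codes
    return reconstruction + sparsity

# toy usage with random non-negative data (illustrative only)
rng = np.random.default_rng(0)
X = rng.random((20, 50))
A = rng.random((20, 5))
Y = rng.random((5, 50))
print(dictionary_learning_objective(X, A, Y, lam=0.1))
```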

Note that the standard least-squares loss is used in Eq.(1), which implicitly assumes Gaussian noise in the input signals. However, in the real world, data measurements can be noisy and contain missing values, and it is well known that the least-squares loss is sensitive to noise and large deviations. Replacing the least-squares loss of Eq.(1) with the more robust $\ell_1$ loss, robust dictionary learning becomes

$$\min_{A,\{y_i\}} \sum_i \|x_i - A y_i\|_1 + \lambda \|y_i\|_1, \quad (2)$$

where the dictionary $A \in \mathcal{C}$, and $\mathcal{C}$ is the feasible domain of the problem, i.e., $\mathcal{C} = \{A \mid A \ge 0\}$ or $\mathcal{C} = \{A \mid \|a_j\|_2 \le 1\}$ ($a_j$ is the $j$-th column of $A$). In real-world problems (such as image features, text vectors, etc.), the input data are non-negative, which requires the dictionary to be non-negative, i.e., $A \ge 0$. Naturally, the sparse representation $y_i$ of each signal $x_i$ should also be non-negative.

### Problem Formulation

Thus, in this paper, we focus on the feasible domain $\mathcal{C} = \{A \mid A \ge 0\}$. The objective of Eq.(2) then becomes

$$\min_{A,\{y_i\}} \sum_i \|x_i - A y_i\|_1 + \lambda \|y_i\|_1 + \beta \|A\|_F^2, \quad \text{s.t. } A \ge 0,\; y_i \ge 0. \quad (3)$$

Note that the smooth term $\beta \|A\|_F^2$ is added in Eq.(3) to avoid a trivial solution, so in practice we require $\beta > 0$. If $\beta = 0$ and $(A^*, y_i^*)$ is the current optimal solution of Eq.(3), then we can always obtain a better solution $(A', y_i')$ with a smaller objective value, where $A' = \gamma A^*$, $y_i' = \frac{1}{\gamma} y_i^*$, and $\gamma > 1$.

In matrix form, let $Y = [y_1, y_2, \cdots, y_n]$; then Eq.(3) becomes

$$\min_{Y,A} \|X - AY\|_1 + \lambda \|Y\|_1 + \beta \|A\|_F^2, \quad \text{s.t. } A \ge 0,\; Y \ge 0, \quad (4)$$

where $\|Y\|_1 = \sum_{ki} |Y_{ki}|$. By introducing Lagrangian multipliers to enforce the constraints, Eq.(4) can be equivalently expressed as

$$\min_{Y,A} \|X - AY\|_1, \quad \text{s.t. } A \ge 0,\; Y \ge 0,\; \|Y\|_1 \le q,\; \|A\|_F^2 \le p. \quad (5)$$

The optimization of Eq.(4) is in general non-convex, but if one of the variables ($A$ or $Y$) is fixed, the global optimal solution with respect to the other variable can be found. Note that two non-smooth $\ell_1$ terms are involved in Eq.(4), which makes Eq.(4) somewhat challenging to solve. However, the $\ell_1$ regularization itself adds no extra difficulty, because in Eq.(4) it appears together with the non-negativity constraint. Thus the $\ell_1$ term on the sparse coding can be rewritten as $\|Y\|_1 = \mathrm{Tr}(EY)$, where $E$ is a matrix of all 1s, so that $\mathrm{Tr}(EY) = \sum_{ki} Y_{ki}$ when $Y \ge 0$.
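To make the matrix formulation concrete, the sketch below (again our own illustration in numpy; the shapes and names are assumptions, not the paper's notation) evaluates the robust objective of Eq.(4) and numerically checks the identity $\|Y\|_1 = \mathrm{Tr}(EY)$ for a non-negative $Y$, with $E$ the all-ones matrix.

```python
import numpy as np

def robust_objective(X, A, Y, lam, beta):
    """Value of Eq.(4): ||X - AY||_1 + lam*||Y||_1 + beta*||A||_F^2."""
    reconstruction = np.abs(X - A @ Y).sum()   # elementwise l1 loss, robust to outliers
    sparsity = lam * np.abs(Y).sum()           # l1 penalty on the codes
    smooth = beta * np.sum(A ** 2)             # Frobenius-norm penalty on the dictionary
    return reconstruction + sparsity + smooth

rng = np.random.default_rng(0)
X = rng.random((20, 50))
A = rng.random((20, 5))
Y = rng.random((5, 50))                        # non-negative codes

# For Y >= 0, ||Y||_1 equals Tr(EY), where E is the all-ones matrix of matching shape.
E = np.ones((Y.shape[1], Y.shape[0]))          # n x k, so that E @ Y is n x n
assert np.isclose(np.abs(Y).sum(), np.trace(E @ Y))

print(robust_objective(X, A, Y, lam=0.1, beta=0.01))
```

This check is the reason the $\ell_1$ penalty causes no extra trouble under non-negativity: it becomes a linear function of $Y$.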