# Robust Non-Negative Dictionary Learning

Qihe Pan¹, Deguang Kong², Chris Ding² and Bin Luo³
¹Beihang University, China; ²University of Texas, Arlington, U.S.A.; ³Anhui University, China
panqihe2006@gmail.com; doogkong@gmail.com; chqding@uta.edu; luobin@ahu.edu.cn

## Abstract

Dictionary learning plays an important role in machine learning, where data vectors are modeled as sparse linear combinations of basis factors (i.e., a dictionary). However, how to conduct dictionary learning in a noisy environment has not been well studied. Moreover, in practice, the dictionary (i.e., the low-rank approximation of the data matrix) and the sparse representations are often required to be non-negative, as in applications such as image annotation, document summarization, and microarray analysis. In this paper, we propose a new formulation for non-negative dictionary learning in a noisy environment, where structured sparsity is enforced on the sparse representation. The proposed formulation is also robust to data with noise and outliers, owing to the robust loss function used. We derive an efficient multiplicative updating algorithm to solve the optimization problem, in which the dictionary and the sparse representation are updated iteratively, and we rigorously prove the convergence and correctness of the proposed algorithm. We show how the learned dictionary differs at different levels of the sparsity constraint. The proposed algorithm can also be adapted for clustering and semi-supervised learning.

## Introduction

In dictionary learning, a signal is represented as a sparse combination of basis factors (called a dictionary), instead of predefined wavelets (Mallat 1999). Dictionary learning has shown state-of-the-art performance and has many applications, including image denoising (Elad and Aharon 2006), face recognition (Protter and Elad 2009), document clustering, and microarray analysis. Recent studies (Raina et al. 2007; Delgado et al. 2003; Mairal et al. 2009; Olshausen and Fieldt 1997) have shown that sparsity helps to eliminate data redundancy and capture the correlations inherent in the data. Compared with Principal Component Analysis (PCA), dictionary learning does not place a strict constraint (such as orthogonality) on the basis vectors, and thus the dictionary can be learned in a more flexible way.

The key to dictionary learning, in different contexts with different constraints, is to solve the corresponding optimization problem. For example, different objective functions (Aharon, Elad, and Bruckstein 2006; Mairal et al. 2010) have been proposed to meet the requirements of specific applications, e.g., supervised dictionary learning (Mairal et al. 2008), joint learning combining dictionary learning and clustering-based sparse representation (Dong et al. 2011), online dictionary learning (Kasiviswanathan et al. 2011), tensor decomposition for image storage (Zhang and Ding 2013), etc.

In this paper, we focus on a general non-negative dictionary learning problem in a noisy environment, i.e., the data may be noisy and have missing values. To summarize, the main contribution of this paper is three-fold. (1) We formulate the non-negative dictionary learning problem in a noisy environment as the optimization of a non-smooth loss function over the non-negative domain with a LASSO-type regularization term.
(2) It is challenging to solve this problem due to the non-smoothness of both the reconstruction error term and the sparsity regularization term. In contrast to recent second-order iterative algorithms used for dictionary learning (e.g., (Lee et al. 2007; Aharon, Elad, and Bruckstein 2006)), we propose an efficient multiplicative updating algorithm, and the convergence and correctness of the algorithm are rigorously proved. (3) As shown in the experiments, our algorithm converges very fast, and the learned sparse coding Y can be used for clustering and semi-supervised learning.

## Robust Dictionary Learning

### Objective

In standard dictionary learning, given a set of training signals $X = (x_1, \cdots, x_n)$, where $x_i \in \mathbb{R}^p$ is the $i$-th signal, the dictionary $A$ and the sparse codes $y_i$ are learned by solving

$$\min_{A,\{y_i\}} \sum_i \|x_i - A y_i\|_2^2 + \lambda \|y_i\|_1, \quad (1)$$

where $\lambda > 0$ is a parameter.
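As a concrete illustration, the following is a minimal sketch (our own, not from the paper) of how the Eq.(1) objective can be evaluated with numpy; the matrix shapes and variable names are assumptions for the example.

```python
import numpy as np

def dictionary_learning_objective(X, A, Y, lam):
    """Value of Eq.(1): sum_i ||x_i - A y_i||_2^2 + lam * ||y_i||_1.

    X : (p, n) signal matrix, columns are the signals x_i
    A : (p, k) dictionary, columns are the basis factors (atoms)
    Y : (k, n) sparse codes, columns are the y_i
    """
    residual = X - A @ Y
    reconstruction = np.sum(residual ** 2)   # sum of squared l2 reconstruction errors
    sparsity = lam * np.abs(Y).sum()         # l1 penalty on the codes
    return reconstruction + sparsity

# toy usage with random non-negative data (illustrative only)
rng = np.random.default_rng(0)
X = rng.random((20, 50))
A = rng.random((20, 5))
Y = rng.random((5, 50))
print(dictionary_learning_objective(X, A, Y, lam=0.1))
```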

Note that the standard least-squares loss is used in Eq.(1), which implicitly assumes Gaussian noise in the input signals. However, in the real world, data measurements can be noisy and contain missing values, and it is well known that the least-squares loss is sensitive to noise and large deviations. Replacing the least-squares loss of Eq.(1) with the more robust $\ell_1$ loss, robust dictionary learning becomes

$$\min_{A,\{y_i\}} \sum_i \|x_i - A y_i\|_1 + \lambda \|y_i\|_1, \quad (2)$$

where the dictionary $A \in \mathcal{C}$, and $\mathcal{C}$ is the feasible domain of the problem, i.e., $\mathcal{C} = \{A \mid A \ge 0\}$ or $\mathcal{C} = \{A \mid \|a_j\|_2 \le 1\}$ ($a_j$ is the $j$-th column of $A$). In real-world problems (such as image features, text vectors, etc.), the input data are non-negative, which requires the dictionary to be non-negative, i.e., $A \ge 0$. Naturally, the sparse representation $y_i$ of each signal $x_i$ should also be non-negative.

### Problem Formulation

Thus, in this paper, we focus on the feasible domain $\mathcal{C} = \{A \mid A \ge 0\}$. The objective of Eq.(2) then becomes

$$\min_{A,\{y_i\}} \sum_i \|x_i - A y_i\|_1 + \lambda \|y_i\|_1 + \beta \|A\|_F^2, \quad \text{s.t. } A \ge 0,\; y_i \ge 0. \quad (3)$$

Note that the smooth term $\beta \|A\|_F^2$ is added in Eq.(3) to avoid a trivial solution, so in practice we require $\beta > 0$. If $\beta = 0$ and $(A^*, y_i^*)$ is the current optimal solution of Eq.(3), then we can always obtain a better solution $(A', y_i')$ with a smaller objective value, where $A' = \gamma A^*$, $y_i' = \frac{1}{\gamma} y_i^*$, and $\gamma > 1$.

In matrix form, let $Y = [y_1, y_2, \cdots, y_n]$; then Eq.(3) becomes

$$\min_{Y,A} \|X - AY\|_1 + \lambda \|Y\|_1 + \beta \|A\|_F^2, \quad \text{s.t. } A \ge 0,\; Y \ge 0, \quad (4)$$

where $\|Y\|_1 = \sum_{ki} |Y_{ki}|$. By introducing Lagrangian multipliers to enforce the constraints, Eq.(4) can be equivalently expressed as

$$\min_{Y,A} \|X - AY\|_1, \quad \text{s.t. } A \ge 0,\; Y \ge 0,\; \|Y\|_1 \le q,\; \|A\|_F^2 \le p. \quad (5)$$

The optimization of Eq.(4) is in general non-convex, but if one of the variables ($A$ or $Y$) is fixed, the global optimal solution with respect to the other variable can be found. Note that two non-smooth $\ell_1$ terms are involved in Eq.(4), which makes Eq.(4) somewhat challenging to solve. However, the $\ell_1$ regularization itself adds no extra difficulty, because in Eq.(4) it appears together with the non-negativity constraint. Thus the $\ell_1$ term on the sparse coding can be rewritten as $\|Y\|_1 = \mathrm{Tr}(EY)$, where $E$ is a matrix of all 1s, so that $\mathrm{Tr}(EY) = \sum_{ki} Y_{ki}$ when $Y \ge 0$.
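To make the matrix formulation concrete, the sketch below (again our own illustration in numpy; the shapes and names are assumptions, not the paper's notation) evaluates the robust objective of Eq.(4) and numerically checks the identity $\|Y\|_1 = \mathrm{Tr}(EY)$ for a non-negative $Y$, with $E$ the all-ones matrix.

```python
import numpy as np

def robust_objective(X, A, Y, lam, beta):
    """Value of Eq.(4): ||X - AY||_1 + lam*||Y||_1 + beta*||A||_F^2."""
    reconstruction = np.abs(X - A @ Y).sum()   # elementwise l1 loss, robust to outliers
    sparsity = lam * np.abs(Y).sum()           # l1 penalty on the codes
    smooth = beta * np.sum(A ** 2)             # Frobenius-norm penalty on the dictionary
    return reconstruction + sparsity + smooth

rng = np.random.default_rng(0)
X = rng.random((20, 50))
A = rng.random((20, 5))
Y = rng.random((5, 50))                        # non-negative codes

# For Y >= 0, ||Y||_1 equals Tr(EY), where E is the all-ones matrix of matching shape.
E = np.ones((Y.shape[1], Y.shape[0]))          # n x k, so that E @ Y is n x n
assert np.isclose(np.abs(Y).sum(), np.trace(E @ Y))

print(robust_objective(X, A, Y, lam=0.1, beta=0.01))
```

This check is the reason the $\ell_1$ penalty causes no extra trouble under non-negativity: it becomes a linear function of $Y$.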