# discriminative_robust_transformation_learning__c74e440c.pdf Discriminative Robust Transformation Learning Jiaji Huang Qiang Qiu Guillermo Sapiro Robert Calderbank Department of Electrical Engineering, Duke University Durham, NC 27708 {jiaji.huang,qiang.qiu,guillermo.sapiro,robert.calderbank}@duke.edu This paper proposes a framework for learning features that are robust to data variation, which is particularly important when only a limited number of training samples are available. The framework makes it possible to tradeoff the discriminative value of learned features against the generalization error of the learning algorithm. Robustness is achieved by encouraging the transform that maps data to features to be a local isometry. This geometric property is shown to improve (K, ϵ)-robustness, thereby providing theoretical justification for reductions in generalization error observed in experiments. The proposed optimization framework is used to train standard learning algorithms such as deep neural networks. Experimental results obtained on benchmark datasets, such as labeled faces in the wild, demonstrate the value of being able to balance discrimination and robustness. 1 Introduction Learning features that are able to discriminate is a classical problem in data analysis. The basic idea is to reduce the variance within a class while increasing it between classes. One way to implement this is by regularizing a certain measure of the variance, while assuming some prior knowledge about the data. For example, Linear Discriminant Analysis (LDA) [4] measures sample covariance and implicitly assumes that each class is Gaussian distributed. The Low Rank Transform (LRT) [10], instead uses nuclear norm to measure the variance and assumes that each class is near a low-rank subspace. A different approach is to regularize the pairwise distances between data points. Examples include the seminal work on metric learning [17] and its extensions [5,6,16]. While great attention has been paid to designing objectives to encourage discrimination, less effort has been made in understanding and encouraging robustness to data variation, which is especially important when a limited number of training samples are available. One exception is [19], which promotes robustness by regularizing the traditional metric learning objective using prior knowledge from an auxiliary unlabeled dataset. In this paper we develop a general framework for balancing discrimination and robustness. Robustness is achieved by encouraging the learned data-to-features transform to be locally an isometry within each class. We theoretically justify this approach using (K, ϵ)-robustness [1,18] and give an example of the proposed formulation, incorporating it in deep neural networks. Experiments validate the capability to trade-off discrimination against robustness. Our main contributions are the following: 1) prove that locally near isometry leads to robustness; 2) propose a practical framework that allows to robustify a wide class of learned transforms, both linear and nonlinear; 3) provide an explicit realization of the proposed framework, achieving competitive results on difficult face verification tasks. The paper is organized as follows. Section 2 motivates the proposed study and proposes a general formulation for learning a Discriminative Robust Transform (DRT). Section 3 provides a theoretical justification for the framework by making an explicit connection to robustness. Section 4 gives a specific example of DRT, denoted as Euc-DRT. Section 5 provides experimental validation of Euc DRT, and section 6 presents conclusions. 1 2 Problem Formulation Consider an L-way classification problem. The training set is denoted by T = {(xi, yi)}, where xi Rn is the data and yi {1, . . . , L} is the class label. We want to learn a feature transform fα( ) such that a datum x becomes more discriminative when it is transformed to feature fα(x). The transform fα is parametrized by a vector α, a framework that includes linear transforms and neural networks where the entries of α are the learned network parameters. 2.1 Motivation The transform fα promotes discriminability by reducing intra-class variance and enlarging interclass variance. This aim is expressed in the design of objective functions [5, 10] or the structure of the transform fα [7, 11]. However the robustness of the learned transform is an important issue that is often overlooked. When training samples are scarce, statistical learning theory [15] predicts overfitting to the training data. The result of overfitting is that discrimination achieved on test data will be significantly worse than that on training data. Our aim in this paper is the design of robust transforms fα for which the training-to-testing degradation is small [18]. We formally measure robustness of the learned transform fα in terms of (K, ϵ)-robustness [1]. Given a distance metric ρ, a learning algorithm is said to be (K, ϵ)-robust if the input data space can be partitioned into K disjoint sets Sk, k = 1, ..., K, such that for all training sets T , the learned parameter αT determines a loss for which the value on pairs of training samples taken from different sets Sj and Sk is very close to the value of any pair of data samples taken from Sj and Sk. (K, ϵ)-robustness is illustrated in Fig. 1, where S1 and S2 are both of diameter γ and |e e | = |ρ(fα(x1), fα(x2)) ρ(fα(x 1), fα(x 2))|. If the transform fα preserves all distances within S1 and S2, then |e e | cannot deviate much from |d d | 2γ. Figure 1: (K, ϵ)-robustness: Here d = ρ(x1, x2), d = ρ(x 1, x 2), e = ρ(fα(x1), fα(x2)), and e = ρ(fα(x 1), fα(x 2)). The difference |e e | cannot deviate too much from |d d |. 2.2 Formulation and Discussion Motivated by the above reasoning, we now present our proposed framework. First we define a pair label ℓi,j 1 if yi = yj 1 otherwise . Given a metric ρ, we use the following hinge loss to encourage high inter-class distance and small intra-class distance. 1 |P| i,j P max {0, ℓi,j [ρ (fα(xi), fα(xj)) t(ℓi,j)]} , (1) Here P = {(i, j|i = j)} is the set of all data pairs. t(ℓi,j) 0 is a function of ℓi,j and t(1) < t( 1). Similar to metric learning [17], this loss function connects pairwise distance to discrimination. However traditional metric learning typically assumes squared Euclidean distance and here the metric ρ can be arbitrary. For robustness, as discussed above, we may want fα( ) to be distance-preserving within each small local region. In particular, we define the set of all local neighborhoods as NB {(i, j)|ℓi,j = 1, ρ(xi, xj) γ} . 1A note on the notations: matrices (vectors) are denoted in upper (lower) case bold letters. Scalars are denoted in plain letters. Therefore, we minimize the following objective function 1 |NB| (i,j) N B |ρ(fα(xi), fα(xj)) ρ(xi, xj)| . (2) Note that we do not need to have the same metric in both the input and the feature space, they do not even have in general the same dimension. With a slight abuse of notation we use the same symbol to denote both metrics. To achieve discrimination and robustness simultaneously, we formulate the objective function as a weighted linear combination of the two extreme cases in (1) and (2) λ |P| i,j P max {0, ℓi,j [ρ (fα(xi), fα(xj)) t(ℓi,j)]}+1 λ (i,j) N B |ρ(fα(xi), fα(xj)) ρ(xi, xj)| (3) where λ [0, 1]. The formulation (3) balances discrimination and robustness. When λ = 1 it seeks discrimination, and as λ decreases it starts to encourage robustness. We shall refer to a transform that is learned by solving (3) as a Discriminative Robust Transform (DRT). The DRT framework provides opportunity to select both the distance measure and the transform family. 3 Theoretical Analysis In this section, we provide a theoretical explanation for robustness. In particular, we show that if the solution to (1) yields a transform fα that is locally a near isometry, then fα is robust. 3.1 Theoretical Framework Let X denote the original data, let Y = {1, ..., L} denote the set of class labels, and let Z = X Y. The training samples are pairs zi = (xi, yi), i = 1, . . . , n drawn from some unknown distribution D defined on Z. The indicator function is defined as ℓi,j = 1 if yi = yj and 1 otherwise. Let fα be a transform that maps a low-level feature x to a more discriminative feature fα(x), and let F denote the space of transformed features. For simplicity we consider an arbitrary metric ρ defined on both X and F (the general case of different metrics is a straightforward extension), and a loss function g(ρ(fα(xi), fα(xj)), ℓi,j) that encourages ρ(fα(xi), fα(xj)) to be small (big) if ℓi,j = 1 ( 1). We shall require the Lipschtiz constant of g( , 1) and g( , 1) to be upper bounded by A > 0. Note that the loss function in Eq. (1) has a Lipschtiz constant of 1. We abbreviate g(ρ(fα(xi), fα(xj)), ℓi,j) hα(zi, zj). The empirical loss on the training set is a function of α given by Remp(α) 2 n(n 1) Pn i,j=1 i =j hα(zi, zj), (4) and the expected loss on the test data is given by R(α) Ez 1,z 2 D [hα(z 1, z 2)] . (5) The algorithm operates on pairs of training samples and finds parameters αT arg min α Remp(α), (6) that minimize the empirical loss on the training set T . The difference Remp R between expected loss on the test data and empirical loss on the training data is the generalization error of the algorithm. 3.2 (K, ϵ)-robustness and Covering Number We work with the following definition of (K, ϵ)-robustness [1]. Definition 1. A learning algorithm is (K, ϵ)-robust if Z = X Y can be partitioned into K disjoint sets Zk, k = 1, . . . , K such that for all training sets T Zn, the learned parameter αT determines a loss function where the value on pairs of training samples taken from sets Zp and Zq is very close to the value of any pair of data samples taken from Zp and Zq. Formally, assume zi, zj T , with zi Zp and zj Zq, if z i Zp and z j Zq, then hαT (zi, zj) hαT (z i, z j) ϵ. Remark 1. (K, ϵ)-robustness means that the loss incurred by a testing pair (z i, z j) in Zp Zq is very close to the loss incurred by any training pair (zi, zj) in Zp Zq. It is shown in [1] that the generalization error of (K, ϵ)-robust algorithms is bounded as R(αT ) Remp(αT ) ϵ + O Therefore the smaller ϵ, the smaller is the generalization error, and the more robust is the learning algorithm. Given a metric space, the covering number specifies how many balls of a given radius are needed to cover the space. The more complex the metric space, the more balls are needed to cover it. Covering number is formally defined as follows. Definition 2 (Covering number). Given a metric space (S, ρ), we say that a subset ˆS of S is a γ-cover of S, if for every element s S, there exists ˆs ˆS such that ρ(s,ˆs) γ. The γ-covering number of S is Nγ(S, ρ) = min{| ˆS| : ˆS is a γ-cover of S}. Remark 2. The covering number is a measure of the geometric complexity of (S, ρ). A set S with covering number Nγ/2(S, ρ) can be partitioned into Nγ/2(S, ρ) disjoint subsets, such that any two points within the same subset are separated by no more than γ. Lemma 1. The metric space Z = X Y can be partitioned into LNγ/2(X, ρ) subsets, denoted as Z1, . . . , ZLNγ/2(X,ρ), such that any two points z1 (x1, y1), z2 (x2, y2) in the same subset satisfy y1 = y2 and ρ(x1, x2) γ. Proof. Assuming the metric space (X, ρ) is compact, we can partition X into Nγ/2(X, ρ) subsets, each with diameter at most γ. Since Y is a finite set of size L, we can partition Z = X Y into LNγ/2(X, ρ) subsets with the property that two samples (x1, y1), (x2, y2) in the same subset satisfy y1 = y2 and ρ(x1, x2) γ. It follows from Lemma 1 that we may partition X into subsets X1, . . . , XLNγ/2(X,ρ), such that pairs of points x1, x2 from the same subset have the same label and satisfy ρ(xi, xj) γ. Before we connect local geometry to robustness we need one more definition. We say that a learned transform fα is a δ-isometry if the metric is distorted by at most δ: Definition 3 (δ-isometry). Let A, B be metric spaces with metrics ρA and ρB. A map f : A 7 B is a δ-isometry if for any a1, a2 A, |ρA(f(a1), f(a2)) ρB(a1, a2)| δ. Theorem 1. Let fα be a transform derived via Eq. (6) and let X1, . . . , XLNγ/2(X,ρ) be a cover of X as described above. If fα is a δ-isometry, then it is (LNγ/2(X, ρ), 2A(γ + δ))-robust. Proof sketch. Consider training samples zi, zj and testing samples z i, z j such that zi, z i Zp and zj, z j Zq for some p, q {1, . . . , LNγ/2(X, ρ)}. Then by Lemma 1, ρ(xi, x i) γ and ρ(xj, x j) γ, yi = y i and yj = y j, and xi, x i Xp and xj, x j Xq. By definition of δ-isometry, |ρ(fαT (xi), fαT (x i)) ρ(xi, x i)| δ and |ρ(fαT (xj), fαT (x j)) ρ(xj, x j)| δ. Rearranging the terms gives ρ(fαT (xi), fαT (x i)) ρ(xi, x i)+δ γ +δ and ρ(fαT (xj), fαT (x j)) ρ(xj, x j)+δ γ +δ. Figure 2: Proof without words. In order to bound the generalization error, we need to bound the difference between ρ(fαT (xi), fαT (xj)) and ρ(fαT (x i), fαT (x j)). The details can be found in [9]; here we appeal to the proof schematic in Fig. 2. We need to bound |e e | and it cannot exceed twice the diameter of a local region in the transformed domain. Robustness of the learning algorithm depends on the granularity of the cover and the degree to which the learned transform fα distorts distances between pairs of points in the same covering subset. The subsets in the cover constitute regions where the local geometry makes it possible to bound generalization error. It now follows from [1] that the generalization error satisfies R(αT ) Remp(αT ) 2A(γ + δ) + O q n . The DRT proposed here is a particular example of a local isometry, and Theorem 1 explains why the generalization error is smaller than that of pure metric learning. The transform described in [9] partitions the metric space X into exactly L subsets, one for each class. The experiments reported in Section 5 demonstrate that the performance improvements derived from working with a finer partition can be worth the cost of learning finer grained local regions. 4 An Illustrative Realization of DRT Having justified robustness, we now provide a realization of the proposed general DRT where the metric ρ is Euclidean distance. We use Gaussian random variables to initialize α, then, on the randomly transformed data, we set t(1) (t( 1)) to be the average intra-class (inter-class) pairwise distance. In all our experiments, the solution satisfied the condition t(1) < t( 1) required in Eq. (1). We calculate the diameter γ of the local regions NB indirectly, using the κ-nearest neighbors of each training sample to define a local neighborhood. We leave the question of how best to initialize the indicator t and the diameter γ for future research. We denote this particular example as Euc-DRT and use gradient descent to solve for α. Denoting the objective by J, we define yi fα(xi), δi,j fα(xi) fα(xj), and ρ0 i.j xi xj . Then J yi = X (i,j) P ℓi,j( δi,j t(ℓi,j))>0 λ |P| ℓi,j δi,j δi,j + X 1 λ |NB| sgn( δi,j ρ0 i,j) δi,j δi,j . (8) In general, fα defines a D-layer neural network (when D = 1 it defines a linear transform). Let α(d) be the linear weights at the d-th layer, and let x(d) be the output of the d-th layer, so that yi = x(D) i . Then the gradients are computed as, α(D) , and J α(d) = X x(d+1) i x(d+1) i x(d) i x(d) i α(d) for 1 d D 1. (9) Algorithm 1 provides a summary, and we note that the extension to stochastic training using minbatches is straightforward. 5 Experimental Results In this section we report on experiments that confirm robustness of Euc-DRT. Recall that empirical loss is given by Eq. (4) where α is learned as αT from the training set T , and |T | = N. The generalization error is R Remp where the expected loss R is estimated using a large test set. 5.1 Toy Example This illustrative example is motivated by the discussion in Section 2.1. We first generate a 2D dataset consisting of two noisy half-moons, then use a random 100 2 matrix to embed the data in a 100-dimensional space. We learn a linear transform fα that maps the 100 dimensional data to 2 dimensional features, and we use κ = 5 nearest neighbors to construct the set NB. We consider λ = 1, 0.5, 0.25, representing the most discriminative, balanced, and more robust scenarios. When λ = 1 the transformed training samples are rather discriminative (Fig. 3a), but when the transform is applied to testing data, the two classes are more mixed (Fig. 3d). When λ = 0.5, the Algorithm 1 Gradient descent solver for Euc-DRT Input: λ [0, 1], training pairs {(xi, xj, ℓi,j)}, a pre-defined D-layer network (D = 1 as linear transform), stepsize η, neighborhood size κ. Output: α 1: Randomly initialize α, compute yi = fα(xi). 2: On the yi, compute the average intra and inter-class pairwise distances, assign to t(1), t( 1) 3: For each training datum, find its κ nearest neighbor and define the set NB. 4: while stable objective not achieved do 5: Compute yi = fα(xi) by a forward pass. 6: Compute objective J. 7: Compute J yi as Eq. (8). 8: for l = D down to 1 do 9: Compute J α(d) as Eq. (9). 10: α(d) α(d) η J α(d) . 11: end for 12: end while -20 0 20 -30 (a) λ = 1 Transformed training samples. (discriminative case) -20 0 20 -30 (b) λ = 0.5 transformed training samples. (balanced case) -20 0 20 -30 (c) λ = 0.25 Transformed training samples. (robust case) -20 0 20 -30 (d) λ = 1 Transformed testing samples. (discriminative case) -20 0 20 -30 (e) λ = 0.5 transformed testing samples. (balanced case) -20 0 20 -30 (f) λ = 0.25 Transformed testing samples. (robust case) Figure 3: Original and transformed training/testing samples embedded in 2-dimensional space with different colors representing different classes. transformed training data are more dispersed within each class (Fig. 3b), hence less easily separated than when λ = 1. However Fig. 3e shows that it is easier to separate the two classes on the test data. When λ = 0.25, robustness is preferred to discriminative power as shown in Figs. 3c and 3f. Tab. 1 quantifies empirical loss Remp, generalization error, and classification performance (by 1-nn) for λ = 1, 0.5 and 0.25. As λ decreases, Remp increases, indicating loss of discrimination on the training set. However, generalization error decreases, implying more robustness. We conclude that by varying λ, we can balance discrimination and robustness. 5.2 MNIST Classfication Using a Very Small Training Set The transform fα learned in the previous section was linear, and we now apply a more sophisticated convolutional neural network to the MNIST dataset. The network structure is similar to Le Net, and is Table 1: Varying λ on a toy dataset. λ 1 0.5 0.25 Remp 1.5983 1.6025 1.9439 generalization error 10.5855 9.5071 8.8040 1-nn accuracy 92.20% 98.30% 91.55% (original data 93.35%) Table 2: Classification error on MNIST. Training/class 30 50 70 100 original pixels 81.91% 86.18% 86.86% 88.49% Le Net 87.51% 89.89% 91.24% 92.75% DML 92.32% 94.45% 95.67% 96.19% Euc-DRT 94.14% 95.20% 96.05% 96.21% Table 3: Implementation details of the neural network for MNIST classification. name parameters conv1 size: 5 5 1 20 stride: 1, pad: 0 pool1 size: 2 2 conv2 size: 5 5 20 50 stride: 1, pad: 0 pool2 size: 2 2 conv3 size: 4 4 50 128 stride: 1, pad: 0 made up of alternating convolutional layers and pooling layers, with parameters detailed in Table 3. We map the original 784-dimensional pixel values (28x28 image) to 128-dimensional features. While state-of-art results often use the full training set (6,000 training samples per class), here we are interested in small training sets. We use only 30 training samples per class, and we use κ = 7 nearest neighbors to define local regions in Euc-DRT. We vary λ and study empirical error, generalization error, and classification accuracy (1-nn). We observe in Fig. 4 that when λ decreases, the empirical error also decreases, but that the generalization error actually increases. By balancing between these two factors, a peak classification accuracy is achieved at λ = 0.25. Next, we use 30, 50, 70, 100 λ 0 0.25 0.5 0.75 1 λ 0 0.25 0.5 0.75 1 λ 0 0.25 0.5 0.75 1 1-nn accuracy(%) Figure 4: MNIST test: with only 30 training samples per class. We vary λ and assess (a) Remp; (b) generalization error; and (c) 1-nn classification accuracy. Peak accuracy is achieved at λ = 0.25. training samples per class and compare the performance of Euc-DRT with Le Net and Deep Metric Learning (DML) [7]. DML minimizes a hinge loss on the squared Euclidean distances. It shares the same spirit with our Euc-DRT using λ = 1. All methods use the same network structure, Tab. 3, to map to the features. For classification, Le Net uses a linear softmax classifier on top of the conv3 layer and minimizes the standard cross-entropy loss during training. DML and Euc-DRT both use a 1-nn classifier on the learned features. Classification accuracies are reported in Tab. 2. In Tab. 2, we see that all the learned features improve upon the original ones. DML is very discriminative and achieves higher accuracy than Le Net. However, when the training set is very small, robustness becomes more important and Euc-DRT significantly outperforms DML. 5.3 Face Verification on LFW We now present face verification on the more challenging Labeled Faces in the Wild (LFW) benchmark, where our experiments will show that there is an advantage to balancing disciminability and robustness. Our goal is not to reproduce the success of deep learning in face verification [7, 14], but to stress the importance of robust training and to compare the proposed Euc-DRT objective with popular alternatives. Note also that it is difficult to compare with deep learning methods when training sets are proprietary [12 14]. We adopt the experimental framework used in [2], and train a deep network on the WDRef dataset, where each face is described using a high dimensional LBP feature [3] (available at 2) that is reduced to a 5000-dimensional feature using PCA. The WDRef dataset is significantly smaller than the proprietary datasets typical of deep learning, such as the 4.4 million labeled faces from 4030 individuals in [14], or the 202,599 labeled faces from 10,177 individuals in [12]. It contains 2,995 subjects with about 20 samples per subject. We compare the Euc-DRT objective with Deep Face (DF) [14] and Deep Metric Learning (DML) [7], two state-of-the-art deep learning objectives. For a fair comparison, we employ the same network structure and train on the same input data. Deep Face feeds the output of the last network layer to an L-way soft-max to generate a probability distribution over L classes, then minimizes a cross entropy loss. The Euc-DRT feature fα is implemented as a two-layer fully connected network with tanh as the squash function. Weight decay (conventional Frobenius norm regularization) is employed in both DF and DML, and results are only reported for the best weight decay factor. After a network is trained on WDRef, it is tested on the LFW benchmark. Verification simply consists of comparing the cosine distance between a given pair of faces to a threshold. Fig. 5 displays ROC curves and Table 4 reports area under the ROC curve (AUC) and verification accuracy. High-Dim LBP refers to verification using the initial LBP features. Deep Face (DF) optimizes for a classification objective by minimizing a softmax loss, and it successfully separates samples from different classes. However the constraint that assigns similar representations to the same class is weak, and this is reflected in the true positive rate displayed in Fig. 5. In Deep Metric Learning (DML) this same constraint is strong, but robustness is a concern when the training set is small. The proposed Euc-DRT improves upon both DF and DML by balancing disciminability and robustness. It is less conservative than DF for better discriminability, and more responsive to local geometry than DML for smaller generalization error. Face verification accuracy for Euc-DRT was obtained by varying the regularization parameter λ between 0.4 and 1 (as shown in Fig 6), then reporting the peak accuracy observed at λ = 0.9. 0 0.5 1 0.5 HD-LBP deep Face DML Euc-DRT Figure 5: Comparison of ROCs for all methods λ 0.4 0.6 0.8 1 verification accuracy (%) Figure 6: Verification accuracy of Euc-DRT as λ varies Table 4: Verification accuracy and AUCs on LFW Method Accuracy AUC (%) ( 10 2) HD-LBP 74.73 82.22 1.00 deep Face 88.72 95.50 0.29 DML 90.28 96.74 0.33 Euc-DRT 92.33 97.77 0.25 6 Conclusion We have proposed an optimization framework within which it is possible to tradeoff the discriminative value of learned features with robustness of the learning algorithm. Improvements to generalization error predicted by theory are observed in experiments on benchmark datasets. Future work will investigate how to initialize and tune the optimization, also how the Euc-DRT algorithm compares with other methods that reduce generalization error. 7 Acknowledgement The work of Huang and Calderbank was supported by AFOSR under FA 9550-13-1-0076 and by NGA under HM017713-1-0006. The work of Qiu and Sapiro is partially supported by NSF and Do D. 2http://home.ustc.edu.cn/chendong/ [1] A. Bellet and A. Habrard. Robustness and generalization for metric learning. Neurocomputing, 151(14):259 267, 2015. [2] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In European Conference on Computer Vision (ECCV), 2012. [3] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013. [4] K. Fukunaga. Introduction to Statistical Pattern Recognition. San Diego: Academic Press, 1990. [5] A. Globerson and S. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems (NIPS), 2005. [6] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004. [7] J. Hu, J. Lu, and Y. Tan. Discriminative deep metric learning for face verification in the wild. In Computer Vision and Pattern Recognition (CVPR), pages 1875 1882, 2014. [8] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 0749, University of Massachusetts, Amherst, October 2007. [9] J. Huang, Q. Qiu, R. Calderbank, and G. Sapiro. Geometry-aware deep transform. In International Conference on Computer Vision, 2015. [10] G. Sapiro Q. Qiu. Learning transformations for clustering and classification. Journal of Machine Learning Research (JMLR), pages 187 225, 2015. [11] C. Sumit, R. Hadsell, and Y. Le Cun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 539 546, 2005. [12] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems (NIPS), pages 1988 1996, 2014. [13] Y. Sun, X. Wang, and X. Tang. Deep learning face representation from predicting 10,000 classes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1891 1898, 2014. [14] Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf. Deepface: Closing the gap to humanlevel performance in face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1701 1708, 2014. [15] V. N. Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988 999, 1999. [16] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207 244, 2009. [17] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems (NIPS), 2002. [18] H. Xu and S. Mannor. Robustness and generalization. Machine Learning, 86(3):391 423, 2012. [19] Z. Zha, T. Mei, M. Wang, Z. Wang, and X. Hua. Robust distance metric learning with auxiliary knowledge. In International Joint Conference on Artificial Intelligence (IJCAI), 2009.