SEVEN: Deep Semi-supervised Verification Networks

Vahid Noroozi, Lei Zheng, Sara Bahaadini, Sihong Xie, Philip S. Yu
Department of Computer Science, University of Illinois at Chicago, IL, USA
Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL, USA
Computer Science and Engineering Department, Lehigh University, Bethlehem, PA, USA
{vnoroo2,lzheng21}@uic.edu, sara.bahaadini@u.northwestern.edu, sxie@cse.lehigh.edu, psyu@uic.edu

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Abstract

Verification determines whether two samples belong to the same class or not. It has important applications such as face and fingerprint verification, where thousands or millions of categories are present but each category has scarce labeled examples, presenting two major challenges for existing deep learning models. We propose a deep semi-supervised model named SEmi-supervised VErification Network (SEVEN) to address these challenges. The model consists of two complementary components. The generative component addresses the lack of supervision within each category by learning general salient structures from a large amount of data across categories. The discriminative component exploits the learned general features to mitigate the lack of supervision within categories, and also directs the generative component to find more informative structures of the whole data manifold. The two components are tied together in SEVEN to allow end-to-end training. Extensive experiments on four verification tasks demonstrate that SEVEN significantly outperforms other state-of-the-art deep semi-supervised techniques when labeled data are in short supply. Furthermore, SEVEN is competitive with fully supervised baselines trained with a larger amount of labeled data, indicating the importance of the generative component in SEVEN.
1 Introduction

Different from traditional classification tasks, the goal of verification tasks is to determine whether two samples belong to the same class or not, without predicting the class directly [Chopra et al., 2005]. Verification tasks arise in applications where thousands or millions of classes are present, with very few samples within each category (in some cases just one). For example, in face and signature verification, the faces and signatures of one person are considered to belong to one class. While there can be millions of persons in the database, very few examples are available for each person. In such applications, it is also necessary to handle new classes without retraining the model from scratch. It is not trivial to address such challenges with traditional classification techniques. Motivated by the impressive performance brought by deep networks to many machine learning tasks [LeCun et al., 2015; Bahaadini et al., 2017; Zheng et al., 2017], we pursue a deep learning model to improve existing verification models. However, deep networks require a large amount of labeled data for each class, which is not readily available in verification. There are semi-supervised training methods for deep networks that tap into the large amount of unlabeled data. These semi-supervised methods usually have separate learning stages [Sun et al., 2017; Nair and Hinton, 2010]: they first pre-train a model using unlabeled data and then fine-tune it with labeled data to fit the target task. Such two-phase methods are not suitable for verification. First, the large number of classes and the lack of data (be it labeled or unlabeled) within each category prohibit any form of within-class pre-training and fine-tuning.
Second, if we pool data from all categories for pre-training, the learned features are general but not specific to each category, and the later fine-tuning within each category may not be able to correct such bias due to the lack of labeled data. To address these challenges, we propose Deep SEmi-supervised VErification Networks (SEVEN), which consist of a generative and a discriminative component to learn general and category-specific representations from both unlabeled and labeled data simultaneously. We cross the category barrier and pool unlabeled data from all categories to learn salient structures of the data manifold. The hope is that, by tapping into the large amount of unlabeled data, the structures shared by all categories can be learned for verification. SEVEN then adapts the general structures to each category by attaching the generative component to the discriminative component, which uses the labeled data to learn category-specific features. In this sense, the generative component works as a regularizer for the discriminative component and helps exploit the information hidden in the unlabeled data. On the other hand, since the discriminative component depends on the structures learned by the generative component, it is desirable to inform the generative component about the subspace that is beneficial to the final verification task. To this end, instead of training the two components separately or sequentially, SEVEN trains them simultaneously, allowing the generative component to learn more informative general features. We evaluate SEVEN on four datasets and compare it to four state-of-the-art semi-supervised and supervised algorithms. Experimental results demonstrate that SEVEN outperforms all the baselines in terms of accuracy.
Furthermore, it is shown that with a very small amount of labeled examples, SEVEN reaches competitive performance with supervised baselines trained on a significantly larger set of labeled data. The rest of this paper is organized as follows. In Section 2 we give an overview of related work. In Section 3 we present SEVEN in detail. Section 4 gives the experimental evaluation and analysis of the proposed model, followed by the conclusion.

2 Related Work

SEVEN can serve as a metric learning algorithm, which is commonly employed in verification. The goal of metric learning is to learn a distance metric such that the samples in any negative pair are far apart and those in any positive pair are close. Many of the existing approaches [Sun et al., 2017; Oh Song et al., 2016; Zagoruyko and Komodakis, 2015; Hu et al., 2014] learn a linear or nonlinear transformation that maps the data to a new space where the distance metric satisfies the above requirements. However, these methods do not address the large number of categories with scarce supervision information. One of the earliest works in neural network-based verification was proposed by Bromley et al. for signature verification [Bromley et al., 1993]. The proposed architecture, named siamese networks, uses a contrastive objective function to learn a distance metric with ConvNets. Similar approaches have been employed for many other tasks such as face verification and re-identification [Koch, 2015; Sun et al., 2014; Wolf, 2014]. It is worth mentioning that all these works are supervised and do not exploit unlabeled data. Great interest in deep semi-supervised learning has emerged in applications where unlabeled data are abundant but obtaining labeled data is expensive or infeasible [Li et al., 2016; Hoffer and Ailon, 2016; Rasmus et al., 2015; Kingma et al., 2014; Lee, 2013]. However, most such approaches are designed for classification.
To the best of our knowledge, there exists no deep semi-supervised learning method that addresses the above two challenges in verification. A key difference between SEVEN and most previous semi-supervised deep networks lies in the way that unlabeled and labeled data are exploited. Lee [Lee, 2013] presented a semi-supervised approach for classification tasks called Pseudo-Label, based on a self-training scheme. It predicts the labels of unlabeled samples by training the model with the available labeled samples, and then bootstraps the model with the most confidently pseudo-labeled samples. This approach is prone to error because it may reinforce wrong predictions, especially in problems with low-confidence estimates. A more common semi-supervised approach is to pre-train a model with unlabeled samples and then fine-tune the learned model using the labeled samples. For example, [Nair and Hinton, 2010] pre-trained a Restricted Boltzmann Machine (RBM) with noisy rectified linear units (NReLU) in the hidden layers, and then used the learned weights to initialize and train a siamese network [Chopra et al., 2005] in a supervised way. One problem with pre-training-based approaches is that the supervised part of the algorithm can ignore or lose what the model has learned in the unsupervised step. Another problem is that they still need enough labeled examples for the fine-tuning step. Recently, some works have tried to alleviate these problems by learning from all the labeled and unlabeled data jointly for classification tasks [Li et al., 2016; Hoffer and Ailon, 2016; Maaløe et al., 2016; Rasmus et al., 2015]. They involve the unsupervised model in the learning as a regularizer for the supervised model.
Note that all such techniques are designed for classification tasks and cannot handle the settings described in the introduction, such as few samples per class and a large number of classes. Another line of work that handles a large number of categories is extreme multi-label learning [Xie and Philip, 2017]. The most common assumption there is that all classes have a sufficient amount of labeled data, which is clearly different from our problem setting. Recently, some methods have focused on predicting tail labels [Jain et al., 2016], but they are proposed for the traditional classification task and cannot handle new classes in the test data.

3 Proposed Model

3.1 Problem Formulation

The training set is represented as $X = \{(x_1^i, x_2^i)\}_{i=1}^{N}$, where $(x_1^i, x_2^i)$ is a pair of training samples, $x_j^i \in \mathbb{R}^m$, and $N = L + U$ is the total number of training pairs, consisting of $L$ labeled and $U$ unlabeled pairs. The label set, denoted by $Y = \{y_i \mid y_i \in \{pos, neg\}\}_{i=1}^{L}$, specifies the relation between the samples of each pair. A positive relation indicates that the two samples of the pair belong to the same class, and a negative relation indicates the opposite. The relations for the unlabeled pairs are unknown. Our goal is to learn a nonlinear function $r_{\theta_e}(x_1, x_2): \mathbb{R}^m \times \mathbb{R}^m \to \{pos, neg\}$, parameterized by $\theta_e$, that predicts the relation between two data samples $x_1$ and $x_2$. In other words, $r_{\theta_e}(x_1, x_2)$ verifies whether two samples are similar or not. We define $r_{\theta_e}(\cdot, \cdot)$ based on the distance between $x_1$ and $x_2$ estimated by a metric distance function:

$$r_{\theta_e}(x_1, x_2) = \begin{cases} neg & \text{if } d_{\theta_e}(x_1, x_2) > \tau \\ pos & \text{if } d_{\theta_e}(x_1, x_2) \leq \tau \end{cases} \qquad (1)$$

where $d_{\theta_e}(\cdot, \cdot)$ is the metric distance function and the threshold $\tau$ specifies the maximum distance that samples of a class are allowed to have. We define a nonlinear embedding function $f_{\theta_e}(\cdot)$ that projects the data to a new feature space, and $d_{\theta_e}(x_1, x_2) = \|f_{\theta_e}(x_1) - f_{\theta_e}(x_2)\|_2$ is the Euclidean distance between $x_1$ and $x_2$ in the new space.
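As a minimal illustration of the decision rule in Eq. (1), the sketch below uses a stand-in linear map for the embedding $f_{\theta_e}$ (in SEVEN this is a ConvNet); the weight matrix `W`, the helper names, and the threshold value are hypothetical, for illustration only.

```python
import math

def embed(x, W):
    # Stand-in for the nonlinear embedding f_theta_e; here just a linear
    # map given as a list of rows (illustrative, not the paper's ConvNet).
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def distance(x1, x2, W):
    # d_theta_e(x1, x2) = ||f(x1) - f(x2)||_2 in the embedded space
    z1, z2 = embed(x1, W), embed(x2, W)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))

def verify(x1, x2, W, tau=0.5):
    # Decision rule of Eq. (1): "pos" iff the embedded distance is <= tau
    return "pos" if distance(x1, x2, W) <= tau else "neg"

W = [[1.0, 0.0], [0.0, 1.0]]  # identity embedding, for illustration
print(verify([0.1, 0.2], [0.15, 0.25], W))  # nearby points -> pos
print(verify([0.0, 0.0], [2.0, 2.0], W))    # distant points -> neg
```

With the identity embedding the rule reduces to thresholding the raw Euclidean distance; the learned $f_{\theta_e}$ is what makes a single threshold $\tau$ meaningful across categories.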
An arbitrary distance function can also be used instead of the Euclidean distance.

3.2 Model Description

Our proposed model consists of a discriminative and a generative component, and learns a nonlinear function for each. For the discriminative component, the nonlinear embedding function $f_{\theta_e}(\cdot)$ is learned to yield a discriminative and informative representation: in a discriminative feature space, similar samples are mapped close to each other while dissimilar pairs are far apart. Such a property is crucial for a good metric function. The generative component of the model is designed to exploit the information hidden in the unlabeled data; the desired representation should preserve the salient structures shared by all categories as much as possible. We define a probabilistic framework of the problem along with the discriminative and generative modeling of our algorithm. The conditional probability distribution of the relation variable $y$ given the $i$th pair can be estimated as:

$$p(y_i = pos \mid x_1^i, x_2^i) = 1 - \tanh(d_{\theta_e}(x_1^i, x_2^i)) \qquad (2)$$

which can be rewritten as:

$$p(y_i = pos \mid x_1^i, x_2^i) = \frac{2}{1 + \exp(2\, d_{\theta_e}(x_1^i, x_2^i))} \qquad (3)$$

Here we use a tanh function to map the distance between samples to $[0, 1]$. However, any monotonically decreasing function $u(\cdot)$ with $u(0) = 1$ and $u(\infty) = 0$ can also be used for this purpose. We define $\bar{p}$ as the ground-truth distribution to be approximated by $p$ in (3): $\bar{p}(y_i = pos \mid x_1^i, x_2^i) = 1$ if $y_i = pos$, and $0$ otherwise. In the rest of the paper, the conditional distributions $\bar{p}(y_i \mid x_1^i, x_2^i)$ and $p(y_i \mid x_1^i, x_2^i)$ are denoted by $\bar{p}_i$ and $p_i$, respectively.
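Equations (2) and (3) are the same quantity written two ways. The short check below (our sketch, not the authors' code) confirms the identity $1 - \tanh(d) = 2/(1 + e^{2d})$ numerically, and that the mapping decreases from 1 toward 0 as the distance grows.

```python
import math

def p_same(d):
    # Eq. (2): p(y = pos | x1, x2) = 1 - tanh(d) for embedded distance d >= 0
    return 1.0 - math.tanh(d)

def p_same_alt(d):
    # Eq. (3): the same mapping rewritten as 2 / (1 + exp(2d))
    return 2.0 / (1.0 + math.exp(2.0 * d))

for d in [0.0, 0.3, 1.0, 5.0]:
    assert abs(p_same(d) - p_same_alt(d)) < 1e-12  # identical mappings
print(p_same(0.0))  # 1.0: zero distance means certainly the same class
```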
Due to the probabilistic nature of these distributions, we approximate $\bar{p}$ with $p$ by minimizing the Kullback-Leibler divergence between them, and introduce the following discriminative loss function $L_D(X, Y; \theta_e)$, defined over all the labeled pairs:

$$L_D(X, Y; \theta_e) = \sum_{i=1}^{L} l_d(x_1^i, x_2^i; \theta_e) = \sum_{i=1}^{L} KL(\bar{p}_i \,\|\, p_i) \qquad (4)$$

where $l_d(x_1^i, x_2^i; \theta_e)$ denotes the discriminative loss for the $i$th pair, and $KL(\bar{p}_i \,\|\, p_i)$ denotes the KL-divergence between $\bar{p}_i$ and $p_i$. $KL(\bar{p}_i \,\|\, p_i)$ can be written as $H(\bar{p}_i, p_i) - H(\bar{p}_i)$, where $H(\bar{p}_i)$ is the entropy of $\bar{p}_i$ and $H(\bar{p}_i, p_i)$ is the cross entropy between $\bar{p}_i$ and $p_i$. Since the loss function is optimized with a gradient-based approach and $H(\bar{p}_i)$ is constant with respect to the parameters, we simplify the discriminative loss function to:

$$l_d(x_1^i, x_2^i; \theta_e) = -\mathbb{I}\{y_i = pos\}\log(p_i) - \mathbb{I}\{y_i = neg\}\log(1 - p_i) \qquad (5)$$

where $\mathbb{I}\{\cdot\}$ is the indicator function. The loss function becomes equivalent to the cross entropy between $\bar{p}_i$ and $p_i$. It penalizes a large distance between samples from the same class and a small distance between samples from different classes, making the new space discriminative. $L_D$ attains its minimum when $p_i = \bar{p}_i$ over all the labeled pairs.

Figure 1: The overall architecture of SEVEN.

To compensate for the insufficiency of labeled data for the verification task, the generative modeling encourages the embedding function $f_{\theta_e}(\cdot)$ to learn the salient structures shared by all categories. We define a nonlinear function $g_{\theta_d}(\cdot)$, parameterized by $\theta_d$, that projects samples back from the representation obtained by $f_{\theta_e}(\cdot)$ to the original feature space. The generative loss for the $i$th pair $(x_1^i, x_2^i)$ is defined as the reconstruction error between the original input and the corresponding reconstructed output:

$$l_g(x_1^i, x_2^i; \theta_e, \theta_d) = \|g_{\theta_d}(f_{\theta_e}(x_1^i)) - x_1^i\|^2 + \|g_{\theta_d}(f_{\theta_e}(x_2^i)) - x_2^i\|^2 \qquad (6)$$

where $g_{\theta_d}(f_{\theta_e}(x_j^i))$ is the reconstruction of the input $x_j^i$ and is denoted by $\hat{x}_j^i$.
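The two per-pair losses can be sketched as below: `l_disc` implements the cross-entropy form of Eq. (5) with $p_i = 1 - \tanh(d)$, and `l_gen` the reconstruction error of Eq. (6). The function names are ours, and the reconstructions are passed in directly rather than produced by a decoder network.

```python
import math

def l_disc(d, y):
    # Eq. (5): cross-entropy between the ground-truth relation and
    # p = 1 - tanh(d); small for positive pairs with small d and for
    # negative pairs with large d.
    p = 1.0 - math.tanh(d)
    return -math.log(p) if y == "pos" else -math.log(1.0 - p)

def l_gen(x1, x2, x1_hat, x2_hat):
    # Eq. (6): squared reconstruction error of both samples of the pair;
    # in SEVEN, x_hat = g(f(x)) comes from the decoder subnetworks.
    sq_err = lambda x, xh: sum((a - b) ** 2 for a, b in zip(x, xh))
    return sq_err(x1, x1_hat) + sq_err(x2, x2_hat)

# A positive pair is cheap at small distance, a negative pair expensive.
print(l_disc(0.1, "pos"), l_disc(0.1, "neg"))
```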
The generative loss function $L_G$, over all pairs, both labeled and unlabeled, is defined as:

$$L_G(X; \theta_e, \theta_d) = \sum_{i=1}^{N} l_g(x_1^i, x_2^i; \theta_e, \theta_d) \qquad (7)$$

We combine the generative and discriminative components into a unified objective function and write the optimization problem of SEVEN as:

$$L(X, Y; \theta_e, \theta_d) = \sum_{i=1}^{L} l_d(x_1^i, x_2^i; \theta_e) + \alpha \sum_{i=1}^{N} l_g(x_1^i, x_2^i; \theta_e, \theta_d) + \beta\left(\|\theta_e\|^2 + \|\theta_d\|^2\right) \qquad (8)$$

where $\|\theta_e\|^2$ and $\|\theta_d\|^2$ are regularization terms on the parameters of the functions $f_{\theta_e}(\cdot)$ and $g_{\theta_d}(\cdot)$. The parameter $\beta$ controls the strength of this regularization, and the parameter $\alpha$ controls the trade-off between the discriminative and generative objectives.

3.3 Model Architecture and Optimization

We choose deep neural networks to parameterize $f_{\theta_e}(\cdot)$ and $g_{\theta_d}(\cdot)$. The schematic representation of SEVEN is illustrated in Figure 1. The input pair is given to two neural networks, denoted by $F_1$ and $F_2$, with shared parameters $\theta_e$. They represent the discriminative component of SEVEN (the nonlinear embedding function $f_{\theta_e}(\cdot)$) and project the input samples into the discriminative feature space. A layer, denoted by $d$, is added on top of the networks $F_1$ and $F_2$ to estimate the distance between the two samples of the input pair in the discriminative space; it can be considered the metric distance function $d_{\theta_e}(\cdot, \cdot)$ that $F_1$ and $F_2$ are supposed to learn. The final layers of $F_1$ and $F_2$ are connected to two other subnetworks, denoted by $G_1$ and $G_2$ in Figure 1, with shared parameters $\theta_d$. They model the generative component of SEVEN ($g_{\theta_d}(\cdot)$) and project the samples back to the original space; in other words, they can be considered decoders for the encoders $F_1$ and $F_2$. The outputs of $G_1$ and $G_2$, shown as $\hat{x}_1$ and $\hat{x}_2$, are the reconstructions of the corresponding inputs $x_1$ and $x_2$. Subnetworks $F_1$ and $F_2$ are ConvNets built with convolutional and max-pooling layers.
$G_1$ and $G_2$ are built with transposed convolutional and upsampling layers, which perform the reverse operations of convolutional and max-pooling layers, respectively. More details on the transposed convolutional layer can be found in [Dumoulin and Visin, 2016]. The complete specifications of the models are presented in Table 1. The whole model is trained using backpropagation with respect to the objective function in Equation (8). Given a set of $N$ pairs, we optimize the model with an adaptive version of gradient descent called RMSprop [Dauphin et al., 2015] over shuffled mini-batches. We apply l2-regularization and the dropout [Srivastava et al., 2014] strategy to the convolutional and fully connected layers of the subnetworks to prevent overfitting. Batch normalization [Ioffe and Szegedy, 2015] is also applied after each convolutional layer to normalize its output, which can improve performance in some cases. The training procedure of SEVEN is given in Algorithm 1.

Algorithm 1: Training procedure of SEVEN
Input: training set $X = \{(x_1^i, x_2^i)\}_{i=1}^{N}$, label set $Y = \{y_i\}_{i=1}^{L}$, number of iterations $T$, and batch size $m$.
Output: model parameters $\Theta$.
  $B = |X| / m$  // number of batches
  Randomly split the training set $X$ into $B$ batches
  for $t = 1, 2, \ldots, T$ do
    for $b = 1, 2, \ldots, B$ do
      Feedforward propagation of the $b$th batch
      Calculate $L_b$ according to Equation (8)
      Estimate the gradients $\partial L_b / \partial \Theta_t$ by backpropagation
      Calculate $\Theta_{t+1}$ using RMSProp
    end for
  end for
  return $\Theta_T$

4 Experiments

4.1 Datasets

We evaluate the proposed algorithm on the following four datasets.

MNIST: A dataset of 70000 grayscale images of handwritten digits from 0 to 9. We use the original split of 60000/10000 for the training and test sets. Uniform random noise in [0, 1] is added to each pixel to make the task noisier and more challenging.

US Postal Service (USPS) [Hull, 1994]: A dataset of 9298 handwritten digits automatically scanned from envelopes by the US Postal Service.
All images are normalized to 16×16 grayscale images. We randomly selected 85% of the images for the training set.

Labeled Faces in the Wild (LFW) [Huang et al., 2012]: A database of face images that contains 1100 positive and 1100 negative pairs in the training set, and 500 positive and 500 negative pairs in the test set. All images are resized to 64×48.

BiosecurID-SONOF (SONOF) [Galbally et al., 2015]: We use a subset of this dataset comprising signatures collected from 132 users, each with 16 signatures. Signature images are normalized and converted to 80×80 grayscale images. We randomly divided the users into 100/32 for training and testing.

In the SONOF and LFW datasets, the classes in the training and test samples are disjoint, while in MNIST and USPS the classes are shared between the training and test sets. The samples of LFW already come in the form of pairs. For the other datasets, we create the pairs by first splitting the samples into two distinct sets for training and testing. We split the training set randomly into labeled and unlabeled samples. Then, each sample is paired with two other randomly selected samples: one from the same class to form a positive pair, and one from a different class to form a negative pair.

4.2 Baselines

We compare the performance of SEVEN with the following baselines. Note that we cannot compare SEVEN with classification techniques, because they are usually not designed to handle new classes in the test data, which occur in verification applications. Since there are no other deep semi-supervised works for verification tasks, we adapt common deep semi-supervised techniques to verification networks as our baselines.

Discriminative Deep Metric Learning (DDML) [Hu et al., 2014]: A deep neural network that learns a set of hierarchical transformations to project pairs into a common space using a contrastive loss function.
It is a supervised approach and cannot use unlabeled data.

Pseudo-Label [Lee, 2013]: A semi-supervised approach for training deep neural networks. It initially trains a supervised model with the labeled samples. Before each iteration, it labels the unlabeled samples with the current model and uses the high-confidence ones, along with the labeled samples, for training in the next iteration. We follow the same approach to train a siamese network [Bell and Bala, 2015], extending Pseudo-Label to verification tasks.

Convolutional Autoencoder + Siamese Network (PreConvSia): We pre-train a siamese network [Bell and Bala, 2015; Chopra et al., 2005] with a convolutional autoencoder [Masci et al., 2011], then fine-tune the network with labeled pairs.

Table 1: Model specifications for different datasets. BN: batch normalization; ReLU: rectified linear activation function; Conv: convolutional layer; TransConv: transposed convolutional layer; Upsampling: upsampling layer; Dense Layer: fully connected layer; Max-pooling: max-pooling layer.

MNIST and USPS
  Network Fi (input, MNIST: 28×28, USPS: 16×16):
    3×3 Conv (8) → ReLU → 2×2 Max-pooling → Dropout(0.5) →
    5×5 Conv (8) → ReLU → 2×2 Max-pooling → Dropout(0.5) →
    Dense Layer (128×1) → ReLU → Dropout(0.5)
  Network Gi (input, 128×1):
    Dense Layer → ReLU → Reshape layer → 2×2 Upsampling →
    5×5 TransConv (8) → ReLU → Dropout(0.5) → 2×2 Upsampling →
    3×3 TransConv (1) → Sigmoid

LFW and SONOF
  Network Fi (input, LFW: 64×48, SONOF: 100×100):
    4×4 Conv (32) → BN-ReLU → Dropout(0.5) → 2×2 Max-pooling → Dropout(0.5) →
    3×3 Conv (64) → BN-ReLU → 2×2 Max-pooling → Dropout(0.5) →
    3×3 Conv (128) → BN-ReLU → Dropout(0.5) →
    Dense Layer (128×1) → ReLU → Dropout(0.5)
  Network Gi (input, 128×1):
    Dense Layer → ReLU → Reshape layer →
    3×3 TransConv (64) → BN-ReLU → Dropout(0.5) → 2×2 Upsampling →
    3×3 TransConv (32) → BN-ReLU → Dropout(0.5) → 2×2 Upsampling →
    3×3 TransConv (1) → BN-Sigmoid
It uses ConvNets as the underlying network.

Autoencoder + Siamese Network (PreAutoSia): Similar to PreConvSia, but uses an MLP as the underlying network. It is significantly faster to train than PreConvSia.

Principal Component Analysis (PCA): We use PCA as an unsupervised feature learning technique. The distance between samples in the new space learned by PCA indicates their relation. The threshold on the distance is selected for each dataset separately, based on the performance on the training data.

4.3 Experimental Settings

The architectures of SEVEN for all datasets are presented in Table 1. All the parameters of SEVEN, as well as those of the baselines, are selected by validation on a randomly selected 20% subset of the training data. The l2-regularization parameter β is selected from {1e-4, 1e-3, 0.01, 0.1} for each dataset separately. The parameter α, which controls the trade-off between the generative and discriminative objectives, is selected from {0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0}; it is set to 0.05, 0.1, 0.05, and 0.2 for MNIST, LFW, USPS, and SONOF, respectively. The parameter τ is set to 0.5 for all four datasets. All the neural network models are trained for 150 epochs. For the baselines that require pre-training, the pre-training is also performed for 150 epochs. The RMSProp optimizer is used to train all the neural networks, with the default value λ = 0.001 recommended in the original paper.

4.4 Performance Evaluation

We report performance in terms of accuracy, i.e., the number of test pairs verified correctly divided by the total number of test pairs. The performance of SEVEN and all baselines is presented in Tables 2 and 3. The results are reported for different numbers of labeled pairs, and the best accuracy for each case is shown in bold. The last column indicates the case where all labeled training pairs are used.
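The accuracy above (correctly verified pairs over all test pairs) can be sketched as follows; the function name and the example distances are illustrative, with τ = 0.5 as in the experiments.

```python
def verification_accuracy(distances, labels, tau=0.5):
    # Fraction of pairs whose thresholded distance (d <= tau -> "pos")
    # matches the ground-truth relation, as in Sec. 4.4.
    correct = sum(("pos" if d <= tau else "neg") == y
                  for d, y in zip(distances, labels))
    return correct / len(distances)

# Three of the four illustrative pairs are verified correctly.
print(verification_accuracy([0.2, 0.9, 0.4, 0.7],
                            ["pos", "neg", "neg", "neg"]))  # 0.75
```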
PCA is fully unsupervised, so a single accuracy is reported for each dataset. As can be seen from the tables, SEVEN outperforms the other baselines when a limited number of labeled pairs is used, and the differences in performance are more significant when the number of labeled pairs is lower; thus SEVEN better addresses the scarcity of labeled data. DDML gives good performance when enough labeled data are available, but its performance is significantly lower than SEVEN's in cases with few labeled samples: DDML does not use the unlabeled data, while the other baselines benefit from the information hidden in it. As the number of labeled pairs increases, the difference in accuracy decreases. SEVEN outperforms all the semi-supervised baselines. One of the main advantages of SEVEN over the other semi-supervised methods is that the latter perform the supervised step only after pre-training on unlabeled data has finished. The supervised phase may then cancel out some of the information learned from the unlabeled data; there is no guarantee that the supervised process benefits from the unsupervised learning [Rasmus et al., 2015]. Among the semi-supervised baselines, Pseudo-Label not only gives worse results than SEVEN, but also shows lower performance than PreConvSia and PreAutoSia in many cases. This can be attributed to the noise and error in estimating the labels of the unlabeled pairs.

4.5 Model Analysis

We perform several experiments to analyze the effect of the different components of SEVEN. The performance of different variants of SEVEN is given in Table 4; the number of labeled pairs for each dataset is indicated after the dataset name. DisSEVEN denotes SEVEN with α = 0 in Eq. (8), which disables the Gi networks and the generative aspect of the model; this variant does not consider the unlabeled data during learning. GenSEVEN corresponds to a model without the discriminative component.
In other words, it has no contrastive layer and does not use the label information. SEVEN denotes the full variant with both generative and discriminative components. The variant SEVEN (MLP) is similar to the regular SEVEN, except that it uses fully connected layers instead of convolutional and transposed convolutional layers. Among all the variants, the full SEVEN gives the best performance, which shows the effectiveness of both the generative and discriminative components and verifies the usefulness of the information hidden in the unlabeled data.

Table 2: Performance of different methods on LFW and SONOF in terms of accuracy.

LFW
# of labeled pairs   110    440    880    1320   1760   All
SEVEN                61.2   64.1   65.7   66.3   67.0   68.7
PCA                  -      -      -      -      -      64.5
DDML                 51.5   54.2   61.9   63.8   64.8   71.1
Pseudo-label         52.0   52.2   53.9   57.4   57.9   70.1
PreConvSia           55.1   62.3   63.5   63.2   64.2   66.0
PreAutoSia           51.1   62.9   63.0   63.5   64.2   66.1

SONOF
# of labeled pairs   160    320    640    960    1280   All
SEVEN                72.7   74.6   79.3   83.1   84.1   85.3
PCA                  -      -      -      -      -      67.61
DDML                 58.5   67.7   72.5   78.4   82.9   86.1
Pseudo-label         53.8   59.9   63.2   71.0   80.5   84.5
PreConvSia           61.9   67.1   70.4   71.5   78.8   82.1
PreAutoSia           57.2   62.7   66.4   70.1   73.1   79.0

Table 3: Performance of different methods on MNIST and USPS in terms of accuracy.

MNIST
# of labeled pairs   30     60     120    600    2400   All
SEVEN                75.5   76.9   79.8   84.8   90.7   96.8
PCA                  -      -      -      -      -      65.84
DDML                 61.1   65.9   75.7   84.0   90.4   96.8
Pseudo-label         59.8   67.9   76.8   83.2   89.3   95.2
PreConvSia           64.4   73.0   77.2   82.7   90.8   97.2
PreAutoSia           61.5   68.0   71.9   78.9   84.7   93.1

USPS
# of labeled pairs   40     80     160    300    800    All
SEVEN                76.2   77.3   80.2   80.7   82.8   93.1
PCA                  -      -      -      -      -      70.96
DDML                 69.0   71.8   75.7   75.9   80.8   92.7
Pseudo-label         70.1   57.4   57.9   77.2   78.3   93.3
PreConvSia           72.2   78.2   77.6   78.1   82.9   93.0
PreAutoSia           70.6   73.9   69.0   75.0   82.0   90.2

Figure 2: The accuracy of SEVEN for different values of parameter α for (a) MNIST, (b) LFW, (c) USPS and (d) SONOF.
The results also show that the discriminative component has a greater impact than the generative component. SEVEN (MLP) gives weaker performance than SEVEN, mainly because of the capability of convolutional layers in modeling image data, as ConvNets have also demonstrated in image processing applications.

Table 4: Performance of SEVEN variants in accuracy.

Method        MNIST (120)   LFW (440)   USPS (80)   SONOF (320)
DisSEVEN      75.7          54.2        73.9        70.3
GenSEVEN      73.0          58.2        60.0        62.5
SEVEN         79.8          64.1        77.3        74.6
SEVEN (MLP)   73.1          60.0        77.3        70.9

4.6 Parameter Sensitivity

We analyze the effect of the parameter α in Equation (8) on the performance of SEVEN on all four datasets. The parameter α controls the trade-off between the generative and discriminative aspects of the model. Figure 2 plots the performance of SEVEN for different values of α; for each dataset, the performance is plotted for three different values of L (the number of labeled pairs). There is a trade-off between the generative and discriminative aspects of SEVEN on all four datasets. As can be seen, the optimal value of this parameter depends on the dataset and, to some extent, on the ratio of labeled data.

5 Conclusion

Benefiting from the salient structures hidden in the unlabeled data and the ability of deep neural networks in nonlinear function approximation, we propose a deep SEmi-supervised VErification Network (SEVEN) for verification tasks. SEVEN benefits from both generative and discriminative modeling in a unified model. The two components are trained simultaneously, which leads them to closely interact and influence each other. Extensive experiments demonstrate that SEVEN outperforms other state-of-the-art deep semi-supervised techniques in a wide spectrum of verification tasks.
Furthermore, SEVEN shows competitive performance compared with fully supervised baselines that require a significantly larger amount of labeled data, indicating the important role of the generative component in SEVEN.

References

[Bahaadini et al., 2017] Sara Bahaadini, Neda Rohani, Scott Coughlin, Michael Zevin, Vicky Kalogera, and Aggelos K Katsaggelos. Deep multi-view models for glitch classification. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017.
[Bell and Bala, 2015] Sean Bell and Kavita Bala. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG), 34(4):98, 2015.
[Bromley et al., 1993] Jane Bromley, James W Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669-688, 1993.
[Chopra et al., 2005] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pages 539-546. IEEE, 2005.
[Dauphin et al., 2015] Yann Dauphin, Harm de Vries, and Yoshua Bengio. Equilibrated adaptive learning rates for non-convex optimization. In Advances in Neural Information Processing Systems, pages 1504-1512, 2015.
[Dumoulin and Visin, 2016] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.
[Galbally et al., 2015] Javier Galbally, Moises Diaz-Cabrera, Miguel A Ferrer, Marta Gomez-Barrero, Aythami Morales, and Julian Fierrez. On-line signature recognition through the combination of real dynamic data and synthetically generated static data.
Pattern Recognition, 48(9):2921–2934, 2015.
[Hoffer and Ailon, 2016] Elad Hoffer and Nir Ailon. Semi-supervised deep learning by metric embedding. arXiv preprint arXiv:1611.01449, 2016.
[Hu et al., 2014] Junlin Hu, Jiwen Lu, and Yap-Peng Tan. Discriminative deep metric learning for face verification in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1875–1882, 2014.
[Huang et al., 2012] Gary Huang, Marwan Mattar, Honglak Lee, and Erik G Learned-Miller. Learning to align from scratch. In Advances in Neural Information Processing Systems, pages 764–772, 2012.
[Hull, 1994] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.
[Ioffe and Szegedy, 2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[Jain et al., 2016] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2016.
[Kingma et al., 2014] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
[Koch, 2015] Gregory Koch. Siamese neural networks for one-shot image recognition. PhD thesis, University of Toronto, 2015.
[Le Cun et al., 2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[Lee, 2013] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning, volume 3, page 2, 2013.
[Li et al., 2016] Yawei Li, Lizuo Jin, AK Qin, Changyin Sun, Yew Soon Ong, and Tong Cui.
Semi-supervised auto-encoder based on manifold learning. In International Joint Conference on Neural Networks (IJCNN), pages 4032–4039, 2016.
[Maaløe et al., 2016] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. In International Conference on Machine Learning (ICML), pages 1445–1453, 2016.
[Masci et al., 2011] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. In Artificial Neural Networks and Machine Learning (ICANN), pages 52–59, 2011.
[Nair and Hinton, 2010] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), pages 807–814, 2010.
[Oh Song et al., 2016] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4004–4012, 2016.
[Rasmus et al., 2015] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3546–3554, 2015.
[Srivastava et al., 2014] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[Sun et al., 2014] Yi Sun, Yuheng Chen, Xiaogang Wang, and Xiaoou Tang. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, pages 1988–1996, 2014.
[Sun et al., 2017] Yao Sun, Lejian Ren, Zhen Wei, Bin Liu, Yanlong Zhai, and Si Liu. A weakly-supervised method for makeup-invariant face verification. Pattern Recognition, 2017.
[Wolf, 2014] Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[Xie and Philip, 2017] Sihong Xie and S Yu Philip. Active zero-shot learning: A novel approach to extreme multi-labeled classification. International Journal of Data Science and Analytics, pages 1–10, 2017.
[Zagoruyko and Komodakis, 2015] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4353–4361, 2015.
[Zheng et al., 2017] Lei Zheng, Vahid Noroozi, and Philip S Yu. Joint deep modeling of users and items using reviews for recommendation. In 10th ACM International Conference on Web Search and Data Mining (WSDM), pages 425–434. ACM, 2017.