# Image Matching via Loopy RNN

Donghao Luo, Bingbing Ni, Yichao Yan, Xiaokang Yang
Shanghai Jiao Tong University
{luo-donghao, nibingbing, yanyichao, xkyang}@sjtu.edu.cn

## Abstract

Most existing matching algorithms are one-off algorithms, i.e., they usually measure the distance between the two image feature representation vectors only once. In contrast, the human vision system accomplishes image matching by recursively looking at specific/related parts of both images and then making the final judgement. Towards this end, we propose a novel loopy recurrent neural network (Loopy RNN), which is capable of aggregating relationship information of two input images in a progressive/iterative manner and outputting the consolidated matching score in the final iteration. A Loopy RNN has two unique features. First, built on conventional long short-term memory (LSTM) nodes, it links the output gate of the tail node to the input gate of the head node, which yields the symmetry property required for matching. Second, a monotonous loss designed for the proposed network guarantees increasing confidence during the recursive matching process. Extensive experiments on several image matching benchmarks demonstrate the great potential of the proposed method.

## 1 Introduction

Image matching is a very important research topic in computer vision, due to its great potential in a wide range of real-world tasks including object/place retrieval [Arandjelovic et al., 2016], person re-identification [Yan et al., 2016b], 3D reconstruction [Cheng et al., 2014], etc. Mathematically, a matching algorithm takes two images as inputs and outputs a score measuring the similarity of the two inputs, i.e., a higher score indicates higher similarity between the two inputs.

Previous research has mainly focused on two aspects. On one hand, various image patch descriptors such as SIFT [Lowe, 2004], SURF [Bay et al., 2006], ORB [Rublee et al., 2011], etc., have been proposed to represent the two patches well, so that the computed distance (e.g., Euclidean distance) accurately reflects the true relationship between them. On the other hand, metric learning based methods [Jia and Darrell, 2011; Jain et al., 2012] have been developed to achieve a more discriminative distance measure, which is superior to the conventional Euclidean distance.

*Figure 1: Difference between one-off matching and recursive matching ("Match or not?"): (A) a single glimpse of two images versus (B) observing the two images back and forth (Western Gull vs. Ring-billed Gull).*

Recently, deep learning has made further significant progress in image matching on both aspects. For feature representation, SIFT based patch descriptors have been replaced with convolutional neural network (CNN) based ones [Fischer et al., 2014; Paulin et al., 2015], with a significant performance gain. For distance metric learning, end-to-end learning infrastructure has been utilized to enhance image matching. One remarkable example is the Siamese network [Bromley et al., 1993], in which two image patches are first input to a two-stream convolutional sub-network (with identical parameters) to extract features, which are then combined by a second sub-network (based on fully connected layers) to infer the similarity of the two image patches. The Siamese network has been widely used in many areas of computer vision including person re-identification [Yi et al., 2014] and tracking [Bertinetto et al., 2016].
Based on the end-to-end learning capability provided by the Siamese structure, MatchNet [Han et al., 2015] and the work of [Zagoruyko and Komodakis, 2015] have recently boosted patch-based image matching performance.

Despite their remarkable improvements, all previous matching methods can be regarded as one-off solutions, i.e., most algorithms perform image patch feature extraction and distance calculation only once and output the final matching score. However, the human vision system performs the matching process in a rather recursive/iterative manner. To judge whether two images show the same species, human attention constantly switches between the two images and moves to different patches/parts of the images. In other words, one takes turns observing different regions of both images to progressively aggregate information on the matched/un-matched portions of both images, becoming more and more confident. This process repeats until one is confident enough to make the final judgement. As shown in Figure 1, it is difficult to tell whether the two birds belong to the same subspecies by observing the two images only once. In this situation, a human alternately observes the two images, with each observation focused on some parts of the birds, such as the head and legs, and finally makes a confident decision based on both global and local information.

From a computational point of view, the above recursive/iterative matching mechanism also has advantages over the conventional one-off approach, as it can progressively attend to more and more discriminative regions of the images and avoid the issues of cluttered background or irrelevant and noisy image features. It is thus desirable to develop a computational model or network structure that simulates this recursive mechanism to enhance image matching. Towards this end, it is natural to consider recurrent neural networks (e.g., RNN [Dorffner, 1996], LSTM [Hochreiter and Schmidhuber, 1997]). Intuitively, the attentive regions/patches on both images can be considered a sequence of observations. This observation sequence could naturally serve as the input sequence to an RNN/LSTM structure, and the aggregated similarity measure could be output from the last (temporal) node of the recurrent network. However, such a sequential model cannot be directly applied to image matching, as it violates the symmetry property required of a valid matching algorithm: the output similarity measurement should be unchanged if we switch the order of the two input image patches.

To satisfy this symmetry requirement, we propose a loopy recurrent neural network (Loopy RNN). A Loopy RNN inherits the basic components and structure of a conventional RNN. The major differences between a Loopy RNN structure and a conventional one are that: 1) instead of having an arbitrary number of temporal nodes, it only has two, which correspond to the two input image patches, and 2) these two nodes are cyclically linked, which yields the symmetry property. When applied to image patch matching, the Loopy RNN can simulate the iterative process of examining image features from both images alternately, progressively gathering more and more matched information to consolidate the final matching score. To facilitate model training and testing, an approximation of the Loopy RNN structure by a normal RNN/LSTM structure is developed by duplicating the head and tail nodes a number of times.
To simulate human perception as well as to guarantee robust matching, the confidence of the similarity measurement should increase as we go deeper in the recursive matching network. For this purpose, we utilize a monotonous objective function [Ma et al., 2016], which imposes more penalty on the output associated with deeper nodes in the network. The proposed Loopy RNN has been evaluated on several image matching benchmarks including the UBC patch dataset [Winder et al., 2009] and the Mikolajczyk dataset, and the results demonstrate a performance gain over Siamese-like networks.

## 2 Related Work

Image matching comprises two key components: extracting proper features from the original image, and measuring the distance between the features to describe the similarity of the images. Initially, hand-crafted features such as SIFT [Lowe, 2004] and DAISY [Tola et al., 2008] were combined with fixed metric methods such as graph models [Yan et al., 2016a; 2015a; 2015b] to match images. With such methods the two parts of matching (feature extraction and similarity measurement) are independent, and this isolation hinders performance. To break the isolation, researchers proposed learning either the descriptors or the similarity metric while keeping the other part fixed. For example, [Brown et al., 2011] learns descriptors by minimizing the classification error, and [Jain et al., 2012] learns the metric by treating it as a linear transformation. Learning descriptors and metric jointly was then proposed to make the cooperation of the two parts more powerful: in [Trzcinski et al., 2012], the boosting trick is adopted to learn descriptors and metrics, achieving strong performance. The performance of all these methods is limited by the hand-crafted features, while the proposed method employs deep features and achieves better performance.

The advent of CNNs has tremendously advanced many branches of computer vision, including image matching. In [Krizhevsky et al., 2012], the convolutional descriptor from AlexNet (trained on ImageNet) was shown to be more effective than SIFT in most cases. In [Ren et al., 2017], a CNN was also used to compute dense correspondence. Combining a CNN with the Siamese structure [Bromley et al., 1993], it is natural to train the network end-to-end, i.e., to learn descriptor and metric jointly. MatchNet [Han et al., 2015] employs a Siamese network in which convolutional layers act as a feature extractor and fully connected layers as a comparator to measure similarity. [Zagoruyko and Komodakis, 2015] explores different architectures for patch-based image matching, including Siamese (shared CNN parameters), Pseudo-Siamese (unshared CNN parameters), and 2-channel (treating the two patches as two channels of one image). By virtue of CNNs, these methods obtain large improvements over previous traditional methods.

All the methods mentioned above can be regarded as one-off, i.e., descriptors are compared just once. The Loopy RNN proposed in this paper learns the descriptors and metric jointly like a Siamese network, but draws its conclusion by repeatedly comparing the descriptors. Note that Shyam et al. published a paper based on a similar idea, called Attentive Recurrent Comparators [Shyam et al., 2017]. We recommend readers also refer to this contemporary work.
## 3 Methodology

The goal of this work is to develop a recursive/iterative matching framework that imitates the matching mechanism of human perception. To this end, we propose a loopy recurrent neural network (Loopy RNN) which not only aggregates individual matching attempts and progressively yields increasingly confident matching results, but also preserves the symmetry property.

### 3.1 Loopy Recursive Neural Network

*Figure 2: A diagram of the proposed Loopy RNN (two-node structure).*

**Architecture.** The basic structure of a loopy recurrent neural network (Loopy RNN) is illustrated in Figure 2. Our Loopy RNN consists of two parameter-sharing sub-networks, each composed of two recurrent nodes. In this work, we adopt a standard long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997] node with input/output/forget/hidden cells as the atomic node of a Loopy RNN. We denote the two sub-networks as LSTM and LSTM′; the two nodes of LSTM are n1, n2, and the two nodes of LSTM′ are n′1, n′2. As illustrated in Figure 2, the difference between a sub-network and a normal RNN architecture lies in the connection between nodes: a normal RNN is a linear structure, while a Loopy RNN is a circular structure with the output of the tail node linked back to its head node.

We denote by x_A, x_B a pair of inputs, i.e., features of image patches. h_{A(B)} and o_{A(B)} are the hidden state and output corresponding to input image A (or B). Each LSTM node contains three gates (input gate i, output gate o, and forget gate f) as well as a memory cell c. The LSTM nodes in our loopy network are updated as follows:

$$
\begin{aligned}
i_{A(B)} &= \sigma(W_i x_{A(B)} + U_i h_{B(A)} + V_i c_{B(A)} + b_i),\\
f_{A(B)} &= \sigma(W_f x_{A(B)} + U_f h_{B(A)} + V_f c_{B(A)} + b_f),\\
o_{A(B)} &= \sigma(W_o x_{A(B)} + U_o h_{B(A)} + V_o c_{A(B)} + b_o),\\
c_{A(B)} &= f_{A(B)} \odot c_{B(A)} + i_{A(B)} \odot \tanh(W_c x_{A(B)} + U_c h_{B(A)} + b_c),\\
h_{A(B)} &= o_{A(B)} \odot \tanh(c_{A(B)}),
\end{aligned}
\tag{1}
$$

where σ is the sigmoid function and ⊙ denotes the element-wise multiplication operator. W, U and V are the weight matrices, and b are the bias vectors. The memory cell c_{A(B)} is a weighted sum of the previous memory cell c_{B(A)} and a function of the current input.

**Proof of Symmetry.** Because of the dual structure, it is straightforward to prove the symmetry of the proposed loopy network. Let F(x_A, x_B) and F(x_B, x_A) denote the final output hidden states for the input orders (x_A, x_B) and (x_B, x_A) respectively, i.e., the average of the first nodes' outputs. The symmetry property of matching requires that F(x_A, x_B) = F(x_B, x_A). As shown in Figure 2, the hidden state of node n1 with input x_A is denoted h_{n1,A}, and h_{n′1,B} is the hidden state of node n′1 with input x_B. F(x_A, x_B) is the final hidden state used to determine the similarity for the input order (x_A, x_B). F(x_A, x_B) and F(x_B, x_A) are determined as follows:

$$
F(x_A, x_B) = \left(h_{n_1,A} + h_{n'_1,B}\right)/2, \qquad
F(x_B, x_A) = \left(h_{n_1,B} + h_{n'_1,A}\right)/2.
\tag{2}
$$

LSTM and LSTM′ share parameters, therefore

$$
h_{n_1,A} = h_{n'_1,A}, \qquad h_{n_1,B} = h_{n'_1,B},
\tag{3}
$$

and from Equations 2 and 3,

$$
F(x_A, x_B) = F(x_B, x_A).
\tag{4}
$$

Thus the proposed Loopy RNN structure possesses the symmetry property.
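To make the update in Equation 1 and the symmetry argument concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code; the parameter shapes, the unrolling depth `n_loops`, and all names are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loopy_lstm_step(x, h_other, c_other, params):
    """One cross-coupled update from Eq. 1: the gates of this node are
    driven by the hidden/cell state produced for the *other* input."""
    W, U, V, b = params  # dicts keyed by gate: 'i', 'f', 'o', 'c' (no V_c in Eq. 1)
    i = sigmoid(W['i'] @ x + U['i'] @ h_other + V['i'] @ c_other + b['i'])
    f = sigmoid(W['f'] @ x + U['f'] @ h_other + V['f'] @ c_other + b['f'])
    c = f * c_other + i * np.tanh(W['c'] @ x + U['c'] @ h_other + b['c'])
    o = sigmoid(W['o'] @ x + U['o'] @ h_other + V['o'] @ c + b['o'])  # peeks at new cell
    return o * np.tanh(c), c

def first_node_state(x_first, x_second, params, dim, n_loops=3):
    """Unroll the two-node loop n_loops times and return the hidden
    state of the first node on the last pass (h_{n1} in the text)."""
    h = c = np.zeros(dim)
    for _ in range(n_loops):
        h_first, c = loopy_lstm_step(x_first, h, c, params)
        h, c = loopy_lstm_step(x_second, h_first, c, params)
    return h_first

# Random toy parameters, just to exercise the equations.
rng = np.random.default_rng(0)
d_in, D = 8, 4
params = ({g: rng.normal(size=(D, d_in)) for g in 'ifoc'},   # W
          {g: rng.normal(size=(D, D)) for g in 'ifoc'},      # U
          {g: rng.normal(size=(D, D)) for g in 'ifo'},       # V
          {g: rng.normal(size=D) for g in 'ifoc'})           # b
xA, xB = rng.normal(size=d_in), rng.normal(size=d_in)

# Eq. 2: average the first-node states of the two parameter-sharing
# sub-networks. Swapping the inputs only swaps the two summands (Eq. 3),
# so the average is unchanged (Eq. 4).
F_AB = (first_node_state(xA, xB, params, D) + first_node_state(xB, xA, params, D)) / 2
F_BA = (first_node_state(xB, xA, params, D) + first_node_state(xA, xB, params, D)) / 2
assert np.allclose(F_AB, F_BA)
```

Because the two sub-networks share parameters, exchanging (x_A, x_B) merely exchanges the two terms of the average, which is exactly the content of Equations 3 and 4.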
### 3.2 Loopy RNN for Image Matching

To facilitate image (patch) matching, additional network components are added to the basic Loopy RNN structure. First, a feature extraction sub-network, i.e., a CNN, is utilized to map the original image patch to a learned feature space whose output is fed to the core Loopy RNN structure (denoted FeatureNet in the rest of this paper). Second, as it is generally not feasible to train/test a loopy structure directly, we develop a simple yet effective structural approximation that converts a Loopy RNN into a conventional LSTM network for measuring the similarity of a pair of features (denoted MetricNet). Details are given below.

**FeatureNet.** We adopt the feature network of MatchNet [Han et al., 2015] as the prototype for extracting deep features, named FeatureNet. The structure of FeatureNet is adapted from AlexNet [Krizhevsky et al., 2012] and is illustrated in Figure 3. Note that the input patch size of FeatureNet is 64×64 and the output feature dimension (which is connected to the core Loopy RNN structure) is 4096. For simplicity, dropout and local response normalization layers are omitted, since our input is an image patch rather than an entire image (which would be more complicated). Note that the feature extraction sub-networks for the different input image patches share the same parameters.

**MetricNet.** Although the two-node Loopy RNN has a very simple structure, both forward and backward (training) computation for such a loopy structure are infeasible. It is therefore necessary to develop a simpler network structure that mimics the recursive nature of the Loopy RNN while remaining easy to train and test. We observe that for a two-node Loopy RNN, the sequence of inputs can be regarded as an infinite repetition of both inputs: for two image patches I_A and I_B with descriptors x_A and x_B (the output feature vectors from FeatureNet), the input to the two-node Loopy RNN is the alternating sequence x_A, x_B, x_A, ....

*Figure 3: The architecture of our network (input patches, FeatureNet with convolution and pooling layers, a random sampler, and the pair-wise similarity score of the right category) and the proposed monotonous loss (cross-entropy loss + monotonous loss). The right-hand side shows two examples which satisfy (left) or violate (right) our monotonous cost.*

Motivated by this observation, we approximate the Loopy RNN matching network with a conventional RNN/LSTM structure whose input is a finite sequence of K repeated (x_A, x_B) patterns, as shown in Figure 3. This breaks the loopy structure into a standard RNN/LSTM structure, to which off-the-shelf training algorithms can be applied directly. The proposed approximation has two advantages: 1) as information aggregates quickly through recursion, the output of the network becomes stable after just a few repeats, i.e., K can be small, which makes model training and inference very efficient; 2) if we regard each time step in the approximated LSTM network as an attention switch to one of the images, the approximation naturally simulates the human recursive matching process, i.e., iteratively switching attention between the two input images and making deeper and deeper comparisons. Corresponding to the input sequence, our approximate LSTM network outputs a sequence of pair-wise matching features h_0, h_1, h_2, ..., h_{n-1}.
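The paper's models are trained in Caffe; purely to illustrate the unrolled approximation, the following PyTorch sketch uses the paper's N = 10 steps and D = 1024 (the module names, the random stand-in descriptors, and the choice of `nn.LSTM`/`nn.Linear` are our assumptions; the scoring layer corresponds to Equation 5 introduced next):

```python
import torch
import torch.nn as nn

feat_dim, D, N = 4096, 1024, 10   # FeatureNet output dim, LSTM dim, number of nodes

lstm = nn.LSTM(input_size=feat_dim, hidden_size=D, batch_first=True)
score_head = nn.Linear(D, 1)      # per-step scoring layer (Eq. 5)

# Stand-ins for FeatureNet descriptors of patches I_A and I_B.
xA = torch.randn(1, feat_dim)
xB = torch.randn(1, feat_dim)

# Break the loop: feed the finite alternating sequence xA, xB, xA, ...
seq = torch.stack([xA if t % 2 == 0 else xB for t in range(N)], dim=1)
h, _ = lstm(seq)                                   # (1, N, D): h_0 ... h_{N-1}
scores = torch.sigmoid(score_head(h)).squeeze(-1)  # (1, N): s_0 ... s_{N-1}
scores = scores[:, 1:]  # drop s_0: the first step has seen only one patch
```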
Each pair-wise feature vector h_n encodes the similarity/comparison information aggregated from the repeated inputs from both images up to time step n. In contrast to previous matching algorithms, which output only a scalar similarity value, our model outputs a feature vector encoding the similarity relationship between the images, which conveys richer information and is more flexible for post-processing (in cases where late fusion is preferred). These output similarity feature vectors are sent to a softmax layer to determine the final similarity score, i.e., the final output of our network is a sequence of scores s_0, s_1, s_2, ..., s_{n-1} describing the similarity of the two image patches. s_n is computed as follows:

$$
s_n = \frac{1}{1 + e^{-\theta^\top h_n}}, \quad n \neq 0,
\tag{5}
$$

where θ is the set of softmax layer parameters. Each s_n lies in the interval [0, 1], and a greater value indicates higher similarity. In fact, the first pair-wise feature h_0 and first score s_0 are discarded, as shown in Figure 3, because the first node contains only single-patch information rather than pair-wise information.

### 3.3 Monotonous Loss

As the network's attention iteratively switches between the two input image patches and more and more comparisons are made, the similarity measurement should become more and more confident. In other words, the similarity score of the correct category should be monotonically non-decreasing as information propagates deeper along the proposed matching network, as illustrated in Figure 3. However, a plain cross-entropy loss does not enforce such a monotonous non-decreasing property. We therefore use a monotonous loss [Ma et al., 2016], which extends the conventional cross-entropy loss to enforce increasing prediction accuracy as the matching process goes deeper. Mathematically, this loss is expressed as:

$$
L^m_n = \max\!\left(0,\ (-1)^y \left(s_n - s^{pre}_n\right)\right), \quad n \neq 0,
\tag{6}
$$

$$
s^{pre}_n =
\begin{cases}
\max(s_1, s_2, \ldots, s_{n-1}), & y = 1,\\
\min(s_1, s_2, \ldots, s_{n-1}), & y = 0.
\end{cases}
\tag{7}
$$

Here, L^m_n denotes the monotonous loss at time step n, which penalizes the corresponding node if its output similarity score violates the monotonous rule. y is the ground truth label, i.e., 1 for matched and 0 for un-matched. s_n is the predicted similarity score at time step n, and s^{pre}_n is the maximum (y = 1) or minimum (y = 0) prediction score up to time step n−1. The max operation in Equation 6 picks out the nodes that violate the monotonous rule. Denoting by L^c_n the standard cross-entropy loss, the overall loss is:

$$
L^c_n = -\left(y \log(s_n) + (1 - y)\log(1 - s_n)\right), \qquad
L_n = L^c_n + \lambda L^m_n, \quad n \neq 0,
\tag{8}
$$

where λ is a weighting factor balancing the two losses.
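As a plain-Python sketch of Equations 6–8 (our reading: per-step losses are summed over all pair-wise steps, which the paper does not state explicitly; the input list is assumed to have s_0 already dropped):

```python
import math

def loopy_loss(scores, y, lam=0.4):
    """scores: per-step similarities [s_1, ..., s_{N-1}] with s_0 dropped;
    y: ground-truth label (1 matched, 0 un-matched);
    lam: weight of the monotonous term (the paper sets lambda = 0.4)."""
    total = 0.0
    for n, s in enumerate(scores):
        # Eq. 8: standard cross-entropy at this step.
        ce = -(y * math.log(s) + (1 - y) * math.log(1 - s))
        mono = 0.0
        if n > 0:
            # Eq. 7: running best score so far (max if matched, min if not).
            prev = max(scores[:n]) if y == 1 else min(scores[:n])
            # Eq. 6: penalize only steps that move against the label.
            mono = max(0.0, (-1) ** y * (s - prev))
        total += ce + lam * mono
    return total

# A matched pair whose score dips at the third step incurs a monotonous penalty:
print(loopy_loss([0.55, 0.70, 0.60, 0.80], y=1))
```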
### 3.4 Implementation Details

**Data Preparation.** Data imbalance is a common problem in patch-based image matching, because the number of positive pairs is far smaller than the number of negative pairs. A sampler that generates equal numbers of positive and negative pairs in each mini-batch is employed to prevent excessive bias toward negative pairs. To improve generalization, we augment the dataset by flipping each original patch vertically and horizontally and rotating it by 90, 180, and 270 degrees. Following previous work [Han et al., 2015], we map each pixel value x (in [0, 255]) to (x − 128)/160.

**Network Parameters and Training.** The details of FeatureNet are listed in Table 1.

| Name | Type | KS | S | OD |
|-------|------|-----|---|----------|
| Conv0 | C | 7×7 | 1 | 64×64×24 |
| Pool0 | MP | 3×3 | 2 | 32×32×24 |
| Conv1 | C | 5×5 | 1 | 32×32×64 |
| Pool1 | MP | 3×3 | 2 | 16×16×64 |
| Conv2 | C | 3×3 | 1 | 16×16×96 |
| Conv3 | C | 3×3 | 1 | 16×16×96 |
| Conv4 | C | 3×3 | 1 | 16×16×64 |
| Pool4 | MP | 3×3 | 2 | 8×8×64 |

*Table 1: Details of the FeatureNet architecture. C: convolutional layer. MP: max-pooling layer. KS: kernel size. S: stride. OD: output dimension of the feature map (width × height × depth).*

For MetricNet, three key factors influence the performance of the Loopy RNN model: 1) the weighting factor of the monotonous loss, λ; 2) the number of RNN nodes, N (N ∈ {6, 8, 10, 12}); and 3) the output dimension of the LSTM node, D (D ∈ {512, 1024, 1536, 2048}). Our models are trained in Caffe [Jia et al., 2014] and optimized by stochastic gradient descent (SGD) with batch size 32. The learning rate is set to 0.01 initially and decreased once every 1000 iterations. Our model converges to a steady state after about 70 epochs.

## 4 Experiments

We evaluate our Loopy RNN network on two datasets: the UBC dataset [Winder et al., 2009] and the Mikolajczyk dataset [Brown et al., 2011]. Extensive experimental evaluations and in-depth analysis of the proposed method are presented in this section.

### 4.1 Dataset and Evaluation Metric

**UBC Dataset.** UBC includes three subsets: Liberty, Notredame and Yosemite, containing 450k, 468k, and 634k image patches respectively. For the three subsets, 100k, 200k, and 500k pre-generated pairs are provided, with equal numbers of positive and negative pairs. Each patch has a fixed size of 64×64, which matches the input dimension of our FeatureNet. We follow the standard evaluation protocol [Brown et al., 2011], i.e., the model is trained in turn on each subset and tested on the other two; FPR95 (false positive rate at 95% recall) is adopted as the evaluation metric, the lower the better.

**Mikolajczyk Dataset.** The Mikolajczyk dataset is composed of 48 images in 8 sequences, each corresponding to one of 5 transformations: viewpoint change, compression, blur, lighting change and zoom, with gradually increasing amounts of transformation. Among the 6 images of a sequence, one is the reference image and the rest are transformed from it with different transformation magnitudes, i.e., degrees of transformation. The ground-truth homography between the reference image and the transformed images is provided for evaluation. We follow the method of [Mikolajczyk and Schmid, 2005] to test our model. MAP (mean average precision), which measures the area under the precision-recall curve, is adopted as the evaluation metric.

### 4.2 Evaluation on UBC

In the testing phase, we adopt the mean score of all nodes (except the first) as the final similarity score if the monotonous loss is not used, i.e., λ = 0; otherwise, the final score is the mean of the last two nodes' scores. For example, for a sequence of 8 nodes with s_1, s_2, ..., s_7 the scores of the last 7 nodes, the final score is s = (s_1 + s_2 + ... + s_7)/7 if λ = 0, and s = (s_6 + s_7)/2 otherwise. To verify the effectiveness of our Loopy RNN network, we compare our model with two recent works that apply CNNs to patch-based image matching: MatchNet [Han et al., 2015] and [Zagoruyko and Komodakis, 2015]. Table 2 lists the comparison results.
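For reference, the FPR95 numbers reported below can be computed from raw pair scores roughly as follows; this is one common construction, assumed rather than taken from the authors' evaluation code:

```python
import numpy as np

def fpr95(scores, labels):
    """False positive rate at 95% recall: threshold the scores so that
    95% of the true-match pairs are accepted, then measure how many
    non-match pairs slip through. Lower is better."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # The 5th percentile of positive scores accepts 95% of true matches.
    thresh = np.percentile(scores[labels], 5)
    false_pos = np.count_nonzero((scores >= thresh) & ~labels)
    return false_pos / np.count_nonzero(~labels)

# Toy check: well-separated scores give an FPR95 of 0.
print(fpr95([0.9, 0.8, 0.85, 0.2, 0.3, 0.1], [1, 1, 1, 0, 0, 0]))
```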
| Method | Lib→Not | Lib→Yos | Not→Lib | Not→Yos | Yos→Lib | Yos→Not | Mean |
|--------|---------|---------|---------|---------|---------|---------|------|
| nSIFT concat.+NNet | 14.35 | 21.41 | 20.44 | 20.65 | 22.23 | 14.84 | 18.99 |
| MatchNet | 3.87 | 10.88 | 6.90 | 8.39 | 10.77 | 5.67 | 7.75 |
| Siamese | 4.33 | 14.89 | 8.77 | 13.23 | 13.48 | 5.75 | 10.07 |
| Pseudo-Siamese | 3.93 | 12.50 | 12.87 | 12.64 | 10.35 | 5.44 | 9.62 |
| Siamese-2stream | 3.05 | 9.02 | 6.45 | 10.44 | 11.51 | 5.29 | 7.63 |
| Siamese-2stream-l2 | 4.54 | 13.24 | 8.79 | 13.02 | 12.84 | 5.58 | 9.67 |
| 2ch-2stream | 1.90 | 5.00 | 4.85 | 4.10 | 7.20 | 2.11 | 4.56 |
| Loopy RNN (without monotonous loss) | 3.02 | 8.92 | 6.64 | 8.13 | 9.56 | 3.96 | 6.70 |
| Loopy RNN (with monotonous loss) | 2.79 | 8.29 | 6.22 | 7.71 | 9.19 | 3.72 | 6.32 |

*Table 2: Matching results on UBC (FPR95, %). Columns are train→test subset pairs (Lib: Liberty, Not: Notredame, Yos: Yosemite). Our Loopy RNN model uses N = 10, D = 1024; λ is set to 0.4 in the model with monotonous loss.*

Our best model achieves a 6.32% average error rate with the monotonous loss and parameters N = 10, D = 1024. nSIFT concat.+NNet is the concatenation of SIFT features with a neural network; our network outperforms this model by a large margin, mainly because the features extracted by FeatureNet are more effective. Up to the metric network, MatchNet has the same architecture as our network and achieves 7.75% average FPR95, so the 1.43% improvement over MatchNet comes entirely from the combination of our Loopy RNN architecture and the monotonous loss. The remaining networks are all Siamese-like. Compared with the Siamese and Pseudo-Siamese networks, the FPR95 of our network decreases by 3.75% and 3.3% respectively; this gain comes from FeatureNet and our Loopy RNN network. Siamese-2stream and Siamese-2stream-l2 utilize information at different resolutions and achieve 7.63% and 9.67% average FPR95 respectively; even using information at only one resolution, our model still obtains 1.31% and 3.34% improvements in FPR95. To verify the effect of the monotonous loss, we also list the results of our model without it, and the performance clearly decreases compared with the model with monotonous loss. These results illustrate that our Loopy RNN architecture is superior to the Siamese network and that the monotonous loss helps the Loopy RNN obtain a further performance gain.

The 2ch-2stream network [Zagoruyko and Komodakis, 2015] obtains better performance than our network. It does so by treating two grayscale patches as two channels of a new image and classifying this new image into two categories; it performs better because it processes the two images jointly from the very beginning, so that every feature map contains information from the pair. Because feature maps cannot be fed into an LSTM node directly, our network must extract a feature vector from each image, so FeatureNet processes the two images separately and pair-wise information is handled only in MetricNet. In going from image to feature vector, our network therefore loses some pair information compared with the 2-channel network.

**Parameter Analysis.** On one hand, we find that a large λ makes the proposed network hard to converge, so λ cannot be set to a large value. On the other hand, λ balances the weights of the monotonous loss and the cross-entropy loss, and too small a value weakens the effect of the monotonous loss. As a result, we set λ = 0.4 empirically. Figure 4 shows the experimental results for different LSTM output dimensions D and loop counts N: we fix N = 10 and test the model with different D (Figure 4(a)), then fix D = 1024 and test with different N (Figure 4(b)).

*Figure 4: Results for different LSTM output dimensions D (with fixed N = 10) and loop counts N (with fixed D = 1024).*
We observe that a higher dimension yields better performance at the cost of higher computational complexity; here we set D = 1024. For N, a larger loop count improves performance, but once N exceeds 10 the performance saturates, because the network already has enough observations to make the judgement; we therefore set N = 10. Based on the above analysis, we choose the model with N = 10, D = 1024 as our best model, even though the N = 10, D = 2048 model slightly outperforms it.

### 4.3 Evaluation on Local Descriptors

We compare our network with five networks of [Zagoruyko and Komodakis, 2015] that are tested on the same dataset. MSER SIFT uses the Euclidean distance between SIFT features to measure the similarity of two patches. Three other models, MSER Imagenet, MSER Siam-2stream-l2 and MSER Siam-SPP-l2, substitute CNN features for SIFT features, and MSER 2ch-2stream processes the two images jointly as mentioned above. All models are trained on the Liberty dataset. We test our model under the different transformations with increasing magnitudes; Figure 5 illustrates the overall results. Our network outperforms most networks, especially in the extreme cases.

*Figure 5: Evaluation on the Mikolajczyk dataset: matching MAP versus transformation magnitude, averaged over all sequences, for MSER SIFT, MSER Siam-2stream-l2, MSER Imagenet, MSER Siam-SPP-l2, MSER Loopy-RNN and MSER 2ch-2stream. Our model is trained on Liberty with N = 10, D = 1024.*

When the transformation magnitude equals 5, our network and 2ch-2stream greatly outperform the other networks by approximately 10%, which demonstrates the robustness of our model. MSER Siam-SPP-l2 includes an SPP layer that enables the network to deal with patches of different scales. Although our model only handles single-scale patches, it still outperforms MSER Siam-SPP-l2 in most cases: at magnitudes 1, 2 and 5, the MAP of our model is better by about 2%, 4% and 9% respectively. The performance of our network is comparable to that of the 2ch-2stream network, indicating that our loopy network generalizes well across datasets.

## 5 Conclusion

In this paper, we propose a novel Loopy RNN which matches a pair of patches in a recurrent manner. On top of the widely used cross-entropy loss, we add a monotonous loss aimed at constraining the output sequence. With this monotonous cross-entropy loss, our network imitates how a human observes two patches back and forth, with the judgement becoming more and more confident as the process proceeds. Our experimental results show the effectiveness of the proposed method.

## Acknowledgements

The work was supported by the State Key Research and Development Program (2016YFB1001003). This work was partially supported by a grant from the National Natural Science Foundation of China under contract No. U1611461, NSFC (61502301), China's Thousand Youth Talents Plan, the National Natural Science Foundation of China (61521062), the 111 Project (B07022) and the Opening Project of the Shanghai Key Laboratory of Digital Media Processing and Transmissions.

## References

[Arandjelovic et al., 2016] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.

[Bay et al., 2006] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In ECCV, 2006.
[Bertinetto et al., 2016] Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. In ECCV, 2016.

[Bromley et al., 1993] Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature verification using a siamese time delay neural network. IJPRAI, pages 669–688, 1993.

[Brown et al., 2011] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image descriptors. IEEE TPAMI, pages 43–57, 2011.

[Cheng et al., 2014] Jian Cheng, Cong Leng, Jiaxiang Wu, Hainan Cui, and Hanqing Lu. Fast and accurate image matching with cascade hashing for 3D reconstruction. In CVPR, 2014.

[Dorffner, 1996] Georg Dorffner. Neural networks for time series processing. Neural Network World, 1996.

[Fischer et al., 2014] Philipp Fischer, Alexey Dosovitskiy, and Thomas Brox. Descriptor matching with convolutional neural networks: a comparison to SIFT. arXiv preprint arXiv:1405.5769, 2014.

[Han et al., 2015] Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, and Alexander C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, pages 1735–1780, 1997.

[Jain et al., 2012] Prateek Jain, Brian Kulis, Jason V. Davis, and Inderjit S. Dhillon. Metric and kernel learning using a linear transformation. JMLR, pages 519–547, 2012.

[Jia and Darrell, 2011] Yangqing Jia and Trevor Darrell. Heavy-tailed distances for gradient based image descriptors. In NIPS, 2011.

[Jia et al., 2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia, 2014.

[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[Lowe, 2004] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, pages 91–110, 2004.

[Ma et al., 2016] Shugao Ma, Leonid Sigal, and Stan Sclaroff. Learning activity progression in LSTMs for activity detection and early detection. In CVPR, 2016.

[Mikolajczyk and Schmid, 2005] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE TPAMI, 27(10):1615–1630, 2005.

[Paulin et al., 2015] Mattis Paulin, Matthijs Douze, Zaid Harchaoui, Julien Mairal, Florent Perronnin, and Cordelia Schmid. Local convolutional features with unsupervised training for image retrieval. In CVPR, 2015.

[Ren et al., 2017] Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised deep learning for optical flow estimation. In AAAI, 2017.

[Rublee et al., 2011] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.

[Shyam et al., 2017] Pranav Shyam, Shubham Gupta, and Ambedkar Dukkipati. Attentive recurrent comparators. arXiv preprint arXiv:1703.00767, 2017.

[Tola et al., 2008] Engin Tola, Vincent Lepetit, and Pascal Fua. A fast local descriptor for dense matching. In CVPR, 2008.
[Trzcinski et al., 2012] Tomasz Trzcinski, Mario Christoudias, Vincent Lepetit, and Pascal Fua. Learning image descriptors with the boosting-trick. In NIPS, 2012.

[Winder et al., 2009] Simon Winder, Gang Hua, and Matthew Brown. Picking the best DAISY. In CVPR, 2009.

[Yan et al., 2015a] Junchi Yan, Jun Wang, Hongyuan Zha, Xiaokang Yang, and Stephen Chu. Consistency-driven alternating optimization for multigraph matching: A unified approach. IEEE Transactions on Image Processing, 24(3):994–1009, 2015.

[Yan et al., 2015b] Junchi Yan, Chao Zhang, Hongyuan Zha, Wei Liu, Xiaokang Yang, and Stephen M. Chu. Discrete hyper-graph matching. In CVPR, 2015.

[Yan et al., 2016a] Junchi Yan, Minsu Cho, Hongyuan Zha, Xiaokang Yang, and Stephen M. Chu. Multi-graph matching via affinity optimization with graduated consistency regularization. IEEE TPAMI, 38(6):1228–1242, 2016.

[Yan et al., 2016b] Yichao Yan, Bingbing Ni, Zhichao Song, Chao Ma, Yan Yan, and Xiaokang Yang. Person re-identification via recurrent feature aggregation. In ECCV, 2016.

[Yi et al., 2014] Dong Yi, Zhen Lei, Shengcai Liao, Stan Z. Li, et al. Deep metric learning for person re-identification. In ICPR, 2014.

[Zagoruyko and Komodakis, 2015] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, 2015.