# LF-Net: Learning Local Features from Images

- Yuki Ono, Sony Imaging Products & Solutions Inc. (yuki.ono@sony.com)
- Eduard Trulls, École Polytechnique Fédérale de Lausanne (eduard.trulls@epfl.ch)
- Pascal Fua, École Polytechnique Fédérale de Lausanne (pascal.fua@epfl.ch)
- Kwang Moo Yi, Visual Computing Group, University of Victoria (kyi@uvic.ca)

*32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.*

**Abstract.** We present a novel deep architecture and a training strategy to learn a local feature pipeline from scratch, using collections of images without the need for human supervision. To do so, we exploit depth and relative camera pose cues to create a virtual target that the network should achieve on one image, given the outputs of the network for the other image. While this process is inherently non-differentiable, we show that we can optimize the network in a two-branch setup by confining it to one branch, while preserving differentiability in the other. We train our method on both indoor and outdoor datasets, with depth data from 3D sensors for the former and depth estimates from an off-the-shelf Structure-from-Motion solution for the latter. Our models outperform the state of the art on sparse feature matching on both datasets, while running at 60+ fps for QVGA images.

## 1 Introduction

Establishing correspondences across images is at the heart of many Computer Vision algorithms, such as those for wide-baseline stereo, object detection, and image retrieval. With the emergence of SIFT [23], sparse methods that find interest points and then match them across images became the de facto standard. In recent years, many of these approaches have been revisited using deep nets [11, 33, 48, 49], which has also sparked a revival for dense matching [9, 43, 45, 52, 53]. However, dense methods tend to fail in complex scenes with occlusions [49], while sparse methods still suffer from severe limitations. Some can only train individual parts of the feature extraction pipeline [33], while others can be trained end-to-end but still require the output of hand-crafted detectors to initialize the training process [11, 48, 49]. For the former, reported gains in performance may fade away when they are integrated into the full pipeline. For the latter, parts of the image that hand-crafted detectors miss are simply discarded during training.

In this paper, we propose a sparse-matching method with a novel deep architecture, which we name LF-Net, for Local Feature Network, that is trainable end-to-end and does not require a hand-crafted detector to generate training data. Instead, we use image pairs for which we know the relative pose and corresponding depth maps, which can be obtained either with laser scanners or Structure-from-Motion algorithms [34], without any further annotation.

Given such dense correspondence data, we could in principle train a feature extraction pipeline by selecting a number of keypoints over two images, computing descriptors for each keypoint, using the ground truth to determine which ones match correctly across images, and using those matches to learn good descriptors. This is, however, not feasible in practice. First, extracting multiple maxima from a score map is inherently not differentiable. Second, performing this operation over each image produces two
disjoint sets of keypoints, which will typically yield very few ground-truth matches; these matches are what we need to train the descriptor network and, in turn, to guide the detector towards keypoints that are distinctive and good for matching.

We therefore propose to create a virtual target response for the network, using the ground-truth geometry in a non-differentiable way. Specifically, we run our detector on the first image, find the maxima, and then optimize the weights so that, when run on the second image, it produces a clean response map with sharp maxima at the right locations. Moreover, we warp the keypoints selected in this manner to the other image using the ground truth, guaranteeing a large pool of ground-truth matches. Note that while we break differentiability in one branch, the other one can be trained end to end, which lets us learn discriminative features by learning the entire pipeline at once. We show that our method greatly outperforms the state of the art.

## 2 Related work

Since the appearance of SIFT [23], local features have played a crucial role in computer vision, becoming the de facto standard for wide-baseline image matching [14]. They are versatile [23, 29, 47] and remain useful in many scenarios. This remains true even in competition with deep network alternatives, which typically involve dense matching [9, 43, 45, 52, 53] and tend to work best on narrow baselines, as they can suffer from occlusions, against which local features are robust.

Feature extraction and matching typically comprises three stages: finding interest points, estimating their orientation, and creating a descriptor for each. SIFT [23], along with more recent methods [1, 5, 32, 48], implements the entire pipeline. However, many other approaches target individual components, be it feature point extraction [31, 44], orientation estimation [50], or descriptor generation [36, 40, 41]. One problem with this approach is that improving one component in isolation does not necessarily translate into overall gains [35, 48]. Below, we briefly introduce some representative algorithms, separating those that rely on hand-crafted features from those that use Machine Learning techniques extensively.

**Hand-crafted.** SIFT [23] was the first widely successful attempt at designing an integrated solution for local feature extraction. Many subsequent efforts focused on reducing its computational requirements. For instance, SURF [5] used Haar filters and integral images for fast keypoint detection and descriptor extraction. DAISY [41] computed dense descriptors efficiently from convolutions of oriented gradient maps. The literature on this topic is very extensive; we refer the reader to [28].

**Learned.** While methods such as FAST [30] used machine learning techniques to extract keypoints, most early efforts in this area targeted descriptors, e.g., using metric learning [38] or convex optimization [37]. However, with the advent of deep learning, there has been a renewed push towards replacing all the components of the standard pipeline with convolutional neural networks.

**Keypoints.** In [44], piecewise-linear convolutional filters were used to make keypoint detection robust to severe lighting changes. In [33], neural networks are trained to rank keypoints. The latter is relevant to our work because no annotations are required to train the keypoint detector, but both methods are optimized for repeatability rather than for the quality of the associated descriptors.
Deep networks have also been used to learn covariant feature detectors, particularly towards invariance against affine transformations due to viewpoint changes [22, 26].

**Orientations.** The method of [50] is the only one we know of that focuses on improving orientation estimates. It uses a siamese network to predict the orientations that minimize the distance between the orientation-dependent descriptors of matching keypoints, assuming that the keypoints have been extracted with some other technique.

**Descriptors.** The bulk of learned methods focus on descriptors. In [13, 51], the comparison metric is learned by training Siamese networks. Later works, starting with [36], rely on hard sample mining for training and on the l2 norm for comparisons. A triplet-based loss function was introduced in [3], and in [25] negative samples are mined over the entire training batch. More recent efforts further increased performance using spectral pooling [46] and novel loss formulations [19]. However, none of these take into account what kind of keypoint they are working with, and they typically use only SIFT keypoints.

Crucially, performance improvements on popular benchmarks for any single one of these three components do not always survive when the whole pipeline is evaluated [35, 48]. For example, keypoints are often evaluated on repeatability, which can be misleading because they may be repeatable yet useless for matching purposes. Descriptors can prove very robust against photometric and geometric transformations, but this may be unnecessary or even counterproductive when patches are well aligned, and results on the most common benchmark [7] are heavily saturated. This was demonstrated in [48], which integrated previous efforts [36, 44, 50] into a fully differentiable architecture, reformulating the entire keypoint extraction pipeline with deep networks. It showed not only that joint training is necessary for optimal performance, but also that standard SIFT still outperforms many modern baselines. However, their approach still relies on SIFT keypoints for training, and as a result it cannot learn where SIFT itself fails. Along the same lines, a deep network was introduced in [11] to match images with a keypoint-based formulation, assuming a homography model. However, it was largely trained on synthetic images or on real images with affine transformations, and its effectiveness on practical wide-baseline stereo problems remains unproven.

Figure 1: (a) The LF-Net architecture. The detector network generates a scale-space score map along with dense orientation estimates, which are used to select keypoints (x_i, y_i, s_i, θ_i). Image patches around the chosen keypoints are cropped with a differentiable sampler (STN) and fed to the descriptor network, which generates a descriptor D_i for each patch. (b) For training we use a two-branch LF-Net, containing two identical copies of the network, processing two corresponding images I_i and I_j. Branch j (right) is used to generate a supervision signal for branch i (left), created by warping the results from i to j. As this is not differentiable, we optimize only over branch i, and update the network copy for branch j in the next iteration. We omit the samplers in this figure for simplicity.

Fig. 1 depicts the LF-Net architecture (top), and our training pipeline with two LF-Nets (bottom).
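For concreteness, the sketch below illustrates the kind of ground-truth warping that underpins the supervision signal of Fig. 1(b): a keypoint detected in image i is transferred to image j using its depth, the camera intrinsics, and the relative pose. This is a minimal sketch based on standard pinhole-camera geometry, not the authors' implementation; the function name, argument conventions, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def warp_keypoints(kp_i, depth_i, K_i, K_j, R_ij, t_ij):
    """Warp pixel keypoints from image i to image j given depth and relative pose.

    kp_i    : (N, 2) array of (x, y) pixel coordinates in image i
    depth_i : (N,)   depth of each keypoint in the frame of camera i
    K_i, K_j: (3, 3) camera intrinsics of images i and j
    R_ij    : (3, 3) rotation from camera i to camera j
    t_ij    : (3,)   translation from camera i to camera j
    """
    ones = np.ones((kp_i.shape[0], 1))
    pix_h = np.hstack([kp_i, ones])            # homogeneous pixel coordinates, (N, 3)
    rays = np.linalg.inv(K_i) @ pix_h.T        # (3, N) normalized camera rays in frame i
    pts_i = rays * depth_i                     # back-projected 3D points in camera i
    pts_j = R_ij @ pts_i + t_ij[:, None]       # same points expressed in camera j
    proj = K_j @ pts_j                         # project into image j
    return (proj[:2] / proj[2:]).T             # (N, 2) pixel coordinates in image j
```

A keypoint detected by the other branch that falls within a small pixel threshold of a warped location can then be treated as a ground-truth match, while keypoints whose depth is missing are simply discarded, as discussed in Section 3.1.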
## 3 Method

In the following, we first describe our network in Section 3.1. We break it down into its individual components and detail how they are connected to build a complete feature extraction pipeline. In Section 3.2, we introduce our training architecture, which is based on two LF-Net copies processing separate images with non-differentiable components, along with the loss function used to learn the weights. In Section 3.3 we outline some technical details.

### 3.1 LF-Net: a Local Feature Network

LF-Net has two main components. The first is a dense, multi-scale, fully convolutional network that returns keypoint locations, scales, and orientations. It is designed for fast inference and to be agnostic to image size. The second is a network that outputs local descriptors given patches cropped around the keypoints produced by the first network. We refer to them as the detector and the descriptor.

In the remainder of this section, we assume that the images have been undistorted using the camera calibration data. We convert them to grayscale for simplicity and normalize them individually using their mean and standard deviation [42]. As will be discussed in Section 4.1, depth maps and camera parameters can all be obtained using off-the-shelf SfM algorithms [34]. As depth measurements are often missing around 3D object boundaries, especially when computed with SfM algorithms, image regions for which we do not have depth measurements are masked and discarded during training.

**Feature map generation.** We first use a fully convolutional network to generate a rich feature map o from an image I, which can be used to extract keypoint locations as well as their attributes, i.e., scale and orientation. We do this for two reasons. First, it has been shown that using such a mid-level representation to estimate multiple quantities helps increase the predictive power of deep nets [21]. Second, it allows for larger batch sizes, that is, using more images simultaneously, which is key to training a robust detector. In practice, we use a simple ResNet [15] layout with three blocks. Each block contains 5×5 convolutional filters followed by batch normalization [17], leaky-ReLU activations, and another set of 5×5 convolutions. All convolutions are zero-padded to have the same output size as the input, and have 16 output channels. In our experiments, this has proved more successful than more recent architectures relying on strided convolutions and pixel shuffling [11].

**Scale-invariant keypoint detection.** To detect scale-invariant keypoints we propose a novel approach to scale-space detection that relies on the feature map o. To generate a scale-space response, we resize the feature map N times, at uniform intervals between 1/R and R, where N = 5 and R = 2 in our experiments. Each resized map is convolved with one of N independent 5×5 filters, which results in N score maps h_n, for 1 ≤ n ≤ N, one for each scale. To increase the saliency of keypoints, we perform a differentiable form of non-maximum suppression by applying a softmax operator over 15×15 windows in a convolutional manner, which results in N sharper score maps ĥ_n, for 1 ≤ n ≤ N.
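To make the multi-scale score-map generation and the windowed softmax above concrete, here is a minimal PyTorch-style sketch. It is not the authors' implementation: the log-uniform spacing of the scale factors, the sum-pooled formulation of the window softmax, and all module and parameter names are assumptions for illustration only.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def window_softmax(score, ksize=15):
    """Differentiable non-maximum suppression: each location is normalized by the
    sum of exponentials over its local window, which sharpens isolated maxima."""
    exp = torch.exp(score - score.amax(dim=(-2, -1), keepdim=True))  # subtract max for stability
    window_sum = F.avg_pool2d(exp, ksize, stride=1, padding=ksize // 2) * (ksize * ksize)
    return exp / (window_sum + 1e-6)


class ScaleSpaceScores(nn.Module):
    """Per-scale score maps h_n computed from a shared backbone feature map."""

    def __init__(self, in_channels=16, num_scales=5, scale_range=2.0, nms_ksize=15):
        super().__init__()
        # One independent 5x5 filter per scale, each producing a single-channel score map.
        self.score_convs = nn.ModuleList(
            nn.Conv2d(in_channels, 1, kernel_size=5, padding=2) for _ in range(num_scales)
        )
        # Scale factors spanning [1/R, R]; log-uniform spacing is an assumption.
        log_r = math.log(scale_range)
        self.scales = [math.exp(t) for t in torch.linspace(-log_r, log_r, num_scales).tolist()]
        self.nms_ksize = nms_ksize

    def forward(self, feat):                       # feat: (B, C, H, W) backbone feature map o
        _, _, h, w = feat.shape
        maps = []
        for conv, s in zip(self.score_convs, self.scales):
            resized = F.interpolate(feat, scale_factor=s, mode="bilinear", align_corners=False)
            score = window_softmax(conv(resized), self.nms_ksize)   # sharpened score map
            # Resize back to the input resolution so the per-scale maps can later be merged.
            maps.append(F.interpolate(score, size=(h, w), mode="bilinear", align_corners=False))
        return torch.cat(maps, dim=1)              # (B, N, H, W): one sharpened map per scale
```

Keypoint locations and scales would then be selected from these sharpened per-scale maps.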