Published as a conference paper at ICLR 2022

VISUAL CORRESPONDENCE HALLUCINATION

Hugo Germain1, Vincent Lepetit1 and Guillaume Bourmaud2
1LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France
2IMS, University of Bordeaux, Bordeaux INP, CNRS, Bordeaux, France
{firstname.lastname}@enpc.fr, guillaume.bourmaud@u-bordeaux.fr

ABSTRACT

Given a pair of partially overlapping source and target images and a keypoint in the source image, the keypoint's correspondent in the target image can be either visible, occluded or outside the field of view. Local feature matching methods are only able to identify the correspondent's location when it is visible, while humans can also hallucinate (i.e. predict) its location when it is occluded or outside the field of view, through geometric reasoning. In this paper, we bridge this gap by training a network to output a peaked probability distribution over the correspondent's location, regardless of this correspondent being visible, occluded, or outside the field of view. We experimentally demonstrate that this network is indeed able to hallucinate correspondences on pairs of images captured in scenes that were not seen at training-time. We also apply this network to an absolute camera pose estimation problem and find it is significantly more robust than state-of-the-art local feature matching-based competitors.

1 INTRODUCTION

Establishing correspondences between two partially overlapping images is a fundamental computer vision problem with many applications. For example, state-of-the-art methods for visual localization from an input image rely on keypoint matches between the input image and a reference image (Sattler et al., 2018; Sarlin et al., 2019; 2020; Revaud et al., 2019). However, these local feature matching methods still fail when few keypoints are covisible, i.e. when many image locations in one image are outside the field of view or become occluded in the second image. These failures are to be expected since these methods are pure pattern recognition approaches that seek to identify correspondences, i.e. to find correspondences in covisible regions, and consider the non-covisible regions as noise. By contrast, humans explain the presence of these non-covisible regions through geometric reasoning and consequently are able to hallucinate (i.e. predict) correspondences at those locations.

Geometric reasoning has already been used in computer vision for image matching, but usually as an a posteriori processing step (Fischler & Bolles, 1981; Luong & Faugeras, 1996; Barath & Matas, 2018; Chum et al., 2003; 2005; Barath et al., 2019; 2020). These methods seek to remove outliers from the set of correspondences produced by a local feature matching approach, using only limited geometric models such as epipolar geometry or planar assumptions.

Contributions. In this paper we tackle the problem of correspondence hallucination. In doing so we seek to answer two questions: (i) can we derive a network architecture able to learn to hallucinate correspondences? and (ii) is correspondence hallucination beneficial for absolute pose estimation? The answer to these questions is the main novelty of this paper. More precisely, we consider a network that takes as input a pair of partially overlapping source/target images and keypoints in the source image, and outputs for each keypoint a probability distribution over its correspondent's location in the target image plane.
We propose to train this network to both identify and hallucinate the keypoints' correspondents. We call the resulting method NeurHal, for Neural Hallucinations. To the best of our knowledge, learning to hallucinate correspondences is a virgin territory, thus we first provide an analysis of the specific features of that novel learning task. This analysis guides us towards employing an appropriate loss function and designing the architecture of the network. After training the network, we experimentally demonstrate that it is indeed able to hallucinate correspondences on unseen pairs of images captured in novel scenes. We also apply this network to a camera pose estimation problem and find it is significantly more robust than state-of-the-art local feature matching-based competitors.

Figure 1: Visual correspondence hallucination. Our network, called NeurHal, takes as input a pair of partially overlapping source and target images and a set of keypoints detected in the source image, and outputs for each keypoint a probability distribution over its correspondent's location in the target image. When the correspondent is actually visible, its location can be identified; when it is not, its location must be hallucinated. Two types of hallucination tasks can be distinguished: 1) if the correspondent is occluded, its location has to be inpainted; 2) if it is outside the field of view of the target image, its location needs to be outpainted. NeurHal generalizes to scenes not seen during training: for each of these three pairs of source/target images coming from the test scenes of ScanNet (Dai et al., 2017) and MegaDepth (Li & Snavely, 2018), we show (top row) the source image with a small subset of keypoints, and (bottom row) the target image with the probability distributions predicted by our network and the ground truth correspondents, with distinct markers for the identified, inpainted and outpainted correspondents.

2 RELATED WORK

To the best of our knowledge, hallucinating visual correspondences has never been attempted, but the related fields of local feature description and matching are immensely vast, and we focus here only on recent learning-based approaches.

Learning-based local feature description. Using deep neural networks to compute local feature descriptors has been shown to bring significant improvements in invariance to viewpoint and illumination changes compared to handcrafted methods (Csurka & Humenberger, 2018; Gauglitz et al., 2011; Salahat & Qasaimeh, 2017; Balntas et al., 2017). Most methods learn descriptors locally around pre-computed covisible interest regions in both images (Yi et al., 2016; DeTone et al., 2018; Balntas et al., 2016a; Luo et al., 2019), using convolutional siamese architectures trained with a contrastive loss (Gordo et al., 2016; Schroff et al., 2015; Balntas et al., 2016b; Radenović et al., 2016; Mishchuk et al., 2017; Simonyan et al., 2014), or using pose (Wang et al., 2020; Zhou et al., 2021) or self (Yang et al., 2021) supervision. To further improve performance, (Dusmanu et al., 2019; Revaud et al., 2019) propose to jointly learn to detect and describe keypoints in both images, while Germain et al. (2020) only detects in one image and densely matches descriptors in the other.

Learning-based local feature matching.
All the methods described in the previous paragraph establish correspondences by comparing descriptors using a simple operation such as a dot product. Thus the combination of such a simple matching method with a siamese architecture inevitably produces outlier correspondences, especially in non-covisible regions. To reduce the amount of outliers, most approaches employ so-called Mutual Nearest Neighbor (MNN) filtering. However, it is possible to go beyond a simple MNN and learn to match descriptors. Learning-based matching methods (Zhang et al., 2019; Brachmann & Rother, 2019; Moo Yi et al., 2018; Sun et al., 2020; Choy et al., 2020; 2016) take as input local descriptors and/or putative correspondences, and learn to output correspondence probabilities. However, all these matching methods focus only on correctly predicting covisible correspondences.

Jointly learning local feature description and matching. Several methods have recently proposed to jointly learn to compute and match descriptors (Sarlin et al., 2020; Sun et al., 2021; Li et al., 2020; Rocco et al., 2018; 2020). All these methods use a siamese Convolutional Neural Network (CNN) to obtain dense local descriptors, but they significantly differ in the way they establish matches. They fall into two categories. The first category of methods (Li et al., 2020; Rocco et al., 2018; 2020) computes a 4D correlation tensor that essentially represents the scores of all the possible correspondences. This 4D correlation tensor is then used as input to a second network that learns to modify it using soft-MNN and 4D convolutions. Instead of summarizing all the information into a 4D correlation tensor, the second category of methods (Sarlin et al., 2020; Sun et al., 2021) relies on Transformers (Vaswani et al., 2017; Dosovitskiy et al., 2020; Ramachandran et al., 2019; Caron et al., 2021; Cordonnier et al., 2020; Zhao et al., 2020; Katharopoulos et al., 2020) to let the descriptors of both images communicate and adapt to each other. All these methods again focus on correctly identifying covisible correspondences and consider non-covisible correspondences as noise. While our architecture is closely related to the second category of methods, as we also rely on Transformers, the motivation for using them is quite different since it is our goal of hallucinating correspondences that calls for a non-siamese architecture (see Sec. 3).

Visual content hallucination. Yang et al. (2019) propose to hallucinate the content of RGB-D scans to perform relative pose estimation between two images. More recently, Chen et al. (2021) regress distributions over relative camera poses for spherical images using joint processing of both images. The work of (Yang et al., 2020; Qian et al., 2020; Jin et al., 2021) shows that employing a hallucinate-then-match paradigm can be a reliable way of recovering 3D geometry or relative pose from sparsely sampled images. In this work, we focus on the problem of correspondence hallucination, which unlike the previously mentioned approaches does not aim at recovering explicit visual content or directly regressing a camera pose. Perhaps closest to our goal is Cai et al. (2021), which seeks to estimate a relative rotation between two non-overlapping images by learning to reason about hidden cues such as the direction of shadows in outdoor scenes, parallel lines or vanishing points.
3 OUR APPROACH

Our goal is to train a network that takes as input a pair of partially overlapping source/target images and keypoints in the source image, and outputs for each keypoint a probability distribution over its correspondent's location in the target image plane, regardless of this correspondent being visible, occluded, or outside the field of view. While the problem of learning to find the location of a visible correspondent has received a lot of attention in the past few years (see Sec. 2), to the best of our knowledge, this paper is the first attempt at learning to find the location of a correspondent regardless of this correspondent being visible, occluded, or outside the field of view. Since this learning task is virgin territory, we first analyze its specific features below, before defining a loss function and a network architecture able to handle these features.

3.1 ANALYSIS OF THE PROBLEM

The task of finding the location of a correspondent regardless of this correspondent being visible, occluded, or outside the field of view actually leads to three different problems. Before stating those three problems, let us first recall the notion of correspondent as it is the keystone of our problem.

Correspondent. Given a keypoint $p_S \in \mathbb{R}^2$ in the source image $I_S$, its depth $d_S \in \mathbb{R}_+$, and the relative camera pose $R_{TS} \in SO(3)$, $t_{TS} \in \mathbb{R}^3$ between the coordinate systems of $I_S$ and the target image $I_T$, the correspondent $p_T \in \mathbb{R}^2$ of $p_S$ in the target image plane is obtained by warping $p_S$:
$$p_T := \omega\left(d_S, p_S, R_{TS}, t_{TS}\right) := K_T\, \pi\!\left(d_S R_{TS} K_S^{-1} p_S + t_{TS}\right),$$
where $K_S$ and $K_T$ are the camera calibration matrices of the source and target images and $\pi(u) := [u_x/u_z,\, u_y/u_z,\, 1]^T$ is the projection function. In a slight abuse of notation, we do not distinguish a homogeneous 2D vector from a non-homogeneous 2D vector. Let us highlight that the correspondent $p_T$ of $p_S$ may not be visible, i.e. it may be occluded or outside the field of view.

Identifying the correspondent. In the case where a network has to establish a correspondence between a keypoint $p_S$ in $I_S$ and its visible correspondent $p_T$ in $I_T$, standard approaches, such as comparing a local descriptor computed at $p_S$ in $I_S$ with local descriptors computed at detected keypoints in $I_T$, are applicable to identify the correspondent $p_T$.

Outpainting the correspondent. When $p_T$ is outside the field of view of $I_T$, there is nothing to identify, i.e. neither can $p_T$ be detected as a keypoint nor can a local descriptor be computed at that location. Here the network first needs to identify correspondences in the region where $I_T$ overlaps with $I_S$ and realize that the correspondent $p_T$ is outside the field of view to eventually outpaint it (see Fig. 1). We call this operation "outpainting the correspondent" as the network needs to predict the location of $p_T$ outside the field of view of $I_T$.

Inpainting the correspondent. When $p_T$ is occluded in $I_T$, the problem is even more difficult since local features can be computed at that location but will not match the local descriptor computed at $p_S$ in $I_S$. As in the outpainting case, the network needs to identify correspondences in the region where $I_T$ overlaps with $I_S$ and realize that the correspondent $p_T$ is occluded to eventually inpaint the correspondent $p_T$ (see Fig. 1). We call this operation "inpainting the correspondent" as the network needs to predict the location of $p_T$ behind the occluding object.
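To make the warping $\omega$ above concrete, here is a minimal NumPy sketch of computing a correspondent from a source keypoint, its depth and the relative pose. The function and variable names are ours and not taken from the paper's code; we only assume standard 3×3 pinhole calibration matrices.

```python
import numpy as np

def warp_correspondent(p_S, d_S, R_TS, t_TS, K_S, K_T):
    """Compute p_T = K_T * pi(d_S * R_TS * K_S^{-1} * p_S + t_TS).

    p_S        : (2,) keypoint location in the source image (pixels).
    d_S        : scalar depth of the keypoint in the source camera frame.
    R_TS, t_TS : (3, 3) rotation and (3,) translation from source to target frame.
    K_S, K_T   : (3, 3) calibration matrices of the source and target cameras.
    Returns the (2,) correspondent location in the target image plane; it may
    lie outside the image boundaries (outpainting) or behind an occluder (inpainting).
    """
    p_S_h = np.array([p_S[0], p_S[1], 1.0])       # homogeneous source keypoint
    ray_S = np.linalg.inv(K_S) @ p_S_h            # back-projected viewing ray
    X_T = d_S * (R_TS @ ray_S) + t_TS             # 3D point in the target camera frame
    pi_X = np.array([X_T[0] / X_T[2], X_T[1] / X_T[2], 1.0])  # projection pi(u)
    p_T_h = K_T @ pi_X                            # target pixel coordinates (homogeneous)
    return p_T_h[:2]
```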
Let us now introduce a loss function and an architecture that are able to unify the identifying, inpainting and outpainting tasks.

3.2 LOSS FUNCTION

The distinction we made between the identifying, inpainting and outpainting tasks comes from the fact that the source image $I_S$ and the target image $I_T$ are projections of the same 3D environment from two different camera poses. In order to integrate this idea and obtain a unified correspondence learning task, we rely on the Neural Reprojection Error (NRE) introduced by Germain et al. (2021). In order to properly present the NRE, we first recall the notion of correspondence map.

Correspondence map. Given $I_S$, $I_T$ and a keypoint $p_S$ in the image plane of $I_S$, the correspondence map $C_T$ of $p_S$ in the image plane of $I_T$ is a 2D tensor of size $H_C \times W_C$ such that $C_T(p_T) := p(p_T \mid p_S, I_S, I_T)$ is the likelihood of $p_T$ being the correspondent of $p_S$. The likelihood can only be evaluated for $p_T \in \Omega_{C_T}$, where $\Omega_{C_T}$ is the set of all the pixel locations in $C_T$; here, we implicitly define the likelihood of $p_T$ falling outside the boundaries of $C_T$ to be zero. In practice, a correspondence map $C_T$ is implemented as a neural network that takes as input $p_S$, $I_S$ and $I_T$, and outputs a softmaxed 2D tensor. A correspondence map $C_T$ may not have the same number of rows and columns as $I_T$, especially when the goal is to outpaint a correspondence. Thus, in the general case, to transform a 2D point from the image plane of $I_T$ to the correspondence plane of $C_T$, we need an additional affine transformation matrix $K_C$. Let us highlight that this likelihood is obtained using the visual content of $I_S$ and $I_T$ only.

Neural Reprojection Error. The NRE (Germain et al., 2021) is a loss function that warps a keypoint $p_S$ into the image plane of $I_T$ and evaluates the negative log-likelihood at this location. In our context, the NRE can be written as:
$$\mathrm{NRE}\left(p_S, C_T, R_{TS}, t_{TS}, d_S\right) := -\ln C_T\left(x_T\right) \quad \text{where} \quad x_T = K_C\, \omega\left(d_S, p_S, R_{TS}, t_{TS}\right). \tag{1}$$
In general, $x_T$ does not have integer coordinates and the notation $\ln C_T(x_T)$ corresponds to performing a bilinear interpolation after the logarithm. For more details concerning the derivation of the NRE, the reader is referred to Germain et al. (2021). The NRE provides us with a framework to learn to identify, inpaint or outpaint the correspondent of $p_S$ in $I_T$ in a unified manner, since Eq. (1) is differentiable w.r.t. $C_T$ and there is no assumption regarding covisibility. The main difficulty to overcome is the definition of a network architecture able to output a consistent $C_T$ given only $p_S$, $I_S$ and $I_T$ as inputs, i.e. the network must figure out whether the correspondent of $p_S$ in $I_T$ can be identified or has to be inpainted or outpainted.

3.3 NETWORK ARCHITECTURE

The analysis from Sec. 3.1 and the use of the NRE as a loss (Sec. 3.2) call for: a non-siamese architecture, to be able to link the information from $I_S$ with the information from $I_T$ and outpaint or inpaint the correspondent if needed; and an architecture that outputs a matching score for all the possible locations in $I_T$ as well as locations beyond the field of view of $I_T$, as the network could decide to identify, inpaint or outpaint a correspondent at these locations. To fulfill these requirements, we propose the following: our network takes as input $I_S$ and $I_T$ as well as a set of keypoints $\{p_{S,n}\}_{n=1...N}$ in the source image plane of $I_S$. A siamese CNN backbone is applied to $I_S$ and $I_T$ to produce compact dense local descriptor maps $H_S$ and $H_T$.
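As a concrete reading of Eq. (1), the sketch below evaluates the NRE for one keypoint: the softmaxed correspondence map is passed through the logarithm first, then bilinearly interpolated at the warped location, as prescribed above. The function name and the use of grid_sample are our own choices rather than the paper's implementation, and we assume $x_T$ falls inside the (possibly padded) correspondence map.

```python
import torch
import torch.nn.functional as F

def nre(log_C_T, x_T):
    """Neural Reprojection Error of Eq. (1) for a single keypoint.

    log_C_T : (H_C, W_C) tensor, logarithm of a softmaxed correspondence map.
    x_T     : (2,) tensor, warped keypoint location K_C * omega(...) in map pixels.
    Bilinear interpolation is performed after the logarithm; locations outside
    the map have zero likelihood in the paper, a case not handled here.
    """
    H, W = log_C_T.shape
    # grid_sample expects (x, y) coordinates normalized to [-1, 1]
    gx = 2.0 * x_T[0] / (W - 1) - 1.0
    gy = 2.0 * x_T[1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy)).view(1, 1, 1, 2).to(log_C_T.dtype)
    log_lik = F.grid_sample(log_C_T[None, None], grid,
                            mode='bilinear', align_corners=True)
    return -log_lik.view(())  # negative log-likelihood
```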
In order to be able to outpaint correspondents in the target image plane, we pad $H_T$ with a learnable fixed vector $\lambda$. This padding step initializes descriptors at locations outside the field of view of $I_T$; we denote by $\gamma$ the amount of padding relative to the size of $H_T$. The dense descriptor maps $H_S$ and $H_{T,pad}$, and the keypoints $\{p_{S,n}\}_{n=1...N}$ are then used as inputs of a cross-attention-based backbone $F$ with positional encoding. This part of the network outputs a feature vector $d_{S,n}$ for each keypoint $p_{S,n}$ and dense feature vectors $D_{T,pad}$ of the size of $H_{T,pad}$. This cross-attention-based backbone allows the local descriptors $H_S$ and $H_{T,pad}$ to communicate with each other. Thus, during training, the network will be able to leverage this ability to communicate to learn to hallucinate peaked inpainted and outpainted correspondence maps. The correspondence map $C_{T,n}$ of $p_{S,n}$ in the image plane of $I_T$ is computed by applying a $1 \times 1$ convolution to $D_{T,pad}$ using $d_{S,n}$ as filter, followed by a 2D softmax.

Figure 2: Overview of NeurHal. See text for details.

An overview of our architecture, which we call NeurHal, is presented in Fig. 2. In practice, in order to keep the required amount of memory and the computational time reasonably low, the correspondence maps $\{C_{T,n}\}_{n=1...N}$ have a low resolution, i.e. for a target image of size 640 × 480, we use a CNN with an effective stride of s = 8 and consequently the resulting correspondence maps (with γ = 50%) are of size 160 × 120. Producing low resolution correspondence maps prevents NeurHal from predicting accurate correspondences. But as we show in the experiments, this low resolution is sufficient to hallucinate correspondences and give an affirmative answer to both questions: (i) can we derive a network architecture able to learn to hallucinate correspondences? and (ii) is correspondence hallucination beneficial for absolute pose estimation? Thus, we leave the question of the accuracy of hallucinated correspondences for future research. Additional details concerning the architecture are provided in Sec. C.1 of the appendix.

3.4 TRAINING-TIME

Given a pair of partially overlapping images ($I_S$, $I_T$), a set of keypoints with ground truth depths $\{p_{S,n}, d_{S,n}\}_{n=1...N}$ as well as the ground truth relative camera pose ($R_{TS}$, $t_{TS}$), the corresponding sum of NRE terms (Eq. 1) can be minimized w.r.t. the parameters of the network that produces the correspondence maps. Thus, we train our network using stochastic gradient descent and early stopping, by providing pairs of overlapping images along with the aforementioned ground truth information. Let us also highlight that there is no distinction in the training process between the identifying, inpainting and outpainting tasks, since the only thing our network outputs is correspondence maps. Moreover there is no need for labeling keypoints with ground truth labels such as "identify/visible", "inpaint/occluded" or "outpaint/outside the field of view". Additional information concerning the training is provided in Sec. C.2 of the appendix.

3.5 TEST-TIME

At test-time, our network only requires a pair of partially overlapping images ($I_S$, $I_T$) as well as keypoints $\{p_{S,n}\}_{n=1...N}$ in $I_S$, and outputs a correspondence map $C_{T,n}$ in the image plane of $I_T$ for each keypoint, regardless of its correspondent being visible, occluded or outside the field of view.
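The padding of $H_T$ and the per-keypoint correlation head described above can be summarized by the sketch below. It assumes padding by $\gamma$ times the feature-map size on each side and a plain dot-product between the keypoint feature and the dense target features; the cross-attention backbone $F$ itself is omitted and all names are ours.

```python
import torch
import torch.nn.functional as F

def pad_target_features(H_T, lam, gamma=0.5):
    """Pad the dense target descriptors H_T (C, h, w) with a learnable vector
    lam (C,), so that correspondents can later be outpainted beyond the field
    of view of the target image. With gamma = 0.5 the spatial size doubles."""
    C, h, w = H_T.shape
    ph, pw = int(gamma * h), int(gamma * w)
    padded = lam.view(C, 1, 1).expand(C, h + 2 * ph, w + 2 * pw).clone()
    padded[:, ph:ph + h, pw:pw + w] = H_T
    return padded

def correspondence_map(d_Sn, D_T_pad):
    """Correlate one keypoint feature d_Sn (C,) with the padded dense target
    features D_T_pad (C, H_C, W_C): a 1x1 convolution using d_Sn as the filter,
    followed by a 2D softmax. Returns an (H_C, W_C) probability map."""
    scores = torch.einsum('c,chw->hw', d_Sn, D_T_pad)
    return F.softmax(scores.flatten(), dim=0).view_as(scores)
```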
Figure 3: Evaluation of the ability of NeurHal to hallucinate correspondences on the test scenes of ScanNet and MegaDepth. (left) Histograms of the NRE (see Eq. 1) for each task (identifying, outpainting, inpainting), computed on correspondence maps produced by NeurHal. The value $\ln|\Omega_{C_T}|$ is the NRE of a uniform correspondence map. (right) Histograms of the errors between the argmax (mode) of a correspondence map and the ground truth correspondent's location, for each task. The value $E_U$ is the average error of a random prediction.

4 EXPERIMENTS

In these experiments, we seek to answer two questions: 1) "Is the proposed NeurHal approach presented in Sec. 3 capable of hallucinating correspondences?" and 2) "In the context of absolute camera pose estimation, does the ability to hallucinate correspondences bring further robustness?".

4.1 EVALUATION OF THE ABILITY TO HALLUCINATE CORRESPONDENCES

We evaluate the ability of our network to hallucinate correspondences on four datasets: the indoor datasets ScanNet (Dai et al., 2017) and NYU (Silberman et al., 2012), and the outdoor datasets MegaDepth (Li & Snavely, 2018) and ETH-3D (Schöps et al., 2017). For the indoor setting (respectively, outdoor setting), we train NeurHal on the training scenes of ScanNet (respectively, MegaDepth) as described in Sec. 3.4, and evaluate it on the disjoint set of validation scenes. Thus, all the qualitative and quantitative results presented in this section cannot be ascribed to scene memorization. For each dataset, we run predictions over 2,500 source and target image pairs sampled from the test set, with overlaps between 2% and 80%. For every image pair, we also feed as input to NeurHal keypoints in the source image. These keypoints have known ground truth correspondents in the target image and labels (visible, occluded, outside the field of view) that we use to evaluate the ability of our network to hallucinate correspondences. For more details on the settings of our experiment see Sec. C.2. For this experiment, we use γ = 50%.

We report in Fig. 3 two histograms computed over more than one million keypoints for each task we seek to validate: identification, inpainting, and outpainting. The first histogram, Fig. 3 (left), is obtained by evaluating for each correspondence map the NRE cost (Eq. 1) at the ground truth correspondent's location. In order to draw conclusions, we also report the negative log-likelihood of a uniform correspondence map ($\ln|\Omega_{C_T}|$). We find that for each task and for both datasets, the predicted probability mass lies significantly below $\ln|\Omega_{C_T}|$, which demonstrates NeurHal's ability to perform identification, inpainting and outpainting. On ScanNet, we also observe that identification is a simpler task than outpainting while inpainting is the hardest task: on average, the NRE cost of inpainted correspondents is higher than the average NRE cost of outpainted correspondents, which indicates the predicted correspondence maps are less peaked for inpainting than they are for outpainting. This corroborates what we empirically observed on qualitative results in Fig. 1, and supports our analysis in Sec. 3.1.
On MegaDepth, the outpainting and inpainting histograms have a similar shape, which does not reflect the previous statement, but we believe this is due to the fact that the inpainting labels are noisy for this dataset, as explained in Sec. C.2. On the right histogram of Fig. 3, we report the distribution of the distance between the argmax of a correspondence map and the ground truth correspondent's location. We also report the average error of a random prediction. We find the histogram mass lies significantly to the left of the random prediction average error, indicating our model is able to place modes correctly in the correspondence maps, regardless of the task at hand. On ScanNet, we observe that the inpainting and outpainting histograms are very similar, indicating the predicted argmax is equally good for both tasks. As mentioned above, the correspondence maps produced by NeurHal have a low resolution (see Sec. 3.3), which explains why the "argmax error" is not closer to zero pixels.

Figure 4: Ability to hallucinate - comparison against state-of-the-art local feature matching methods, LoFTR (Sun et al., 2021), DRCNet (Li et al., 2020) and S2D (Germain et al., 2021), on ScanNet (S) and MegaDepth (M). Panels: (a) Inpainting - S, (b) Outpainting - S, (c) Inpainting - M, (d) Outpainting - M. For each method, we report the percentage of keypoints' correspondents whose distance w.r.t. the ground truth location is lower than x pixels, as a function of x, for (a, c) the inpainting task and (b, d) the outpainting task.

In Fig. 4, we compare the hallucination performance of NeurHal against state-of-the-art local feature matching methods. Since all these local feature matching methods were designed and trained on pairs of images with significant overlap to perform only identification, they obtain poor inpainting results. Concerning the outpainting task, these methods seek to find a correspondent within the image boundaries, consequently they cannot outpaint correspondences and obtain very poor results. In Fig. 5 we show several qualitative inpainting/outpainting results on the ScanNet and MegaDepth datasets. In the appendix, we also report qualitative results obtained on the NYU Depth dataset (Fig. 16) and on the ETH-3D dataset (Fig. 15). These results allow us to conclude that NeurHal is able to hallucinate correspondences with a strong generalization capacity. Additional experiments concerning the ability to hallucinate correspondences are provided in Sec. A, as well as technical details regarding the evaluation protocol in Sec. C.3.

4.2 APPLICATION TO ABSOLUTE CAMERA POSE ESTIMATION

In the previous experiment, we showed that our network is able to hallucinate correspondences. We now evaluate whether this ability helps improve the robustness of an absolute camera pose estimator. We run this evaluation on the test set of ScanNet over 2,500 source and target image pairs captured in scenes that were not used at training time. For each source/target image pair, we employ NeurHal to produce correspondence maps. As in the previous experiment, we use γ = 50%. Given these correspondence maps and the depth map of the source image, we estimate the absolute camera pose between the target image and the source image using the method proposed in Germain et al. (2021).
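The pose estimator actually used here is the one of Germain et al. (2021), which consumes the dense correspondence (loss) maps directly. Purely as an illustration of how the maps constrain the pose, here is a hedged sketch of the simpler sparse variant discussed later in Sec. B.1: take each map's argmax as a 2D correspondent, back-project the source keypoints with their depths, and solve PnP with RANSAC. The function names, the OpenCV solver choice and the K_C_inv mapping are our own assumptions.

```python
import numpy as np
import cv2

def pose_from_correspondence_maps(corr_maps, kpts_S, depths_S, K_S, K_T, K_C_inv):
    """Sparse absolute pose estimation from NeurHal-style correspondence maps.

    corr_maps : (N, H_C, W_C) correspondence maps (one per source keypoint).
    kpts_S    : (N, 2) keypoints in the source image, depths_S : (N,) their depths.
    K_C_inv   : (3, 3) affine matrix mapping correspondence-map pixels back to
                target image pixels (inverse of K_C in the paper's notation).
    Returns (R_TS, t_TS, inliers): the target camera pose w.r.t. the source camera.
    """
    # 3D points in the source camera frame: X = d * K_S^{-1} [u, v, 1]^T
    pts_h = np.concatenate([kpts_S, np.ones((len(kpts_S), 1))], axis=1)
    X_S = depths_S[:, None] * (np.linalg.inv(K_S) @ pts_h.T).T

    # argmax of each correspondence map, mapped back to target image coordinates;
    # outpainted argmaxes may legitimately lie outside the image borders
    flat_idx = corr_maps.reshape(len(corr_maps), -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, corr_maps.shape[1:])
    p_C_h = np.stack([xs, ys, np.ones_like(xs)]).astype(np.float64)
    p_T = (K_C_inv @ p_C_h).T[:, :2]

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        X_S.astype(np.float64), p_T, K_T.astype(np.float64), None,
        reprojectionError=8.0, flags=cv2.SOLVEPNP_EPNP)
    R_TS, _ = cv2.Rodrigues(rvec)
    return R_TS, tvec.squeeze(), inliers
```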
In Fig. 6, we show the results of an ablation study conducted on ScanNet. In this study, we focus on the robustness of the camera pose estimate for various combinations of training data, i.e. we consider a pose "correct" if the rotation error is lower than 20 degrees and the translation error is below 1.5 meters (see Sec. C.3). We find that training our network to perform the three tasks (identification, inpainting, and outpainting) produces the best results. In particular, we find that adding outpainting plays a critical role in improving localization of low-overlap image pairs. We also find that learning to inpaint does not bring much improvement to the absolute camera pose estimation.

In Fig. 7, we compare the results of NeurHal against state-of-the-art local feature matching methods. In low-overlap settings, very few keypoint correspondents can be identified and many keypoint correspondents have to be outpainted. In this case, we find that NeurHal is able to estimate the camera pose correctly significantly more often than any other method, since NeurHal is the only method able to outpaint correspondences (see Fig. 4). For high-overlap image pairs, the ability to hallucinate is not useful since many keypoint correspondents can be identified. In this case, we find that state-of-the-art local feature matching methods are slightly better than NeurHal. This is likely due to the fact that NeurHal outputs low resolution correspondence maps while the other methods output high resolution correspondences. The overall performance shows that NeurHal significantly outperforms all the competitors, which allows us to conclude that the ability of NeurHal to outpaint correspondences is beneficial for absolute pose estimation. Technical details concerning the previous experiment as well as additional experiments concerning the application to absolute camera pose estimation are provided in Sec. B.

Figure 5: Ability to hallucinate - qualitative inpainting/outpainting results on ScanNet and MegaDepth. To illustrate the ability of NeurHal to hallucinate correspondents, we display correspondence maps predicted by NeurHal on image pairs captured in scenes that were not seen at training-time: (top row) outpainting examples, (bottom row) inpainting examples. In the source image, the red dot is a keypoint. In the target image and in the (negative-log) correspondence map, the red dot represents the ground truth keypoint's correspondent. The dashed rectangles represent the borders of the target images. More results on the NYU and ETH-3D datasets can be found in appendix D.1.

5 LIMITATIONS

We identified the following limitations for our approach: (i) The previous experiments showed that NeurHal is able to inpaint correspondences, but the inpainted correspondence maps are much less peaked than the outpainted correspondence maps. This is likely due to the fact that inpainting correspondences is much more difficult than outpainting correspondences (see Sec. 3.1). (ii) The proposed architecture outputs low resolution correspondence maps (see Sec. 3.3), e.g. 160 × 120 for input images of size 640 × 480 and an amount of padding γ = 50%. This is essentially due to the quadratic complexity of the attention layers we use (see Sec. C.1 of the appendix). (iii) Our approach is able to outpaint correspondences but our correspondence maps have a finite size.
Thus, in the case where a keypoint's correspondent falls outside the correspondence map, the resulting correspondence map would be erroneous. We believe these three limitations are interesting future research directions.

Figure 6: Ablation study - impact of learning to hallucinate for absolute camera pose estimation. We compare the influence of adding the inpainting and outpainting (γ = 50%) tasks when training NeurHal. We report the percentage of camera poses being correctly estimated for image pairs having an overlap between 2% and x%, as a function of x, on ScanNet (Dai et al., 2017), with thresholds for translation and rotation errors of τt = 1.5 m and τr = 20.0°. Learning to hallucinate correspondences (especially outpainting) significantly improves the amount of correctly estimated poses.

Figure 7: Absolute camera pose experiment. We compare the performance of NeurHal against state-of-the-art local feature matching methods, R2D2 (Revaud et al., 2019), SP+SG (Sarlin et al., 2020), LoFTR (Sun et al., 2021), DRCNet (Li et al., 2020) and S2D (Germain et al., 2021), on ScanNet (Dai et al., 2017), for two rotation and translation error thresholds: (a) τt = 1.5 m, τr = 20° and (b) τt = 1.0 m, τr = 15°. The "identity" method consists in systematically predicting the identity pose. We report the percentage of camera poses being correctly estimated for pairs of images that have an overlap between 2% and x%, as a function of x. See discussion in Sec. 4.2.

6 CONCLUSION

To the best of our knowledge, this paper is the first attempt to learn to inpaint and outpaint correspondences. We proposed an analysis of this novel learning task, which guided us towards employing an appropriate loss function and designing the architecture of our network. We experimentally demonstrated that our network is indeed able to inpaint and outpaint correspondences on pairs of images captured in scenes that were not seen at training-time, in both indoor (ScanNet) and outdoor (MegaDepth) settings. We also tested our network on other datasets (ETH-3D and NYU) and found that our model has strong generalization ability. We then sought to experimentally illustrate that hallucinating correspondences is not just a fundamental AI problem but is also interesting from a practical point of view. We applied our network to an absolute camera pose estimation problem and found that hallucinating correspondences, especially outpainting correspondences, allowed us to significantly outperform state-of-the-art feature matching methods in terms of robustness of the resulting pose estimate. Beyond this absolute pose estimation application, this work points to new research directions such as integrating correspondence hallucination into Structure-from-Motion pipelines to make them more robust when few images are available.

7 ETHICS STATEMENT

The method described in this paper has the potential to greatly improve many computer vision-based industrial applications, especially those involving visual localization in GPS-denied or cluttered environments. For example, robotics or augmented reality applications could benefit from our algorithm to better relocalize within their surroundings, which could lead to more reliable and overall safer behaviours.
If this were to be applied to autonomous driving or drone-based search and rescue, one could appreciate the positive societal impact of our method. On the other hand, like many computer vision algorithms, it could be applied to improve the robustness of malicious devices such as weaponized UAVs, or to invade citizens' privacy through environment re-identification. Thankfully, as AI technology advances, discussions and regulations are brought forward by governments and public entities. These ethical debates pave the way for a brighter future and make us think NeurHal will bring more benefits than harm to society.

8 REPRODUCIBILITY

We provide the NeurHal model architecture and weights in the supplementary material. We also release a simple evaluation script that generates qualitative results, and show in a notebook the results obtained on an image pair captured indoors using a smartphone.

ACKNOWLEDGEMENT

The authors would like to thank Matthieu Vilain and Rémi Giraud for their insight on visual correspondence hallucination. This project has received funding from the Bosch Research Foundation (Bosch Forschungsstiftung). This work was granted access to the HPC resources of IDRIS under the allocation 2021-AD011011682R1 made by GENCI.

REFERENCES

Vassileios Balntas, Edward Johns, Lilian Tang, and Krystian Mikolajczyk. PN-Net: Conjoined Triple Deep Network for Learning Local Image Descriptors. In arXiv Preprint, 2016a.

Vassileios Balntas, Edward Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning Local Feature Descriptors with Triplets and Shallow Convolutional Neural Networks. In British Machine Vision Conference, 2016b.

Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A Benchmark and Evaluation of Handcrafted and Learned Local Descriptors. In Conference on Computer Vision and Pattern Recognition, 2017.

Daniel Barath and Jiri Matas. Graph-Cut RANSAC. In Conference on Computer Vision and Pattern Recognition, pp. 6733–6741, 2018.

Daniel Barath, Jiri Matas, and Jana Noskova. MAGSAC: Marginalizing Sample Consensus. In Conference on Computer Vision and Pattern Recognition, 2019.

Daniel Barath, Jana Noskova, Maksym Ivashechkin, and Jiri Matas. MAGSAC++, A Fast, Reliable and Accurate Robust Estimator. In Conference on Computer Vision and Pattern Recognition, pp. 1301–1309, 2020.

Andrew Blake and Andrew Zisserman. Visual Reconstruction. MIT Press, 1987.

Eric Brachmann and Carsten Rother. Neural-Guided RANSAC: Learning Where to Sample Model Hypotheses. In International Conference on Computer Vision, 2019.

Ruojin Cai, Bharath Hariharan, Noah Snavely, and Hadar Averbuch-Elor. Extreme Rotation Estimation using Dense Correlation Volumes. In Conference on Computer Vision and Pattern Recognition, 2021.

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers. In arXiv Preprint, 2021.

Kefan Chen, Noah Snavely, and Ameesh Makadia. Wide-Baseline Relative Camera Pose Estimation with Directional Learning. In Conference on Computer Vision and Pattern Recognition, 2021.

Christopher Choy, Jun Young Gwak, Silvio Savarese, and Manmohan Chandraker. Universal Correspondence Network. In Advances in Neural Information Processing Systems, 2016.

Christopher Choy, Junha Lee, René Ranftl, Jaesik Park, and Vladlen Koltun. High-Dimensional Convolutional Networks for Geometric Pattern Recognition. In Conference on Computer Vision and Pattern Recognition, pp. 11227–11236, 2020.
Ondrej Chum, Jiri Matas, and Josef Kittler. Locally Optimized RANSAC. In DAGM Symposium on Pattern Recognition, 2003.

Ondrej Chum, Tomas Werner, and Jiri Matas. Two-View Geometry Estimation Unaffected by a Dominant Plane. In Conference on Computer Vision and Pattern Recognition, pp. 772–779, 2005.

Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the Relationship Between Self-Attention and Convolutional Layers. In arXiv Preprint, 2020.

Gabriela Csurka and Martin Humenberger. From Handcrafted to Deep Local Invariant Features. In Computing Research Repository, 2018.

Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Conference on Computer Vision and Pattern Recognition, 2017.

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-Supervised Interest Point Detection and Description. In Conference on Computer Vision and Pattern Recognition, 2018.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. In arXiv Preprint, 2020.

Mihai Dusmanu, Ignacio Rocco, Tomás Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A Trainable CNN for Joint Description and Detection of Local Features. In Conference on Computer Vision and Pattern Recognition, 2019.

Martin A. Fischler and Robert C. Bolles. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM, 1981.

Steffen Gauglitz, Tobias Höllerer, and Matthew Turk. Evaluation of Interest Point Detectors and Feature Descriptors for Visual Tracking. International Journal of Computer Vision, 94:335–360, 2011.

Hugo Germain, Guillaume Bourmaud, and Vincent Lepetit. S2DNet: Learning Image Features for Accurate Sparse-to-Dense Matching. In European Conference on Computer Vision, 2020.

Hugo Germain, Vincent Lepetit, and Guillaume Bourmaud. Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation. In Conference on Computer Vision and Pattern Recognition, 2021.

Albert Gordo, Jon Almazán, Jérôme Revaud, and Diane Larlus. Deep Image Retrieval: Learning Global Representations for Image Search. In European Conference on Computer Vision, 2016.

Linyi Jin, Shengyi Qian, Andrew Owens, and David F. Fouhey. Planar Surface Reconstruction from Sparse Views. arXiv:2103.14644, 2021.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention. In International Conference on Machine Learning, 2020.

Xinghui Li, Kai Han, Shuda Li, and Victor Prisacariu. Dual-Resolution Correspondence Networks. In Advances in Neural Information Processing Systems, 33, 2020.

Z. Li and Noah Snavely. MegaDepth: Learning Single-View Depth Prediction from Internet Photos. In Conference on Computer Vision and Pattern Recognition, 2018.

Ilya Loshchilov and F. Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.

D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2), 2004.

Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan.
ContextDesc: Local Descriptor Augmentation with Cross-Modality Context. In Conference on Computer Vision and Pattern Recognition, pp. 2522–2531, 2019.

Quang-Tuan Luong and Olivier D. Faugeras. The Fundamental Matrix: Theory, Algorithms, and Stability Analysis. International Journal of Computer Vision, 17(1):43–75, 1996.

Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenović, and Jiri Matas. Working Hard to Know Your Neighbor's Margins: Local Descriptor Learning Loss. In Advances in Neural Information Processing Systems, 2017.

Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to Find Good Correspondences. In Conference on Computer Vision and Pattern Recognition, pp. 2666–2674, 2018.

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor Segmentation and Support Inference from RGBD Images. In European Conference on Computer Vision, 2012.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic Differentiation in PyTorch. In Advances in Neural Information Processing Systems, 2017.

Shengyi Qian, Linyi Jin, and David F. Fouhey. Associative3D: Volumetric Reconstruction from Sparse Views. In European Conference on Computer Vision, 2020.

Filip Radenović, Giorgos Tolias, and Ondrej Chum. CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples. In European Conference on Computer Vision, 2016.

Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-Alone Self-Attention in Vision Models. In Advances in Neural Information Processing Systems, 2019.

Jerome Revaud, Cesar De Souza, Martin Humenberger, and Philippe Weinzaepfel. R2D2: Reliable and Repeatable Detector and Descriptor. In Advances in Neural Information Processing Systems, pp. 12405–12415, 2019.

Ignacio Rocco, M. Cimpoi, Relja Arandjelović, Akihiko Torii, Tomás Pajdla, and Josef Sivic. Neighbourhood Consensus Networks. In Advances in Neural Information Processing Systems, 2018.

Ignacio Rocco, Relja Arandjelović, and Josef Sivic. Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions. In European Conference on Computer Vision, 2020.

Ehab Salahat and Murad Qasaimeh. Recent Advances in Features Extraction and Description Algorithms: A Comprehensive Survey. In IEEE International Conference on Industrial Technology (ICIT), pp. 1059–1063, 2017.

Paul-Edouard Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk. From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In Conference on Computer Vision and Pattern Recognition, pp. 12716–12725, 2019.

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning Feature Matching with Graph Neural Networks. In Conference on Computer Vision and Pattern Recognition, 2020.

Torsten Sattler, W. Maddern, C. Toft, Akihiko Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, Marc Pollefeys, Josef Sivic, F. Kahl, and Tomás Pajdla. Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions. In Conference on Computer Vision and Pattern Recognition, 2018.

J. L. Schönberger and J.-M. Frahm. Structure-from-Motion Revisited. In Conference on Computer Vision and Pattern Recognition, 2016.

Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos.
In Conference on Computer Vision and Pattern Recognition, 2017.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Computing Research Repository, 2015.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Learning Local Feature Descriptors Using Convex Optimisation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 2014.

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-Free Local Feature Matching with Transformers. In Conference on Computer Vision and Pattern Recognition, 2021.

Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, and Kwang Moo Yi. ACNe: Attentive Context Normalization for Robust Permutation-Equivariant Learning. In Conference on Computer Vision and Pattern Recognition, pp. 11286–11295, 2020.

Christian Szegedy, V. Vanhoucke, S. Ioffe, Jon Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. In Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Philip H. S. Torr and Andrew Zisserman. MLESAC: A New Robust Estimator with Application to Estimating Image Geometry. Computer Vision and Image Understanding, 78(1):138–156, 2000.

Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In arXiv Preprint, 2017.

Qianqian Wang, Xiaowei Zhou, Bharath Hariharan, and Noah Snavely. Learning Feature Descriptors Using Camera Pose Supervision. arXiv:2004.13324, 2020.

Heng Yang, Wei Dong, Luca Carlone, and Vladlen Koltun. Self-Supervised Geometric Perception. In Conference on Computer Vision and Pattern Recognition, pp. 14345–14356, 2021.

Zhenpei Yang, Jeffrey Z. Pan, Linjie Luo, Xiaowei Zhou, Kristen Grauman, and Qixing Huang. Extreme Relative Pose Estimation for RGB-D Scans via Scene Completion. In Conference on Computer Vision and Pattern Recognition, pp. 4526–4535, 2019.

Zhenpei Yang, Siming Yan, and Qi-Xing Huang. Extreme Relative Pose Network under Hybrid Representations. In Conference on Computer Vision and Pattern Recognition, pp. 2452–2461, 2020.

Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned Invariant Feature Transform. In European Conference on Computer Vision, 2016.

Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, and Hongen Liao. Learning Two-View Correspondences and Geometry Using Order-Aware Network. In International Conference on Computer Vision, 2019.

Hengshuang Zhao, Jiaya Jia, and Vladlen Koltun. Exploring Self-Attention for Image Recognition. In Conference on Computer Vision and Pattern Recognition, pp. 10073–10082, 2020.

Qunjie Zhou, Torsten Sattler, and Laura Leal-Taixé. Patch2Pix: Epipolar-Guided Pixel-Level Correspondences. In Conference on Computer Vision and Pattern Recognition, pp. 4667–4676, 2021.

APPENDIX

In the following pages, we present additional experiments and technical details about our visual correspondence hallucination method NeurHal. We present additional experiments on the ability to hallucinate in Sec. A and on camera pose estimation in Sec. B. We describe technical details in Sec. C and provide additional qualitative results in Sec. D.
A Additional experiments concerning the ability to hallucinate correspondences
  A.1 Impact of learning to inpaint and outpaint
  A.2 Ability to hallucinate: NeurHal hallucination vs. homography-based warping
  A.3 Additional insights on correspondence hallucination
  A.4 Inpainted vs. outpainted correspondence maps
  A.5 Ability to hallucinate: NeurHal vs. (Germain et al., 2021)
B Additional experiments concerning the application to camera pose estimation
  B.1 Influence of the pose estimator: (Germain et al., 2021) vs. (Chum et al., 2003)
  B.2 Impact of the value of γ
  B.3 Analysis of the impact of inpainting and outpainting
  B.4 Additional indoor pose estimation results
C Technical details
  C.1 Architecture details
  C.2 Datasets and training details
  C.3 Evaluation details
D Additional qualitative results
  D.1 Generalization to new datasets
  D.2 Qualitative correspondence hallucination results and failure cases
  D.3 Qualitative camera pose estimation results

A ADDITIONAL EXPERIMENTS CONCERNING THE ABILITY TO HALLUCINATE CORRESPONDENCES

In this section we first present an additional ablation study on the ability to hallucinate, followed by additional insights on our model's internal functioning.

A.1 IMPACT OF LEARNING TO INPAINT AND OUTPAINT

To supplement the study made in Sec. 4.1, we now aim at evaluating the impact of learning to inpaint and outpaint specifically. To do so, we isolate keypoints with the identified, inpainted and outpainted labels in our ScanNet (Dai et al., 2017) evaluation set. In Fig. 8, we show the results of an ablation study on NeurHal's training setup. We report for the identification, inpainting and outpainting tasks two sets of cumulative histograms: 1) the NRE costs at ground truth keypoint correspondents' locations, and 2) distances between the argmax of the correspondence map and the ground truth location. On the NRE cost cumulative histograms, we also report the results for the uniform distribution, for models trained without and with outpainting (γ = 0% and γ = 50%, respectively).

For the identification task (Fig. 8 (a)) we find that all methods yield consistent performance. The left figure reveals that NeurHal's predictions are significantly above the uniform distribution, indicating peaky maps and thus confident predictions. The right figure shows that the distance of the argmax location w.r.t. the ground truth is also robust (NeurHal predicts at 1/8th of the original resolution but the histogram is computed at full resolution). For the inpainting task (Fig. 8 (b)) we can draw similar conclusions. We find however that correspondence maps are overall less peaky and closer to their respective uniform distribution, which indicates that predictions are less confident.
We also find that even though it was not trained to inpaint, the identification baseline is surprisingly able to inpaint correspondences, as its performance is not far from the identification+inpainting model. Lastly, for the outpainting task (Fig. 8 (c)), we find that learning to outpaint gives a significant boost in performance on both the NRE distribution and the correspondents' locations. We also find that jointly learning to inpaint and outpaint is beneficial to the quality of the outpainted cost maps, which implies that both objectives are complementary.

A.2 ABILITY TO HALLUCINATE: NEURHAL HALLUCINATION VS. HOMOGRAPHY-BASED WARPING

To compare the ability to hallucinate of NeurHal against a non learning-based approach, we report in Fig. 9 the performance of homography-based warping approaches that do not rely on correspondence hallucination. We compare the performance of a) NeurHal trained to both identify and hallucinate correspondences against b) a version of NeurHal trained without correspondence hallucination, followed by a homography-based warping stage to hallucinate correspondences. We derive several baselines of b). We report the performance obtained using a simple least-squares solver from all predicted correspondences, and a RANSAC alternative. We also report the performance of using an oracle prior to the homography estimation stage to filter out outlier correspondence predictions. We lastly report the performance of estimating the homography using ground-truth identifiable correspondences inside a RANSAC loop. For completeness we show the performance of RANSAC-based homography estimation for several inlier thresholds. In all cases, we find that performing correspondence hallucination using NeurHal significantly outperforms all homography-based alternatives that do not resort to ground-truth information. This can be attributed to the lack of sufficient visual overlap between image pairs, which prevents obtaining an accurate homography estimate, as well as to the planar assumption of this model. Interestingly, it can be seen that with perfect correspondences, inpainted correspondents can be accurately recovered. On the other hand, outpainted correspondents do not seem to be robustly retrievable through a simple homography estimation.

A.3 ADDITIONAL INSIGHTS ON CORRESPONDENCE HALLUCINATION

While it is tempting to draw hypotheses regarding the internal functioning of NeurHal, we would like to highlight that this should be done with great care. Indeed, the sheer complexity of the operations run in the many attention layers of NeurHal prevents interpreting the reasoning that leads to the outputted correspondence maps. From a higher level however, we can reasonably speculate that our model implicitly learns to jointly predict the depth of the source image and the relative camera pose to warp the source keypoints and hallucinate their correspondents in the target image.

A.4 INPAINTED VS. OUTPAINTED CORRESPONDENCE MAPS

We can observe that the inpainted correspondence maps are less peaked than the outpainted correspondence maps (see Fig. 3). We believe this is because outpainting a correspondent essentially consists in transferring the features from location $p_{S,i}$ in the source features to location $p_{T,i}$ in the target features, to obtain $d_{S,i} = D_{T,pad}(p_{T,i})$. In the inpainting case, $p_{S,i}$ is occluded by an object in the target image, and this object is (often) also visible in the source image at location $p_{S,occ}$.
Thus, to produce peaked correspondence maps for both $p_{S,i}$ and $p_{S,occ}$, the network has to output features such that $d_{S,i} = D_{T,pad}(p_{T,i}) = d_{S,occ}$, which is more difficult than just $d_{S,i} = D_{T,pad}(p_{T,i})$.

Figure 8: Ability to hallucinate - ablation study on ScanNet. We compare the influence of adding inpainting and outpainting when training NeurHal. (left column) We report the percentage of keypoints' correspondents whose NRE cost is lower than x, as a function of x, for (a) identified, (b) inpainted and (c) outpainted keypoints; the values $\ln(|\Omega|)_{\gamma=0\%}$ and $\ln(|\Omega|)_{\gamma=50\%}$ correspond to uniform correspondence maps. (right column) We report the percentage of keypoints' correspondents whose distance w.r.t. the ground truth is lower than x pixels, as a function of x, for the same categories.

Figure 9: Ability to hallucinate - homography-based warping. We compare the performance of 1) NeurHal trained to both identify and hallucinate correspondences against 2) a version of NeurHal trained without correspondence hallucination followed by a homography-based warping stage to hallucinate correspondences (least-squares, RANSAC with various inlier thresholds τ, and least-squares with an oracle), and 3) a homography estimated from the ground truth (GT) identified correspondences. We report the performance to predict (a) inpainted and (b) outpainted correspondents' locations. For each method we report the percentage of keypoints' correspondents whose distance w.r.t. the ground truth location is lower than x pixels, as a function of x. We show in (c) and (d) the performance of RANSAC-based homography estimation for various inlier thresholds. We show in (e) and (f) similar curves when using ground-truth correspondences to estimate the homography. See text for more details.

A.5 ABILITY TO HALLUCINATE: NEURHAL VS. (GERMAIN ET AL., 2021)

The architecture proposed by Germain et al. (2021) is not able to predict outpainted correspondences. They do consider an extra category for un-matched keypoints, but the probability for that category is set to zero, i.e. the network does not output any score for that category. Setting this probability to zero is due to the fact that (Germain et al., 2021) considers a classical siamese CNN architecture that does not allow the features of both images to communicate. (Germain et al., 2021) is what we called, in the introduction, a "pure pattern recognition approach". Moreover, even if (Germain et al., 2021) were using a non-siamese architecture, their method would output a single score for the category "un-matched keypoint", which would allow the network to detect when the correspondent is not visible but would not be sufficient to outpaint the location of the correspondent.
B ADDITIONAL EXPERIMENTS CONCERNING THE APPLICATION TO CAMERA POSE ESTIMATION

In this section, we present additional experiments on correspondence hallucination for camera pose estimation. We begin with a study on the impact of the pose estimator in Sec. B.1, followed by a study on the impact of the padding value γ in Sec. B.2 and an analysis of the respective impact of inpainting and outpainting in Sec. B.3. Lastly, we present in Sec. B.4 additional results on indoor camera pose estimation.

B.1 INFLUENCE OF THE POSE ESTIMATOR: (GERMAIN ET AL., 2021) VS. (CHUM ET AL., 2003)

(Germain et al., 2021) provides a pose estimation framework which leverages dense keypoint matching uncertainties to predict more accurate and robust camera poses. Compared to the standard pose estimator presented in (Chum et al., 2003), which relies on sparse 2D-to-3D correspondences, the method from (Germain et al., 2021) preserves rich information in the form of dense loss maps, which is particularly suited for ambiguous matches. For the problem of correspondence hallucination, we find the loss maps of both outpainted and inpainted correspondences are usually unimodal but quite diffuse, and are thus particularly suited for this pose estimator. To study the influence of the pose estimator, we report in Fig. 10 the performance of NeurHal + (Germain et al., 2021) vs. NeurHal + (Chum et al., 2003). To estimate the camera pose using the method presented in (Chum et al., 2003), we simply take the argmax of each correspondence map and treat it as a sparse 2D correspondent in the query image. We also include the performance of NeurHal when trained without visual correspondence hallucination (i.e. trained using only identified ground truth correspondences). We find that the two methods trained without hallucination have poor performance for very low-overlap image pairs, which underlines the importance of correspondence hallucination in such cases. Concerning NeurHal trained with hallucination and using the pose estimator of (Chum et al., 2003), taking the argmax of a very coarse correspondence map prevents the pose estimator from achieving good results. NeurHal trained with hallucination and coupled with the pose estimator of Germain et al. (2021) achieves the best results, which shows that to obtain robust absolute camera pose estimates it is important to combine the ability of NeurHal to hallucinate correspondences with the pose estimator from (Germain et al., 2021).

B.2 IMPACT OF THE VALUE OF γ

We report in Fig. 11 the absolute camera pose estimation performance for varying values of γ. We compute the percentage of camera poses being correctly estimated for ScanNet (Dai et al., 2017) test image pairs that have an overlap between 2% and x% (as a function of x), for a translation threshold of 1.5m and a rotation threshold of 20.0°. We find that using only a small percentage of outpainting, such as γ = 10%, does not improve the performance, which is most likely due to the small amount of added training keypoints. For higher γ values, however, significant gains are visible, especially at small visual overlaps. This experiment demonstrates the benefit of learning to outpaint correspondences beyond image borders, which broadens the extent of usable source keypoints for camera pose estimation.

Figure 10: Influence of the pose estimator: (Germain et al., 2021) vs.
(Chum et al., 2003): To study the influence of using the pose estimator proposed in (Germain et al., 2021) compared to using the pose estimator from (Chum et al., 2003), we report the performance of NeurHal with both estimators. We also include, for both estimators, the performance of NeurHal trained with identified correspondences only (i.e. without hallucination). We report the percentage of camera poses being correctly estimated for pairs of ScanNet (Dai et al., 2017) images that have a visual overlap between 2% and x% (as a function of x).

Figure 11: Impact of the value of γ: For increasing values of γ (10%, 25%, 50%), we report the percentage of camera poses being correctly estimated for pairs of ScanNet images that have an overlap between 2% and x% (as a function of x), for τ_t = 1.5m and τ_r = 20.0°. We find that a small value of γ = 10% yields no benefit and even damages performance, while values of γ = 25% and γ = 50% bring significant improvements, especially at small visual overlaps.

We report in Fig. 12 the camera field-of-view as a function of the padding parameter (a small numerical illustration is given after Fig. 12). We find that γ = 50% provides 130° and 71° of field-of-view on average on ScanNet and MegaDepth respectively, which is significantly wider than with γ = 0%.

B.3 ANALYSIS OF THE IMPACT OF INPAINTING AND OUTPAINTING

In Fig. 11 we reported the percentage of camera poses being correctly estimated for several values of γ, which demonstrates the benefits of outpainting with a large γ for camera pose estimation. In Fig. 6 we also showed that learning to inpaint does not bring any significant improvement. We believe that outpainting improves the camera pose estimate because outpainted correspondences lie outside the field of view and thus complement the identified correspondences, which better constrains the camera pose estimate. On the contrary, inpainted correspondences are usually surrounded by identified correspondences; the information they provide is therefore redundant and does not further constrain the camera pose estimate.

Figure 12: Field-of-view as a function of γ and relative viewpoint statistics: We report in (a) the average camera field-of-view (in °) as a function of γ on ScanNet (Dai et al., 2017) and MegaDepth (Li & Snavely, 2018) images. We find that γ = 50% enables a significant amount of additional visual content to reproject within the image boundaries. We report in (b) the median absolute difference in rotation along the x, y and z axes, the norm of the relative rotation, and the difference in focal length on low-overlap image pairs:

Dataset     |Δr_x|    |Δr_y|    |Δr_z|    |Δθ|     |Δf|
ScanNet     29.21°    38.72°    25.68°    55.20°   0.00mm
MegaDepth   4.73°     6.91°     1.64°     20.25°   376.69mm

We report in (c) the histogram of the absolute relative angle norm |Δθ| on both datasets. We find that ScanNet image pairs exhibit strong relative angular motion while MegaDepth image pairs display predominantly zoom-ins and zoom-outs.
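As a small numerical illustration of how the padding parameter widens the field of view covered by the correspondence map, the snippet below computes the horizontal field of view spanned by a padded map for an idealized pinhole camera. The intrinsics used here are hypothetical and do not correspond to the actual ScanNet or MegaDepth calibrations, so the printed values only illustrate the trend reported in Fig. 12(a), not the 130°/71° figures.

```python
import math

def padded_fov_deg(width_px: float, focal_px: float, gamma: float) -> float:
    """Horizontal field of view covered by the padded correspondence map,
    assuming a pinhole camera with the principal point at the image center."""
    half_extent = (0.5 + gamma) * width_px        # padding adds gamma * W on each side
    return 2.0 * math.degrees(math.atan(half_extent / focal_px))

# Hypothetical intrinsics: with gamma = 0.5 the map spans twice the image width.
print(padded_fov_deg(width_px=640, focal_px=580, gamma=0.0))   # ~57.8 deg
print(padded_fov_deg(width_px=640, focal_px=580, gamma=0.5))   # ~95.6 deg
```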
Figure 13: Camera pose estimation experiment - Worst cases: We compare NeurHal with the Identity baseline, R2D2 (Revaud et al., 2019), SP+SG (Sarlin et al., 2020), LoFTR (Sun et al., 2021), DRCNet (Li et al., 2020) and S2D (Germain et al., 2021) on ScanNet (Dai et al., 2017) image pairs with visual overlaps between 2% and 5%, reporting precision as a function of the rotation threshold (°) and translation threshold (m). For every column, we subselect the 25% of image pairs with the worst predictions for a given method. We find that in all cases, NeurHal strongly outperforms its competitors. On the contrary, on the worst NeurHal predictions, state-of-the-art methods achieve a much lower performance, which is either on par with or lower than the predictions obtained using the Identity.

Figure 14: Camera pose estimation experiment - varying the threshold values: We report the performance of NeurHal and state-of-the-art feature matching methods on ScanNet (Dai et al., 2017) image pairs with visual overlaps between 2% and 5%. For various angular and translation thresholds we report the percentage of correctly localized images. We find that in all cases, NeurHal strongly outperforms its competitors.

B.4 ADDITIONAL INDOOR POSE ESTIMATION RESULTS

In addition to the results presented in Fig. 7, we report in Fig. 13 the performance of NeurHal and state-of-the-art feature matching methods on ScanNet (Dai et al., 2017) image pairs with visual overlaps between 2% and 5%. For every method, we subselect the 25% of image pairs with the worst predictions, and compare it with the performance of its competitors. We find that in all cases, NeurHal strongly outperforms its competitors. On the worst NeurHal predictions, state-of-the-art methods achieve a much lower performance: for this category we can observe that all NeurHal competitors are either on par with or below the Identity predictions. This figure highlights the fact that when NeurHal fails to correctly estimate the camera pose, all the competitors also fail, since all the methods perform similarly to the "Identity" method, i.e. the method that consists in systematically predicting the identity pose. Fig. 14 shows that NeurHal is much more robust than state-of-the-art local feature matching methods for pairs of images with a low overlap.

C TECHNICAL DETAILS

C.1 ARCHITECTURE DETAILS

NeurHal's architecture can be separated into two building blocks: the convolutional backbone and the multi-head attention block.

Convolutional backbone. The convolutional backbone consists of a truncated Inceptionv3 (Szegedy et al., 2016) model (up to Mixed-6a, 768-dimensional descriptors), modified as per Germain et al. (2021) to provide, in the case of ScanNet (Dai et al., 2017), a 1/8 output-to-input resolution ratio. To reduce memory consumption, we apply a simple 2D convolutional layer that compresses the descriptor size to 384. When γ > 0, we subsequently pad $H_T$ with the learned vector λ, producing $H_{T,\mathrm{pad}}$.
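The following is a minimal PyTorch sketch of this backbone under our own assumptions: we truncate torchvision's Inceptionv3 at Mixed_6a, keep its default strides (the actual model modifies them to reach the 1/8 ratio mentioned above, which yields 1/16 here), compress the descriptors with a 1x1 convolution, and pad the target feature map with a learned vector λ. The class and argument names are placeholders.

```python
import torch
import torch.nn as nn
import torchvision

class Backbone(nn.Module):
    """Hedged sketch of the truncated Inceptionv3 backbone described above."""
    def __init__(self, out_dim=384, gamma=0.5):
        super().__init__()
        incep = torchvision.models.inception_v3(weights=None)
        # Keep the layers up to and including Mixed_6a (768-dimensional features).
        self.features = nn.Sequential(
            incep.Conv2d_1a_3x3, incep.Conv2d_2a_3x3, incep.Conv2d_2b_3x3,
            nn.MaxPool2d(3, stride=2),
            incep.Conv2d_3b_1x1, incep.Conv2d_4a_3x3,
            nn.MaxPool2d(3, stride=2),
            incep.Mixed_5b, incep.Mixed_5c, incep.Mixed_5d, incep.Mixed_6a)
        self.compress = nn.Conv2d(768, out_dim, kernel_size=1)  # memory-saving compression
        self.gamma = gamma
        self.pad_vector = nn.Parameter(torch.zeros(out_dim))    # learned vector lambda

    def forward(self, image, pad_target=False):
        h = self.compress(self.features(image))                 # dense feature map (H_S or H_T)
        if not pad_target:
            return h
        b, c, H, W = h.shape
        ph, pw = int(self.gamma * H), int(self.gamma * W)        # gamma * size added on each side
        padded = self.pad_vector.view(1, c, 1, 1).expand(
            b, c, H + 2 * ph, W + 2 * pw).clone()
        padded[:, :, ph:ph + H, pw:pw + W] = h                   # original features in the center
        return padded                                            # H_{T,pad}
```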
Positional encoding. After computing $H_S$ and $H_{T,\mathrm{pad}}$ with the convolutional backbone, positional encoding is applied to both dense feature maps. Similarly to SuperGlue (Sarlin et al., 2020), we use a 6-layer MLP of sizes (32, 64, 128, 256, 384), mapping a positional meshgrid between (−1, 1) (centered around the image center) to higher dimensionalities. BatchNorm and ReLU layers are placed between consecutive modules. In our experiments, we tried adding more positional encoding layers but found it did not make a difference in performance. After applying the positional encoding, sparse descriptors $\{d_{S,n}\}_{n=1 \ldots N}$ are bilinearly interpolated at $\{p_{S,n}\}_{n=1 \ldots N}$ in $H_S$.

Self-attention. Following the positional encoding, a single multi-head attention layer with 4 heads is applied on $H_{T,\mathrm{pad}}$. It consists of a standard dot-product attention (Vaswani et al., 2017) coupled with a gating mechanism. For a given query $Q$, key $K$ and value $V$, we compute the attention as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(g \, QK^\top)V$ where $g = \sigma(\max(QK^\top))$. To mitigate the quadratic cost of the dot-product attention, we also apply a max-pooling operator on keys and values with a stride of 2, as we empirically found it had very little impact on performance. (A minimal sketch of this gated attention is given at the end of this subsection.) We also tried using a Linear Transformer architecture (e.g., Linformer (Katharopoulos et al., 2020)), but despite trying numerous variants we found it consistently hurt the convergence of the model.

Cross-attention. Using the same attention-layer design, we subsequently apply one cross-attention layer between $\{d_{S,n}\}_{n=1 \ldots N}$ and $H_S$. This layer allows communication between the interpolated source descriptors, which will be used to produce the final correspondence maps, and the original dense source image content. Then, we apply k cross-attention layers between $\{d_{S,n}\}_{n=1 \ldots N}$ and $H_{T,\mathrm{pad}}$. We empirically found these layers to be the most important, as they allow direct communication between the sparse source descriptors and the dense target feature map prior to the correspondence map computation. After trying different values of k, and with memory consumption in mind, we settled on k = 4 in all our experiments.

Affine transformation. The affine transformation matrix $K_C$ for the correspondence map $C_T$ of resolution $(W_T, H_T)$ is computed from the target image calibration matrix $K_T$, the downscaling factor s, and the padding offset
$$\begin{pmatrix} 0 & 0 & \gamma W_T \\ 0 & 0 & \gamma H_T \\ 0 & 0 & 0 \end{pmatrix}.$$
A point lies within the boundaries of $C_T$ if its (x, y) coordinates lie between $(-\gamma W_T, -\gamma H_T)$ and $((1 + \gamma)W_T, (1 + \gamma)H_T)$.

Implementation. The model is implemented in PyTorch (Paszke et al., 2017). For an indoor sample with 2000 keypoints, it has an average throughput of 8.84 images/s on an NVIDIA RTX 3070 GPU. We report the number of parameters of our model in Table 1.

Layer                 # of parameters
CNN                   2.4 M
Positional Encoding   142 K
Self-Attention        1.9 M
Cross-Attention       7.2 M
Total                 11.7 M

Table 1: Number of parameters in NeurHal.
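Below is a minimal PyTorch sketch of the gated dot-product attention described above, under our own reading of the formula: the gate g = σ(max(QKᵀ)) is taken per query over the key dimension, the 1/√d scaling is omitted as in the formula, and the stride-2 max-pooling over keys and values is applied over the flattened token dimension. All names are placeholders.

```python
import torch
import torch.nn.functional as F

def gated_attention(Q, K, V, pool_stride=2):
    """Hedged sketch of the gated attention of Sec. C.1 (single head).
    Q: (N, d) queries; K, V: (M, d) keys/values from a flattened feature map."""
    # Max-pool keys and values with stride 2 over the token dimension
    # to mitigate the quadratic cost of dot-product attention.
    K = F.max_pool1d(K.t().unsqueeze(0), pool_stride, pool_stride).squeeze(0).t()
    V = F.max_pool1d(V.t().unsqueeze(0), pool_stride, pool_stride).squeeze(0).t()
    scores = Q @ K.t()                                          # (N, M'), i.e. Q K^T
    g = torch.sigmoid(scores.max(dim=-1, keepdim=True).values)  # per-query gate sigma(max(QK^T))
    return torch.softmax(g * scores, dim=-1) @ V                # softmax(g QK^T) V

# Toy usage with random 384-dimensional descriptors (4 heads are used in practice).
Q, K, V = torch.randn(100, 384), torch.randn(4000, 384), torch.randn(4000, 384)
out = gated_attention(Q, K, V)   # (100, 384)
```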
C.2 DATASETS AND TRAINING DETAILS

ScanNet. The ScanNet (Dai et al., 2017) dataset is a large-scale indoor dataset containing monocular RGB videos and dense depth images, along with ground truth absolute camera poses. As in SuperGlue (Sarlin et al., 2020) and LoFTR (Sun et al., 2021), we pre-compute the visual overlaps between all image pairs for both training and test scenes. For the training set, we sample images with a visual overlap between 2% and 50% from the ScanNet training scenes, which provides us with challenging image pairs. We assemble 6M image pairs and randomly subsample 200k pairs at every training epoch. For testing, we sample 2,500 image pairs with overlaps between 2% and 80% from the ScanNet test scenes, using several bins to ensure the sampling is close to uniform. For both training and testing images, we sample keypoints in the source image along a regular grid with a cell size of 16 pixels. We remove keypoints with invalid depth, as well as those where the local depth gradient is too high, as the depth information might not be reliable there. We mark keypoints falling outside the target image plane as outpainted, and we automatically detect the keypoints to inpaint through a cyclic projection of the source keypoints to the target image and back. The remaining keypoints are labeled as identifiable. For all ScanNet experiments, NeurHal uses a 1/8 output-to-input resolution ratio, with a target correspondence map maximum edge size of 80 pixels (when γ = 0%).

MegaDepth. We use MegaDepth (Li & Snavely, 2018) to train and evaluate NeurHal on outdoor images. This dataset contains over one million images captured in touristic places, split into 196 scenes. To train NeurHal, and following the guidelines of Germain et al. (2021), we use the provided SIFT (Lowe, 2004)-based 3D reconstruction, which was built with COLMAP (Schönberger & Frahm, 2016). Because the sparse 3D point cloud comes from SfM, we however find that very few keypoints can be marked as inpainted; indeed, no 3D reconstruction is available for objects or people occluding the scene. To allow for a wide variety of image pairs, we use the sparse reconstruction to estimate the visual overlap and sample pairs with an overlap between 20% and 100%. We however find this overlap estimation to be quite unreliable, as only part of the scene is usually reconstructed. Since MegaDepth (Li & Snavely, 2018) images are of much higher resolution than ScanNet (Dai et al., 2017), we configure NeurHal to use a 1/16 output-to-input resolution ratio (with a simple max-pooling layer in the CNN). We set the target correspondence map maximum edge size to 60 pixels (when γ = 0%), to leave room in memory when γ = 50%.

Overlap estimation. For a given pair of images, we approximate the visual overlap by computing the covisibility ratio of keypoints. For a given source and target image pair, we first compute the source-to-target and target-to-source covisibility ratios using ground truth depth data and camera poses. We then define the visual overlap as the minimum of both ratios. On MegaDepth we find this overlap estimation to be fairly noisy, as depth is only partially known.

Optimizers and scheduling. On both datasets, NeurHal is trained for a maximum of 40 epochs. We use an initial learning rate of $10^{-3}$, with a linear learning rate warm-up over 3 epochs starting from 0.1 of the initial learning rate. As in Sun et al. (2021), we decay the learning rate by 0.5 every 8 epochs starting from the 8th epoch. We apply the linear scaling rule and use a batch size of 8 over 8 NVIDIA V100 GPUs. We use the AdamW (Loshchilov & Hutter, 2019) optimizer with a weight decay of 0.1. In all training procedures, we randomly initialize the model weights.
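This schedule can be written compactly as a multiplicative learning-rate factor. The snippet below is our reading of it (the warm-up boundaries and the placeholder model are assumptions), not the exact training code.

```python
import torch

def lr_factor(epoch: int) -> float:
    """Linear warm-up from 0.1x to 1x over the first 3 epochs, then a 0.5 decay
    every 8 epochs starting at epoch 8 (our reading of the schedule above)."""
    if epoch < 3:
        return 0.1 + 0.9 * epoch / 3.0
    return 0.5 ** (epoch // 8)

model = torch.nn.Linear(384, 384)             # placeholder for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(40):                       # trained for at most 40 epochs
    # ... one training epoch over the 200k sampled pairs ...
    optimizer.step()                          # placeholder single step
    scheduler.step()
```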
C.3 EVALUATION DETAILS

Evaluation protocol. All baselines follow the same standard protocol, in which we: 1) compute 2D-2D correspondences between the reference image and the query image, 2) lift these 2D-2D correspondences to 2D-3D correspondences using the available 3D information for the reference image, and 3) estimate the camera pose from these 2D-3D correspondences by minimizing the Reprojection Error (RE), i.e. applying LO-RANSAC+PnP (Chum et al., 2003) followed by a non-linear iterative refinement. This approach is widely used and leads to state-of-the-art results in visual localization benchmarks. We also include results for Germain et al. (2021), which we call S2D. For the evaluation of Fig. 4, we find the inpainted and outpainted correspondents for LoFTR (Sun et al., 2021) and DRCNet (Li et al., 2020) by fetching the argmax 2D coordinates in the 4D matching confidence volume. For S2D and NeurHal, we simply take the argmax in the correspondence maps for the same set of keypoints.

Choice of threshold. We reported in Fig. 14 the performance of NeurHal, state-of-the-art feature matching methods and the identity pose on ScanNet for several rotation and translation thresholds. We can see that the arbitrarily chosen thresholds of τ_t = 1.5m and τ_r = 20.0° set a hard objective, as the identity pose performs particularly poorly.

(Chum et al., 2003)-based pose estimator. For all (Chum et al., 2003)-based methods, we estimate the camera pose using the pycolmap python bindings. We tune the RANSAC threshold for optimal performance, and mark all cases with fewer than 3 valid correspondences (i.e. with a valid depth value) as failure cases (infinite pose error). The remaining parameters are left at their default values. We follow the evaluation instructions provided by each method, and use the indoor weights for SP+SG (Sarlin et al., 2020) and the dual-softmax indoor weights for LoFTR (Sun et al., 2021). In the case of NeurHal + (Chum et al., 2003), we simply read the argmax of the predicted correspondence maps to obtain explicit 2D-to-3D correspondences (a minimal sketch of this pipeline is given at the end of this subsection).

(Germain et al., 2021)-based pose estimator. For both S2D (Germain et al., 2021) and NeurHal, we only use coarse models, which operate at either 1/8th or 1/16th of the original input resolution. We first retrain the S2D coarse model (fully-convolutional Inceptionv3 (Szegedy et al., 2016), up to Mixed-6e) on the same training set as our method, with the same target resolution of 80 pixels. We refer to this model as S2D. Given correspondence maps and the depth map of the source image, we estimate the camera pose between the target image and the source image using the method proposed in Germain et al. (2021). For both S2D and NeurHal we use the same set of regularly sampled source keypoints (see Sec. C.1), and we perform camera pose estimation first using P3P inside an MSAC (Torr & Zisserman, 2000) loop. We run P3P for a maximum of 5,000 iterations over the top-20% correspondences. We then apply a coarse GNC (Blake & Zisserman, 1987) over all source keypoints with σ_max = 2.0 and σ_min = 0.6. Let us highlight that in all the camera pose experiments, the performance of NeurHal is obtained by predicting only low-resolution correspondence maps (see Sec. C.1).
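As an illustration of the (Chum et al., 2003)-style baseline above, the sketch below lifts argmax correspondents to 2D-3D matches and runs PnP + RANSAC. It uses OpenCV's solvePnPRansac rather than the pycolmap bindings used in the paper; the function name, the `scale` parameter and the thresholds are placeholders.

```python
import numpy as np
import cv2

def pose_from_correspondence_maps(corr_maps, kpts_src, depth_src, K_src, K_tgt, scale=8.0):
    """Hedged sketch: argmax of each correspondence map -> 2D correspondent in the
    target image, source keypoint + reference depth -> 3D point (source camera frame),
    then PnP + RANSAC to recover the target camera pose w.r.t. that frame.
    `scale` maps coarse correspondence-map pixels back to target image pixels."""
    pts3d, pts2d = [], []
    for (u, v), cmap in zip(kpts_src, corr_maps):
        z = depth_src[int(v), int(u)]
        if z <= 0:                                   # skip keypoints with invalid depth
            continue
        # Back-project the source keypoint into the source camera frame.
        x = (u - K_src[0, 2]) * z / K_src[0, 0]
        y = (v - K_src[1, 2]) * z / K_src[1, 1]
        pts3d.append([x, y, z])
        # Argmax of the predicted correspondence map, rescaled to image pixels.
        iy, ix = np.unravel_index(np.argmax(cmap), cmap.shape)
        pts2d.append([ix * scale, iy * scale])
    if len(pts3d) < 4:                               # OpenCV needs at least 4 points
        return None                                  # treated as a failure case
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.float32(pts3d), np.float32(pts2d), np.float64(K_tgt), None,
        reprojectionError=8.0)
    return (rvec, tvec, inliers) if ok else None
```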
D ADDITIONAL QUALITATIVE RESULTS

D.1 GENERALIZATION TO NEW DATASETS

So far we have demonstrated the ability of NeurHal to hallucinate correspondences on unseen validation scenes from both ScanNet (Dai et al., 2017) and MegaDepth (Li & Snavely, 2018). In order to further demonstrate the generalization capacity of NeurHal, we report qualitative results obtained on the NYU Depth Dataset (Nathan Silberman & Fergus, 2012) in Fig. 16 and on the ETH3D (Schöps et al., 2017) dataset in Fig. 15. We use the indoor weights for NYU (i.e. NeurHal trained on ScanNet) and the outdoor weights for ETH3D (i.e. NeurHal trained on MegaDepth). We report the overlaid and upsampled coarse truncated loss maps, computed following Germain et al. (2021), on low-overlap image pairs. We find that NeurHal is able to robustly outpaint correspondences despite very low visual overlap and strong relative camera motions. These visuals demonstrate the strong generalization ability of NeurHal.

D.2 QUALITATIVE CORRESPONDENCE HALLUCINATION RESULTS AND FAILURE CASES

To further demonstrate the ability of NeurHal to perform visual correspondence hallucination, we report in Fig. 17 and Fig. 18 qualitative results on ScanNet (Dai et al., 2017) and MegaDepth (Li & Snavely, 2018) respectively, on scenes that were not seen at training time. In the target image and in the (negative log) correspondence map, the red dot represents the ground truth keypoint's correspondent. The dashed rectangles represent the borders of the target images. Let us recall that NeurHal outputs probability distributions (a.k.a. correspondence maps) assuming the two input images are partially overlapping. It is essential to keep this assumption in mind when looking at these qualitative results. For instance, concerning the example of Fig. 17 (b) (middle), it is very difficult for our human visual system to be sure that the two images are actually overlapping, and consequently the network prediction seems too good to be true. However, if we assume that there is an overlap, we realize that it is actually possible to perform correspondence hallucination, by extending the two skirting boards, to correctly outpaint the correspondent. In fact, this overlapping assumption has a regularization effect in cases where the covisible image areas show no distinctive regions, and one image could be at an infinite translation from the other, e.g. Fig. 17 (b) (second to last). In Fig. 17 (d) and Fig. 18 (d) we show failure cases where the modes of the correspondence maps predicted by NeurHal are either partially or completely off. We find that failure cases often correlate with strongly ambiguous image pairs, or images that have extremely limited visual overlap.

D.3 QUALITATIVE CAMERA POSE ESTIMATION RESULTS

We show in Fig. 19 qualitative camera pose estimation results on low-overlap images from ScanNet (Dai et al., 2017), for NeurHal and its three best-performing competitors. For every method we display the keypoints used as input to the camera pose estimator in the source image, along with their reprojection at the estimated camera pose in the target image. For methods using the pose estimator from (Chum et al., 2003), the keypoints are those that have been successfully matched. When using the pose estimator of Germain et al. (2021), the keypoints are those involved in the prediction of the dense NRE maps. We color keypoints based on their 2D spatial position in the source image. We find that NeurHal strongly benefits from its outpainting ability, in comparison with all other competitors, which struggle to find both sufficient and reliable correspondences. We also report in Fig. 20 failure cases for NeurHal.
We find that such cases correspond to image pairs exhibiting extremely limited visual overlap, strong camera rotations and overall significant ambiguities.

Figure 15: Qualitative results on the ETH3D dataset: We evaluate NeurHal on outdoor image pairs from the ETH3D (Schöps et al., 2017) dataset and find it is able to outpaint correspondences despite low visual overlap. We report pairs of source and target images and overlay the upsampled coarse loss map corresponding to the source detection (in red) on the target image.

Figure 16: Qualitative results on the NYU dataset: We evaluate NeurHal on indoor images from the NYU (Nathan Silberman & Fergus, 2012) dataset and find it is able to outpaint correspondences despite low visual overlap. We report pairs of source and target images and overlay the upsampled coarse loss map corresponding to the source detection (in red) on the target image.

Figure 17: Additional qualitative ScanNet (Dai et al., 2017) examples, showing for each pair the source image, the target image and the negative log correspondence map −ln C_T, organized into (a) identification/inpainting examples, (b) outpainting examples, (c) challenging examples and (d) borderline/failure cases. See text for details.

Figure 18: Additional qualitative MegaDepth (Li & Snavely, 2018) examples, organized as in Fig. 17. See text for details.

Figure 19: Qualitative camera pose estimation results on low-overlap images from ScanNet (Dai et al., 2017), for SP+SG, LoFTR, DRCNet and NeurHal: We show for every method the keypoints used as input to the camera pose estimator in the source image (left image), along with their predicted reprojection in the target image (right image). We color-code keypoints based on their 2D spatial position in the source image. We also report for every pair and every method the camera pose estimation error in translation and rotation, colored in green when the pose error is less than τ_t = 0.5m and τ_r = 10.0°, and in red otherwise.

Figure 20: NeurHal failure cases on low-overlap images from ScanNet (Dai et al., 2017): We report cases where NeurHal fails to estimate a camera pose with an error lower than τ_t = 0.5m and τ_r = 10.0°. We find these cases often correlate with extremely low covisibility coupled with strong camera rotations.