# AtLoc: Attention Guided Camera Localization

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Bing Wang, Changhao Chen,* Chris Xiaoxuan Lu, Peijun Zhao, Niki Trigoni, Andrew Markham
Department of Computer Science, University of Oxford
firstname.lastname@cs.ox.ac.uk

*Changhao Chen is the corresponding author. Copyright 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Deep learning has achieved impressive results in camera localization, but current single-image techniques typically suffer from a lack of robustness, leading to large outliers. To some extent, this has been tackled by sequential (multi-image) or geometry-constraint approaches, which can learn to reject dynamic objects and illumination conditions to achieve better performance. In this work, we show that attention can be used to force the network to focus on more geometrically robust objects and features, achieving state-of-the-art performance on common benchmarks, even when using only a single image as input. Extensive experimental evidence is provided through public indoor and outdoor datasets. Through visualization of the saliency maps, we demonstrate how the network learns to reject dynamic objects, yielding superior global camera pose regression performance. The source code is available at https://github.com/BingCS/AtLoc.

## Introduction

Location information is of key importance to a wide variety of applications, from virtual reality to delivery drones to autonomous driving. One particularly promising research direction is camera pose regression or localization: the problem of recovering the 3D position and orientation of a camera from an image or set of images. Camera localization has previously been tackled by exploiting the appearance and geometry of a 3D scene, for example key points and lines, but suffers from performance degradation when deployed in the wild (Brachmann et al. 2017; Walch et al. 2017). This is because the handcrafted features change significantly across scenarios due to lighting, blur and scene dynamics, leading to poor global matches. Recent deep learning based approaches are able to automatically extract features and directly recover the absolute camera pose from a single image, without any hand-engineering effort, as was demonstrated in the seminal PoseNet (Kendall, Grimes, and Cipolla 2015). Extensions include the use of different encoder networks, e.g. ResNet in the Hourglass PoseNet (Melekhov et al. 2017), or geometric constraints (Kendall and Cipolla 2017).

Figure 1: Saliency maps of one scene selected from Oxford RobotCar (Maddern et al. 2017) indicate that AtLoc is able to force the neural network model to focus on geometrically robust objects (e.g. building structures, right) rather than environmental dynamics (e.g. moving vehicles, left), compared with PoseNet+ (Brahmbhatt et al. 2018).

Although these techniques show good performance in general, they are plagued by a lack of robustness when faced with dynamic objects or changes in illumination. This is particularly apparent in outdoor datasets, where scenes are highly variable, e.g. due to moving vehicles or pedestrians. To tackle this lack of robustness, further techniques have considered using multiple images as input to the network, with the premise that the network can learn to reject temporally inconsistent features across frames.
Examples include VidLoc (Clark et al. 2017) and the recent MapNet (Brahmbhatt et al. 2018), which achieves state-of-the-art performance in camera pose regression. In this work we pursue an alternative approach to robust camera localization and ask whether we can achieve or even surpass the performance of multi-frame, sequential techniques by learning to attentively focus on parts of the image that are temporally consistent and informative (e.g. buildings), whilst ignoring dynamic parts like vehicles and pedestrians, using a single image as input, as shown in Figure 1. We propose AtLoc (Attention Guided Camera Localization), an attention based pose regression framework to recover camera pose.

Figure 2: An overview of our proposed AtLoc framework, consisting of a Visual Encoder (extracts features from a single image), an Attention Module (computes the attention and re-weights the features), and a Pose Regressor (maps the new features into the camera pose).

Unlike previous methods, our proposed AtLoc requires neither sequential (multiple) frames nor geometry constraints designed and enforced by humans. We show that our model outperforms previous techniques and achieves state-of-the-art results on common benchmarks. It works efficiently across both indoor and outdoor scenarios and is simple and end-to-end trainable, without requiring any hand-crafted geometric loss functions. We provide detailed insight into how incorporating attention allows the network to achieve accurate and robust camera localization. The main contributions of this work are as follows:

- We propose a novel self-attention guided neural network for single image camera localization, allowing accurate and robust camera pose estimation.
- By visualizing the feature saliency map after the attention, we show how our attention mechanism encourages the framework to learn stable features.
- Through extensive experiments in both indoor and outdoor scenarios, we show that our model achieves state-of-the-art performance in pose regression, even outperforming multiple frame (sequential) methods.

## Related Work

**Deep Neural Networks for Camera Localization.** Recent attempts have investigated camera localization using deep neural networks (DNNs). Compared with traditional structure-based methods (Chen et al. 2011; Torii et al. 2013; Liu, Li, and Dai 2017) and image retrieval-based methods (Li et al. 2012; Sattler, Leibe, and Kobbelt 2012; Arandjelovic et al. 2016), DNN-based camera localization methods can automatically learn features from data rather than building a map or a database of landmark features by hand (Sattler et al. 2019). As the seminal work in this vein, PoseNet (Kendall, Grimes, and Cipolla 2015) was the first to adopt a deep neural network to estimate camera pose from a single image. This approach was then extended by leveraging RNNs (e.g. LSTMs) to spatially (Walch et al. 2017; Wang et al. 2018a) and temporally (Clark et al. 2017) improve localization accuracy. Later on, localization performance was further improved by estimating the uncertainty of the global camera pose with a Bayesian CNN (Kendall and Cipolla 2016; Cai, Shen, and Reid 2018) and by replacing the feature extraction architecture with a residual neural network (Melekhov et al. 2017). However, the aforementioned approaches rely on a hand-tuned scale factor to balance the position and rotation losses during the learning process.
To address this issue, a learnable weighted loss and a geometric reprojection loss (Kendall and Cipolla 2017) were introduced to produce more precise results. Recent efforts additionally leverage geometric constraints from paired images (Brahmbhatt et al. 2018; Huang et al. 2019), augment the training data by synthetic generation (Purkait, Zhao, and Zach 2018), or introduce pose-graph optimization with a neural graph model (Parisotto et al. 2018). Instead of imposing temporal information or geometry constraints as in previous work, we develop an attention mechanism that allows DNN-based camera localization to self-regulate, automatically learning to constrain the DNNs to focus on geometrically robust features. Our model outperforms previous approaches and achieves state-of-the-art results on common benchmarks.

**Attention Mechanism.** Our work is related to self-attention mechanisms, which have been widely embedded in various models to capture long-term dependencies (Bahdanau, Cho, and Bengio 2014; Xu et al. 2015; Yang et al. 2019). Self-attention was initially designed for machine translation (Vaswani et al. 2017; Dou et al. 2019; Cheng, Dong, and Lapata 2016), achieving state-of-the-art performance. It has also been integrated with an autoregressive model to generate images, as in the Image Transformer (Parmar et al. 2018; Kingma and Dhariwal 2018), and formalized as a non-local operation to capture spatial-temporal dependencies in video sequences (Wang et al. 2018b; Yuan, Mei, and Zhu 2019). A similar non-local architecture was introduced to Generative Adversarial Networks (GANs) for extracting global long-range dependencies (Zhang et al. 2018; Liu et al. 2019). (Parisotto et al. 2018) used an attention-based recurrent neural network for back-end optimization in a SLAM system, but not for camera relocalization. Despite its success in a wide range of computer vision (Fu et al. 2019; Chen et al. 2019) and natural language processing tasks, self-attention has never been explored for camera pose regression. Our work integrates a non-local-style self-attention mechanism into the camera localization model, showing its effectiveness in correlating robust key features and improving model performance.

## Attention Guided Camera Localization

This section introduces Attention Guided Camera Localization (AtLoc), a self-attention based deep neural network architecture that learns camera poses from a single image. Figure 2 illustrates a modular overview of the proposed framework, consisting of a visual encoder, an attention module and a pose regressor. The scene of a single image is compressed into an implicit representation by the visual encoder. Conditioned on the extracted features, the attention module computes self-attention maps to re-weight the representation into a new feature space. The pose regressor then maps the new, attention-guided features into the camera pose, i.e. the 3-dimensional location and 4-dimensional quaternion (orientation).

### Visual Encoder

The visual encoder serves to extract, from a single monocular image, the features that are necessary for the pose regression task. Previous works (Kendall and Cipolla 2017; Brahmbhatt et al. 2018) showed successful applications of classical convolutional neural network (CNN) architectures to camera pose estimation, e.g. GoogLeNet (Szegedy et al. 2015) and ResNet (He et al. 2016). Among them, the ResNet-based (Brahmbhatt et al.
2018) frameworks achieved more stable and precise localization results than other architectures, because residual connections allow deeper networks to be trained and mitigate the vanishing gradient problem. Therefore, we adopt a 34-layer residual network (ResNet34) as the foundation for the visual encoder in the proposed AtLoc model. The weights of ResNet34 are initialized from a model pretrained for image classification on the ImageNet dataset (Deng et al. 2009). To encourage learning meaningful features for pose regression, the base network is further modified by replacing the final 1000-dimensional fully-connected layer with a C-dimensional fully-connected layer and removing the Softmax layer used for classification, where C is the dimension of the output feature. Considering the efficiency and performance of the model, the dimension is chosen as C = 2048. Given an image $I \in \mathbb{R}^{\hat{C} \times H \times W}$, the features $\mathbf{x} \in \mathbb{R}^{C}$ are extracted via the visual encoder $f_{\mathrm{encoder}}$:

$$\mathbf{x} = f_{\mathrm{encoder}}(I) \quad (1)$$

### Attention Module

Although the ResNet34 based visual encoder is capable of automatically learning the features necessary for camera localization, a neural network trained on certain specific scenes can overfit to featureless appearance or environmental dynamics. This impacts the generalization capacity of the model and degrades its performance on the testing sets, especially in outdoor scenarios with moving vehicles or weather changes. Unlike previous attempts that introduce temporal information (Clark et al. 2017) or geometric constraints (Brahmbhatt et al. 2018), we propose to adopt a self-attention mechanism in our framework. As shown in Figure 2, this self-attention module is conditioned on the features extracted by the visual encoder, and generates an attention map that enforces the model to focus on stable and geometrically meaningful features. It is able to self-regulate without any hand-engineered geometric constraints or prior information.

We adopt a non-local style self-attention, which has been applied in video analysis (Wang et al. 2018b) and image generation (Zhang et al. 2018), in our attention module. This aims to capture the long-range dependencies and global correlations of the image features, which helps generate better attention-guided feature maps from widely separated spatial regions (Wang et al. 2018b). The features $\mathbf{x} \in \mathbb{R}^{C}$ extracted by the visual encoder are first used to compute the dot-product similarity between two embedding spaces $\theta(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$:

$$S(\mathbf{x}_i, \mathbf{x}_j) = \theta(\mathbf{x}_i)^{T} \phi(\mathbf{x}_j), \quad (2)$$

where the embeddings $\theta(\mathbf{x}_i) = W_{\theta} \mathbf{x}_i$ and $\phi(\mathbf{x}_j) = W_{\phi} \mathbf{x}_j$ linearly transform the features at positions $i$ and $j$ into two feature spaces respectively. The normalization factor is defined as $\mathcal{C}(\mathbf{x}_i) = \sum_{j} S(\mathbf{x}_i, \mathbf{x}_j)$ over all feature positions $j$. Given another linear transformation $g(\mathbf{x}_j) = W_{g} \mathbf{x}_j$, the output attention vector $\mathbf{y}$ is calculated via:

$$\mathbf{y}_i = \frac{1}{\mathcal{C}(\mathbf{x}_i)} \sum_{j} S(\mathbf{x}_i, \mathbf{x}_j)\, g(\mathbf{x}_j), \quad (3)$$

where the attention vector $\mathbf{y}_i$ indicates to what extent the neural model focuses on the features $\mathbf{x}_i$ at position $i$. Finally, the self-attention of the input features $\mathbf{x}$ can be written as:

$$\mathbf{y} = \mathrm{Softmax}(\mathbf{x}^{T} W_{\theta}^{T} W_{\phi} \mathbf{x})\, W_{g} \mathbf{x} \quad (4)$$

Furthermore, we add a residual connection back to a linear embedding of the self-attention vectors:

$$\mathrm{Att}(\mathbf{x}) = \alpha(\mathbf{y}) + \mathbf{x}, \quad (5)$$

where the linear embedding $\alpha(\mathbf{y}) = W_{\alpha} \mathbf{y}$ outputs scaled self-attention vectors with learnable weights $W_{\alpha}$.
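To make the structure of Eqs. (1)-(5) concrete, below is a minimal PyTorch-style sketch of the visual encoder and the non-local self-attention block. It is an illustrative sketch rather than the released implementation: the class names (`NonLocalAttention`, `AtLocSketch`) are ours, the attention is computed over the reduced embedding dimensions of the single global feature vector (one plausible reading of Eq. (4) when x is a vector), and the size of the embeddings follows the downsampling ratio described in the next paragraph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


class NonLocalAttention(nn.Module):
    """Non-local self-attention on a global feature vector (Eqs. 2-5).

    theta, phi and g are linear embeddings of the C-dim encoder feature;
    a softmax-normalised similarity re-weights g(x), and a final linear
    layer (W_alpha) plus a residual connection yields Att(x).
    """

    def __init__(self, channels=2048, reduction=8):
        super().__init__()
        d = channels // reduction               # reduced embedding size C / n
        self.theta = nn.Linear(channels, d)     # W_theta
        self.phi = nn.Linear(channels, d)       # W_phi
        self.g = nn.Linear(channels, d)         # W_g
        self.alpha = nn.Linear(d, channels)     # W_alpha, maps back to C dims

    def forward(self, x):                       # x: (B, C)
        t = self.theta(x).unsqueeze(1)          # (B, 1, d)
        p = self.phi(x).unsqueeze(2)            # (B, d, 1)
        g = self.g(x).unsqueeze(2)              # (B, d, 1)
        attn = F.softmax(torch.bmm(p, t), dim=-1)   # pairwise similarities, (B, d, d)
        y = torch.bmm(attn, g).squeeze(2)           # attended features, (B, d)
        return self.alpha(y) + x                # residual connection (Eq. 5)


class AtLocSketch(nn.Module):
    """ResNet34 encoder with a C-dim head, followed by the attention block."""

    def __init__(self, feat_dim=2048):
        super().__init__()
        backbone = models.resnet34(pretrained=True)                 # ImageNet weights
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)  # replace the 1000-d classifier
        self.encoder = backbone                                     # f_encoder, Eq. 1
        self.attention = NonLocalAttention(channels=feat_dim)

    def forward(self, image):                   # image: (B, 3, 256, 256)
        x = self.encoder(image)                 # global feature x in R^C
        return self.attention(x)                # attention-guided feature Att(x)
```

For example, `AtLocSketch()(torch.randn(2, 3, 256, 256))` returns a `(2, 2048)` attention-guided feature, which is then passed to the pose regressor described below.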
In our proposed model, fully-connected layers are implemented to generate the learned weight matrices $W_{\theta}$, $W_{\phi}$, $W_{g}$ and $W_{\alpha}$ in a space of dimension $C/n$, where $C$ is the number of channels of the input feature $\mathbf{x}$ and $n$ is the downsampling ratio for the attention maps. Based on extensive experiments, we found that n = 8 performs best across different datasets.

Table 1: Camera localization results on 7 Scenes (without temporal constraints). For each scene, we compute the median errors in both position and rotation of various single-image based baselines and our proposed method.

| Scene | PoseNet | Bayesian PoseNet | PoseNet Spatial-LSTM | Hourglass | PoseNet17 | AtLoc (Ours) |
|---|---|---|---|---|---|---|
| Chess | 0.32m, 6.60° | 0.37m, 7.24° | 0.24m, 5.77° | 0.15m, 6.17° | 0.13m, 4.48° | 0.10m, 4.07° |
| Fire | 0.47m, 14.0° | 0.43m, 13.7° | 0.34m, 11.9° | 0.27m, 10.8° | 0.27m, 11.3° | 0.25m, 11.4° |
| Heads | 0.30m, 12.2° | 0.31m, 12.0° | 0.21m, 13.7° | 0.19m, 11.6° | 0.17m, 13.0° | 0.16m, 11.8° |
| Office | 0.48m, 7.24° | 0.48m, 8.04° | 0.30m, 8.08° | 0.21m, 8.48° | 0.19m, 5.55° | 0.17m, 5.34° |
| Pumpkin | 0.49m, 8.12° | 0.61m, 7.08° | 0.33m, 7.00° | 0.25m, 7.01° | 0.26m, 4.75° | 0.21m, 4.37° |
| Kitchen | 0.58m, 8.34° | 0.58m, 7.54° | 0.37m, 8.83° | 0.27m, 10.2° | 0.23m, 5.35° | 0.23m, 5.42° |
| Stairs | 0.48m, 13.1° | 0.48m, 13.1° | 0.40m, 13.7° | 0.29m, 12.5° | 0.35m, 12.4° | 0.26m, 10.5° |
| Average | 0.45m, 9.94° | 0.47m, 9.81° | 0.31m, 9.85° | 0.23m, 9.53° | 0.23m, 8.12° | 0.20m, 7.56° |

Table 2: Training and testing sequences of Oxford RobotCar. LOOP is a relatively shorter subset (1120m in total length) and FULL covers a length of 9562m.

| Sequence | Time | Tag | Mode |
|---|---|---|---|
| | 2014-06-26-08-53-56 | overcast | Training |
| | 2014-06-26-09-24-58 | overcast | Training |
| LOOP1 | 2014-06-23-15-41-25 | sunny | Testing |
| LOOP2 | 2014-06-23-15-36-04 | sunny | Testing |
| | 2014-11-28-12-07-13 | overcast | Training |
| | 2014-12-02-15-30-08 | overcast | Training |
| FULL1 | 2014-12-09-13-21-02 | overcast | Testing |
| FULL2 | 2014-12-12-10-45-15 | overcast | Testing |

### Learning Camera Pose

The pose regressor maps the attention-guided features $\mathrm{Att}(\mathbf{x})$ to a location $\mathbf{p} \in \mathbb{R}^{3}$ and a quaternion $\mathbf{q} \in \mathbb{R}^{4}$ through Multilayer Perceptrons (MLPs):

$$[\mathbf{p}, \mathbf{q}] = \mathrm{MLPs}(\mathrm{Att}(\mathbf{x})) \quad (6)$$

Given training images $I$ and their corresponding pose labels $[\hat{\mathbf{p}}, \hat{\mathbf{q}}]$, represented by the camera position $\hat{\mathbf{p}} \in \mathbb{R}^{3}$ and a unit quaternion $\hat{\mathbf{q}} \in \mathbb{R}^{4}$ for orientation, the parameters of the neural network are optimized with an L1 loss via the following loss function (Brahmbhatt et al. 2018):

$$\mathrm{loss}(I) = \|\mathbf{p} - \hat{\mathbf{p}}\|_{1}\, e^{-\beta} + \beta + \|\log \mathbf{q} - \log \hat{\mathbf{q}}\|_{1}\, e^{-\gamma} + \gamma \quad (7)$$

where $\beta$ and $\gamma$ are the weights that balance the position loss and the rotation loss, and $\log \mathbf{q}$ is the logarithmic form of a unit quaternion $\mathbf{q}$, defined as:

$$\log \mathbf{q} = \begin{cases} \dfrac{\mathbf{v}}{\|\mathbf{v}\|} \cos^{-1} u, & \text{if } \|\mathbf{v}\| \neq 0 \\ \mathbf{0}, & \text{otherwise} \end{cases} \quad (8)$$

Here, $u$ denotes the real part of a unit quaternion while $\mathbf{v}$ is its imaginary part. For all scenes, both $\beta$ and $\gamma$ are learned simultaneously during training from approximate initial values $\beta_0$ and $\gamma_0$.

Table 3: Camera localization results on 7 Scenes (with temporal constraints). For each scene, we compare the median errors in both position and rotation of VidLoc, MapNet and our approach.

| Scene | VidLoc | MapNet | AtLoc+ (Ours) |
|---|---|---|---|
| Chess | 0.18m, NA | 0.08m, 3.25° | 0.10m, 3.18° |
| Fire | 0.26m, NA | 0.27m, 11.7° | 0.26m, 10.8° |
| Heads | 0.14m, NA | 0.18m, 13.3° | 0.14m, 11.4° |
| Office | 0.26m, NA | 0.17m, 5.15° | 0.17m, 5.16° |
| Pumpkin | 0.36m, NA | 0.22m, 4.02° | 0.20m, 3.94° |
| Kitchen | 0.31m, NA | 0.23m, 4.93° | 0.16m, 4.90° |
| Stairs | 0.26m, NA | 0.30m, 12.1° | 0.29m, 10.2° |
| Average | 0.25m, NA | 0.21m, 7.77° | 0.19m, 7.08° |

In camera pose regression tasks, quaternions are widely used to represent orientation because they are easy to formulate in a continuous and differentiable way. By normalizing any 4D quaternion to unit length, we can easily map any rotation in 3D space to a valid unit quaternion.
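As a concrete illustration of Eqs. (7) and (8), below is a minimal PyTorch-style sketch of the loss with learnable $\beta$ and $\gamma$. It is a sketch rather than the reference implementation: the names `qlog` and `PoseLoss` are ours, it assumes the regressor outputs a 4D quaternion that is normalized (and sign-fixed to one hemisphere, as discussed next) before the logarithmic map, and the default initial values follow the training settings reported in the Experiments section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def qlog(q):
    """Logarithmic map of a unit quaternion q = (u, v), Eq. (8):
    log q = (v / ||v||) * arccos(u) if ||v|| != 0, and 0 otherwise."""
    u, v = q[..., :1], q[..., 1:]
    v_norm = v.norm(dim=-1, keepdim=True).clamp(min=1e-12)
    return v * torch.acos(u.clamp(-1.0, 1.0)) / v_norm


class PoseLoss(nn.Module):
    """L1 position / log-quaternion loss with learnable weights beta and gamma, Eq. (7)."""

    def __init__(self, beta0=0.0, gamma0=-3.0):   # initial values beta_0, gamma_0 (see Experiments)
        super().__init__()
        # beta and gamma are optimised jointly with the network weights
        self.beta = nn.Parameter(torch.tensor(float(beta0)))
        self.gamma = nn.Parameter(torch.tensor(float(gamma0)))
        self.l1 = nn.L1Loss()

    def forward(self, p_pred, q_pred, p_gt, q_gt):
        # normalise the predicted quaternion and keep all quaternions on the
        # same hemisphere (q and -q encode the same rotation)
        q_pred = F.normalize(q_pred, dim=-1)
        q_pred = torch.where(q_pred[..., :1] < 0, -q_pred, q_pred)
        q_gt = torch.where(q_gt[..., :1] < 0, -q_gt, q_gt)
        pos_term = self.l1(p_pred, p_gt)
        rot_term = self.l1(qlog(q_pred), qlog(q_gt))
        return (pos_term * torch.exp(-self.beta) + self.beta
                + rot_term * torch.exp(-self.gamma) + self.gamma)
```

Because $\beta$ and $\gamma$ are registered as parameters of the loss module, they are updated by the same optimizer as the network weights, matching the description above that both are learned simultaneously during training.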
This representation has one main issue, however: quaternions are not unique. In practice, both $\mathbf{q}$ and $-\mathbf{q}$ can represent the same rotation, because a single rotation can be mapped to either of two hemispheres. To ensure that each rotation has a unique value, all quaternions are restricted to the same hemisphere in this paper.

### Temporal Constraints

Sharing the same flavor as geometry-aware learning methods (Brahmbhatt et al. 2018; Xue et al. 2019; Huang et al. 2019), we extend our proposed AtLoc to AtLoc+ by incorporating temporal constraints between image pairs. Intuitively, temporal constraints can enforce the learning of globally consistent features and thereby improve the overall localization accuracy. In this work, the loss considering temporal constraints is defined as:

$$\mathrm{loss}(I_{\mathrm{total}}) = \sum_{i} \mathrm{loss}(I_i) + \alpha \sum_{i \neq j} \mathrm{loss}(I_{ij}) \quad (9)$$

where $i$ and $j$ index the images, $I_{ij} = (\mathbf{p}_i - \mathbf{p}_j,\ \mathbf{q}_i - \mathbf{q}_j)$ represents the relative pose between images $I_i$ and $I_j$, and $\alpha$ denotes the weight coefficient between the loss of the absolute pose from a single image and that of the relative pose from image pairs.

Figure 3: Saliency maps of two scenes selected from Chess. Each scene contains the saliency maps generated by PoseNet (left) and AtLoc (right) using attention.

## Experiments

To train the proposed network consistently on different datasets, we rescale the images such that the shorter side has a length of 256 pixels. The input images are then normalized to have pixel intensities within the range $-1$ to $1$. The ResNet34 (He et al. 2016) component of our network is initialized with a model pretrained on the ImageNet dataset, while the remaining components follow random initialization. 256 × 256 pixel images are cropped for our network during the training and testing phases, with random and central cropping strategies respectively. For training on the Oxford RobotCar dataset, random ColorJitter is additionally applied as data augmentation, with values of 0.7 for the brightness, contrast and saturation settings and 0.5 for hue. We note that this augmentation step is essential to improve the generalization ability of the model over various weather and time-of-day conditions. We implement our approach in PyTorch, using the ADAM solver (Kingma and Ba 2014) with an initial learning rate of $5 \times 10^{-5}$. The network is trained on an NVIDIA Titan X GPU with the following hyperparameters: mini-batch size of 64, dropout probability of 0.5, and weight initializations of $\beta_0 = 0.0$ and $\gamma_0 = -3.0$. When introducing temporal constraints, we sample consecutive triplets with a spacing of 10 frames and initialize the weight coefficient $\alpha$ to 1.0.

Figure 4: Saliency maps of two scenes selected from Oxford RobotCar generated from models without attention (left: PoseNet+) and with attention (right: AtLoc). Note how AtLoc learns to ignore visually uninformative features, e.g. the road in the top figure, and instead focuses on more distinctive objects, e.g. the skyline in the distance. AtLoc also learns to reject affordance objects, e.g. the bicycles in the bottom figure, yielding more robust global localization.

### Datasets and Baselines

7 Scenes (Shotton et al. 2013) is a dataset consisting of RGB-D images from seven different indoor scenes captured by a handheld Kinect RGB-D camera. The corresponding ground truth camera poses were calculated using KinectFusion. All images were captured in a small-scale indoor office environment at a resolution of 640 × 480 pixels.
Each scene contains two to seven sequences recorded in a single room for training/testing, with 500 or 1000 images per sequence. As a popular dataset for visual relocalization, its sequences were recorded under various camera motion statuses and challenging conditions, e.g. motion blur, perceptual aliasing and textureless regions in the room.

Oxford RobotCar (Maddern et al. 2017) was recorded by an autonomous Nissan LEAF car in Oxford, UK over several periods spanning a year. This dataset exhibits substantial variation across weather conditions, such as sunny and snowy days, as well as different lighting conditions, e.g. dim light and glare, and roadworks. Moreover, many dynamic or affordance objects appear in the scenes (e.g. parked/moving vehicles, cyclists and pedestrians), making this dataset particularly challenging for vision-based relocalization tasks. For a fair comparison, we follow the same evaluation strategy as MapNet (Brahmbhatt et al. 2018; Xue et al. 2019) and use two subsets of this dataset in our experiments, labelled LOOP and FULL (based on their lengths) respectively. More details about these two sequences can be found in Table 2.

Table 4: Camera localization results on the LOOP and FULL routes of Oxford RobotCar. For each sequence, we calculate the mean and median errors in position and rotation of PoseNet+, MapNet and our approaches. PoseNet+ and AtLoc use a single image, while MapNet and AtLoc+ utilize sequential ones.

| Sequence | PoseNet+ (Mean) | PoseNet+ (Median) | AtLoc, Ours (Mean) | AtLoc, Ours (Median) | MapNet (Mean) | MapNet (Median) | AtLoc+, Ours (Mean) | AtLoc+, Ours (Median) |
|---|---|---|---|---|---|---|---|---|
| LOOP1 | 25.29m, 17.45° | 6.88m, 2.06° | 8.61m, 4.58° | 5.68m, 2.23° | 8.76m, 3.46° | 5.79m, 1.54° | 7.82m, 3.62° | 4.34m, 1.92° |
| LOOP2 | 28.81m, 19.62° | 5.80m, 2.05° | 8.86m, 4.67° | 5.05m, 2.01° | 9.84m, 3.96° | 4.91m, 1.67° | 7.24m, 3.60° | 3.78m, 2.04° |
| FULL1 | 125.6m, 27.10° | 107.6m, 22.5° | 29.6m, 12.4° | 11.1m, 5.28° | 41.4m, 12.5° | 17.94m, 6.68° | 21.0m, 6.15° | 6.40m, 1.50° |
| FULL2 | 131.1m, 26.05° | 101.8m, 20.1° | 48.2m, 11.1° | 12.2m, 4.63° | 59.3m, 14.8° | 20.04m, 6.39° | 42.6m, 9.95° | 7.00m, 1.48° |
| Average | 77.70m, 22.56° | 55.52m, 11.7° | 23.8m, 8.19° | 8.54m, 3.54° | 29.8m, 8.68° | 12.17m, 4.07° | 19.7m, 5.83° | 5.38m, 1.74° |

Figure 5: Trajectories on LOOP1 (top), LOOP2 (middle) and FULL1 (bottom) of Oxford RobotCar. The ground truth trajectories are shown as black lines while the red lines are the predictions. The star in each trajectory represents the starting point.

In terms of implementation, we take the images recorded by the centre camera at a resolution of 1280 × 960 as the input to our network. The corresponding ground truth poses are obtained by interpolating the INS measurements.

**Baselines.** To validate the performance of our proposed network, we compare against several competing approaches. For experiments on 7 Scenes, we choose the following mainstream single-image based methods: PoseNet (Kendall, Grimes, and Cipolla 2015), Bayesian PoseNet (Kendall and Cipolla 2016), PoseNet Spatial-LSTM (Walch et al. 2017), Hourglass (Melekhov et al. 2017) and PoseNet17 (Kendall and Cipolla 2017). Moreover, we also report the results of temporal approaches, VidLoc (Clark et al. 2017) and MapNet, for comparison. For the outdoor Oxford RobotCar dataset, Stereo VO (Maddern et al. 2017) and PoseNet+ (i.e. ResNet34 + log q) (Brahmbhatt et al. 2018) are selected as our baselines. It is worth mentioning that PoseNet+ is the best variant of PoseNet (Kendall, Grimes, and Cipolla 2015) on the RobotCar dataset (Brahmbhatt et al. 2018).
Lastly, we also report the performance of MapNet (Brahmbhatt et al. 2018), the state-of-the-art method on this dataset, which uses a sequence of images for relocalization. Note that as sequence based methods can exploit temporal constraints, they generally perform better than single-image based approaches. We nevertheless still compare with MapNet in the evaluation to examine how accurate our single-image based AtLoc is.

### Experiments on 7 Scenes

The 7 Scenes dataset contains seven static indoor scenes with a large number of images captured in an office building. We take all scenes for a comprehensive performance evaluation.

**Quantitative Results.** Table 1 and Table 3 summarize the performance of all methods. Clearly, our method outperforms the other single-image based methods, with a 13% improvement in position accuracy and a 7% improvement in rotation accuracy over the best single-image based baseline. In particular, AtLoc achieves the largest performance gain in large texture-less (such as whiteboard) and highly texture-repetitive (such as stairs) scenarios. AtLoc reduces the position error from 0.35m to 0.26m and the rotation error from 12.4° to 10.5° in the Stairs scene, which is a significant improvement over prior art. In the other, more regular scenes, AtLoc still reaches a comparable accuracy against the baselines. Using only a single image, AtLoc achieves superior accuracy compared with MapNet, despite the use of image sequences and handcrafted geometric constraints in the MapNet design. Last but not least, after incorporating temporal constraints, AtLoc+ further narrows the median position and rotation errors to 0.19m and 7.08° respectively, outperforming MapNet by a large margin.

**Qualitative Results.** To better understand the reasons behind these improvements, we visualize the attention maps of some scenes from 7 Scenes. As shown in Figure 3, by using attention, AtLoc focuses more on geometrically meaningful areas (e.g. key points and lines) rather than feature-less regions, and shows better consistency over time. In contrast, the saliency maps of PoseNet are relatively scattered and tend to focus on random regions in the view. A video comparing the saliency maps of PoseNet and AtLoc in detail can be found at https://youtu.be/xObJ1xwt94.

### Localization Results on Oxford RobotCar

We next evaluate our approach on the Oxford RobotCar dataset. Due to the substantial dynamics over the long collection period, this dataset is very challenging and strictly demands high robustness and adaptability from a relocalization model.

**Quantitative Results.** Table 4 shows the comparison of our methods against PoseNet+ and MapNet. Compared with PoseNet+, AtLoc presents significant improvements on both the LOOP and FULL trajectories. The mean position error is reduced from 25.29m to 8.61m on LOOP1, and from 28.81m to 8.86m on LOOP2. The largest performance gains are observed on FULL1 and FULL2, where our approach outperforms PoseNet+ by 76.5% and 63.3%. When compared against the sequence-based MapNet, our AtLoc shows clear accuracy gains in all cases. Even for the unfavorable routes (FULL1 and FULL2), AtLoc still provides 28.5% and 30.3% improvements over MapNet. Furthermore, AtLoc+ significantly improves the mean position error to 19.7m and the mean rotation error to 5.83° once temporal constraints are introduced.

**Qualitative Results.** We now investigate why AtLoc significantly outperforms the baselines on the Oxford RobotCar dataset.
In Figure 5, we plot the predictions on LOOP1 (top), LOOP2 (middle) and FULL1 (bottom) by Stereo VO, PoseNet+, MapNet, AtLoc and AtLoc+. Stereo VO is the official baseline from Oxford RobotCar. Although Stereo VO produces very smooth predicted trajectories, it suffers from significant drift as the route length increases. Due to strong local similarity, there are many outliers predicted by PoseNet+. These outliers, however, are significantly reduced by AtLoc and AtLoc+. Looking into the saliency maps (Figure 4), we found that PoseNet+ relies heavily on texture-less regions such as the local road surface (top), as well as dynamic cars (middle) and affordance objects such as bicycles (bottom). These regions are either too similar in appearance or unreliable due to changes over time, making pose estimation difficult. By contrast, our attention-guided AtLoc is able to automatically focus on unique, static and stable areas/objects, including vanishing lines and points (top) and buildings (middle and bottom). These areas are tightly related to the latent geometric features of an environment, enabling robust pose estimation in the wild.

To further understand the efficacy of the attention mechanism, we depict the feature distances for a sequence of images. Specifically, we select a starting frame in the trajectory and then calculate the feature distances (L2) of subsequent frames to the starting frame. Features are extracted by PoseNet+ and AtLoc respectively, with the intention of understanding to what extent the attention mechanism helps extract robust features. We plot the distance profile for two cases: (i) dynamic vehicles and (ii) changing illumination. As shown in Fig. 6 (left), when the camera is static (i.e. the data-collection car is not moving), PoseNet+ is sensitive to dynamic objects entering the scene, resulting in a large variation of distances. On the contrary, thanks to the adopted attention mechanism, AtLoc is robust to these moving vehicles and provides more stable features overall. Distance spikes are only observed when a large truck enters/leaves the scene, in which case a substantial portion of the view is blocked/revealed to the camera. On the right side of Fig. 6, the features extracted by PoseNet+ suffer from illumination changes and show abrupt shifts under different levels of glare. The features extracted by AtLoc, however, change consistently as the camera moves forward, agnostic to the various lighting conditions.

### Ablation Study and Efficiency Evaluation

We conduct an ablation study of the introduced attention module on the 7 Scenes and Oxford RobotCar datasets.

Table 5: Ablation study of AtLoc on 7 Scenes and Oxford RobotCar. AtLoc (Basic) denotes the model without using attention.

| Scene / Sequence | AtLoc (Basic) | AtLoc (Basic+LSTM) | AtLoc | AtLoc+ |
|---|---|---|---|---|
| Chess | 0.11m, 4.29° | 0.13m, 4.26° | 0.10m, 4.07° | 0.10m, 3.18° |
| Fire | 0.29m, 12.1° | 0.27m, 11.7° | 0.25m, 11.4° | 0.26m, 10.8° |
| Heads | 0.19m, 12.2° | 0.16m, 12.3° | 0.16m, 11.8° | 0.14m, 11.4° |
| Office | 0.19m, 6.35° | 0.20m, 5.74° | 0.17m, 5.34° | 0.17m, 5.16° |
| Pumpkin | 0.22m, 5.05° | 0.26m, 4.19° | 0.21m, 4.37° | 0.20m, 3.94° |
| Kitchen | 0.25m, 5.27° | 0.19m, 4.63° | 0.23m, 5.42° | 0.16m, 4.90° |
| Stairs | 0.30m, 11.3° | 0.29m, 12.1° | 0.26m, 10.5° | 0.29m, 10.2° |
| Average (7 Scenes) | 0.22m, 8.07° | 0.21m, 7.83° | 0.20m, 7.56° | 0.19m, 7.08° |
| LOOP1 (RobotCar) | 25.29m, 17.45° | 29.71m, 15.72° | 8.61m, 4.58° | 7.82m, 3.62° |
| LOOP2 (RobotCar) | 28.81m, 19.62° | 32.79m, 17.76° | 8.86m, 4.67° | 7.24m, 3.60° |
| FULL1 (RobotCar) | 125.6m, 27.10° | 48.29m, 17.18° | 29.6m, 12.4° | 21.0m, 6.15° |
| FULL2 (RobotCar) | 131.1m, 26.05° | 67.62m, 11.40° | 48.2m, 11.1° | 42.6m, 9.95° |
| Average (RobotCar) | 77.70m, 22.56° | 44.60m, 15.52° | 23.8m, 8.19° | 19.7m, 5.83° |
In Table 5, AtLoc is compared with a basic version without the attention module, an LSTM version that replaces the attention module with an LSTM module, and a sequence-enhanced version with temporal constraints. The remaining modules are kept the same for a fair comparison. The comparison between AtLoc (Basic) and AtLoc indicates that adopting self-attention in the pose regression model clearly increases performance on both datasets: it yields a 9% improvement in position accuracy and 6% in rotation accuracy on the 7 Scenes dataset, and AtLoc achieves an average position error of 23.8m and an average rotation error of only 8.19° on the Oxford RobotCar dataset. The LSTM-based AtLoc outperforms only the basic AtLoc, still falling behind the attention-based model. With temporal constraints, AtLoc+ yields the best performance, obtaining errors of 0.19m, 7.08° on 7 Scenes, and 19.7m, 5.83° on RobotCar.

Figure 6: Feature distance comparisons under different dynamic disturbances: (a) dynamic vehicles and (b) changing illumination. The feature distances of AtLoc change reasonably with the motion status of the camera and are agnostic to the various dynamics, while PoseNet suffers in both experiments.

To evaluate the efficiency of our proposed AtLoc, we analyze the average running time of three models: MapNet, PoseLSTM and AtLoc. Among the three, MapNet consumes the longest running time of 9.4ms per frame, as it needs to process additional data from other sensory inputs and a sequence of images to apply geometric constraints. Due to the time-consuming recursive operations in LSTMs, PoseLSTM takes 9.2ms per frame, 3.7ms more than its corresponding basic model PoseNet. In contrast, our proposed AtLoc achieves a good balance between computational efficiency and localization accuracy, consuming only 6.3ms per frame while obtaining the best localization performance.

## Conclusion and Discussion

Camera localization is a challenging task in computer vision due to scene dynamics and the high variability of environment appearance. In this work, we presented a novel study of self-attention guided camera localization from a single image. The introduced self-attention encourages the framework to learn geometrically robust features, mitigating the impact of dynamic objects and changing illumination. We demonstrate state-of-the-art results, even surpassing sequence-based techniques in challenging scenarios. Further work includes refining the attention module and determining whether it can improve multi-frame camera pose regression.

## References

- Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; and Sivic, J. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR.
- Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. In ICLR.
- Brachmann, E.; Krull, A.; Nowozin, S.; Shotton, J.; Michel, F.; Gumhold, S.; and Rother, C. 2017. DSAC: Differentiable RANSAC for camera localization. In CVPR.
- Brahmbhatt, S.; Gu, J.; Kim, K.; Hays, J.; and Kautz, J. 2018. Geometry-aware learning of maps for camera localization. In CVPR.
- Cai, M.; Shen, C.; and Reid, I. D. 2018. A hybrid probabilistic model for camera relocalization. In BMVC.
- Chen, D. M.; Baatz, G.; Köser, K.; Tsai, S. S.; Vedantham, R.; Pylväinen, T.; Roimela, K.; Chen, X.; Bach, J.; Pollefeys, M.; et al. 2011. City-scale landmark identification on mobile devices. In CVPR.
- Chen, C.; Rosa, S.; Miao, Y.; Lu, C. X.; Wu, W.; Markham, A.; and Trigoni, N. 2019. Selective sensor fusion for neural visual-inertial odometry. In CVPR.
- Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading. In EMNLP.
- Clark, R.; Wang, S.; Markham, A.; Trigoni, N.; and Wen, H. 2017. VidLoc: A deep spatio-temporal model for 6-DoF video-clip relocalization. In CVPR.
- Deng, J.; Dong, W.; Socher, R.; Li, L.; Kai Li; and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR.
- Dou, Z.-Y.; Tu, Z.; Wang, X.; Wang, L.; Shi, S.; and Zhang, T. 2019. Dynamic layer aggregation for neural machine translation with routing-by-agreement. In AAAI.
- Fu, Y.; Wang, X.; Wei, Y.; and Huang, T. 2019. STA: Spatial-temporal attention for large-scale video-based person re-identification. In AAAI.
- He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
- Huang, Z.; Xu, Y.; Shi, J.; Zhou, X.; Bao, H.; and Zhang, G. 2019. Prior guided dropout for robust visual localization in dynamic environments. In ICCV.
- Kendall, A., and Cipolla, R. 2016. Modelling uncertainty in deep learning for camera relocalization. In ICRA.
- Kendall, A., and Cipolla, R. 2017. Geometric loss functions for camera pose regression with deep learning. In CVPR.
- Kendall, A.; Grimes, M.; and Cipolla, R. 2015. PoseNet: A convolutional network for real-time 6-DoF camera relocalization. In CVPR.
- Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. In ICLR.
- Kingma, D. P., and Dhariwal, P. 2018. Glow: Generative flow with invertible 1x1 convolutions. In NIPS.
- Li, Y.; Snavely, N.; Huttenlocher, D.; and Fua, P. 2012. Worldwide pose estimation using 3D point clouds. In ECCV.
- Liu, A.; Liu, X.; Fan, J.; Ma, Y.; Zhang, A.; Xie, H.; and Tao, D. 2019. Perceptual-sensitive GAN for generating adversarial patches. In AAAI.
- Liu, L.; Li, H.; and Dai, Y. 2017. Efficient global 2D-3D matching for camera localization in a large-scale 3D map. In ICCV.
- Maddern, W.; Pascoe, G.; Linegar, C.; and Newman, P. 2017. 1 year, 1000 km: The Oxford RobotCar dataset. The International Journal of Robotics Research 36(1):3-15.
- Melekhov, I.; Ylioinas, J.; Kannala, J.; and Rahtu, E. 2017. Image-based localization using hourglass networks. In ICCV.
- Parisotto, E.; Singh Chaplot, D.; Zhang, J.; and Salakhutdinov, R. 2018. Global pose estimation with an attention-based recurrent network. In CVPR Workshops.
- Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, Ł.; Shazeer, N.; Ku, A.; and Tran, D. 2018. Image Transformer. In ICML.
- Purkait, P.; Zhao, C.; and Zach, C. 2018. Synthetic view generation for absolute pose regression and image synthesis. In BMVC.
- Sattler, T.; Zhou, Q.; Pollefeys, M.; and Leal-Taixe, L. 2019. Understanding the limitations of CNN-based absolute camera pose regression. In CVPR.
- Sattler, T.; Leibe, B.; and Kobbelt, L. 2012. Improving image-based localization by active correspondence search. In ECCV.
- Shotton, J.; Glocker, B.; Zach, C.; Izadi, S.; Criminisi, A.; and Fitzgibbon, A. 2013. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, 2930-2937.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In CVPR.
- Torii, A.; Sivic, J.; Pajdla, T.; and Okutomi, M. 2013. Visual place recognition with repetitive structures. In CVPR.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.
N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS.
- Walch, F.; Hazirbas, C.; Leal-Taixe, L.; Sattler, T.; Hilsenbeck, S.; and Cremers, D. 2017. Image-based localization using LSTMs for structured feature correlation. In CVPR.
- Wang, S.; Clark, R.; Wen, H.; and Trigoni, N. 2018a. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. The International Journal of Robotics Research 37(4-5):513-542.
- Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018b. Non-local neural networks. In CVPR.
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
- Xue, F.; Wang, X.; Yan, Z.; Wang, Q.; Wang, J.; and Zha, H. 2019. Local supports global: Deep camera relocalization with sequence enhancement. In ICCV.
- Yang, B.; Li, J.; Wong, D. F.; Chao, L. S.; Wang, X.; and Tu, Z. 2019. Context-aware self-attention networks. In AAAI.
- Yuan, Y.; Mei, T.; and Zhu, W. 2019. To find where you talk: Temporal sentence localization in video with attention based location regression. In AAAI.
- Zhang, H.; Goodfellow, I.; Metaxas, D.; and Odena, A. 2018. Self-attention generative adversarial networks. In ICML.