# Stereo Neural Vernier Caliper

Shichao Li*¹, Zechun Liu¹,², Zhiqiang Shen²,³,¹, Kwang-Ting Cheng¹
¹Hong Kong University of Science and Technology, ²Carnegie Mellon University, ³Mohamed bin Zayed University of Artificial Intelligence
slicd@cse.ust.hk, timcheng@ust.hk

## Abstract

We propose a new object-centric framework for learning-based stereo 3D object detection. Previous studies build scene-centric representations that do not consider the significant variation among outdoor instances and thus lack the flexibility and functionalities that an instance-level model can offer. We build such an instance-level model by formulating and tackling a local update problem, i.e., how to predict a refined update given an initial 3D cuboid guess. We demonstrate how solving this problem can complement scene-centric approaches in (i) building a coarse-to-fine multi-resolution system, (ii) performing model-agnostic object location refinement, and (iii) conducting stereo 3D tracking-by-detection. Extensive experiments demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on the KITTI benchmark. Code and pre-trained models are available at https://github.com/Nicholasli1995/SNVC.

*The corresponding author is Shichao Li. Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Stereo Neural Vernier Caliper (SNVC) consists of the main scale network M that models a coarse global scene and the Vernier scale network V that models a fine local scene. M predicts a coarse location $p_c$, while V takes a closer look around $p_c$ and makes finer measurements.

## Introduction

Accurate perception of surrounding objects' 3D attributes is indispensable for autonomous driving, robot navigation, and traffic surveillance. Active range sensors such as LiDAR measure the 3D scene geometry directly to perform precise 3D localization (Lang et al. 2019; Shi, Wang, and Li 2019). However, LiDAR sensors incur a high cost and can be limited in perception range, where distant objects are only captured with very few points. On the other hand, passive sensors like cameras are inexpensive, yet the depth information is lost during the image formation process, which makes 3D scene understanding a challenging inverse problem. Estimating depth from a single RGB image is ill-posed and leads to limited 3D object detection performance (Brazil and Liu 2019; Wang et al. 2021a; Lu et al. 2021). Stereo cameras, simulating the binocular human vision system, are the minimum sensor configuration that can exploit multi-view geometry for more reliable depth inference. Studying stereo 3D object detection (S3DOD) is thus not only a pursuit of the vision community aimed at visual scene understanding, but also offers practical value in complementing active sensors through multi-sensor fusion (Liang et al. 2019).

Recent state-of-the-art (SOTA) S3DOD approaches take a scene-centric view and build a data representation for the whole scene. We use a representative work (Chen et al. 2020) as our baseline, which uses the estimated depth to build a voxel-based scene representation for object detection. In contrast, our study promotes an object-centric viewpoint and explores instance-level analysis for S3DOD. The following practical considerations motivate this study; they demand attributes of the object-centric viewpoint that are not offered by the scene-centric counterpart.
- Cost-accuracy trade-off under depth variation: a distant object usually has a lower-resolution (LR) feature representation than a nearby one, which makes it difficult to precisely recover its 3D attributes. Naively re-computing high-resolution (HR) features and using a finer voxel grid for the whole scene leads to prohibitive computational cost and cubic memory growth. It is also unnecessary if the model performance for nearby instances has already saturated. If an instance-level model is available, a multi-resolution (MR) system can be built to mitigate this problem. Such a system benefits from a coarse-to-fine design that computes an LR global representation and focuses on the tiny instances with complementary HR features¹.
- Capability for efficiently handling new frames: videos are more prevalent in real-world scenarios than two static stereo images. Scene-centric approaches need to build a scene representation for every new pair of frames. With an instance-level model, one only needs to do so for some key frames and can conduct tracking-by-detection (Andriluka, Roth, and Schiele 2008), i.e., processing only a portion of new frames given regions of interest (RoIs) implied by past detections.
- Flexibility in video applications: certain objects are more important in a 3D scene, e.g., a car heading towards the driver should draw more attention than a vehicle leaving the field of view. Instance-level analysis offers the flexibility to prioritize certain regions when analyzing new frames.

¹Such HR features can be obtained computationally or physically with an actively zooming camera (Bellotto et al. 2009).

To make an instance-level analysis model useful in all the aforementioned scenarios, we design it as a refinement model that can be applied in a recurrent manner. Given an initial 3D bounding box guess, the model should reason about the 3D space around the cuboid guess and give an updated prediction. Note that the initial guess can vary across user scenarios, e.g., it can be a proposal from an LR global model or the prediction from the last frame. Our research question is thus: how to design such an instance-level refinement model in the stereo perception setting?

We tackle this problem by designing an instance-level neural network V, which builds a voxel-based local scene representation and scores each voxel according to how likely it is to be the updated location of an object part. V can be combined with a global model M to form an MR S3DOD framework, as illustrated in Fig. 1. M performs scene-level depth estimation and outputs coarse 3D object proposals. Conditioned on each 3D proposal, V further extracts HR features and refines its 3D attributes. We name this framework Stereo Neural Vernier Caliper (SNVC) since it resembles a Vernier caliper, where M (the main scale) models the 3D scene with a coarse voxel grid while V (the Vernier scale) models an HR local scene conditioned on the initial guess. Compared to prior art, our approach endows an S3DOD system with the advantages discussed above and can model fine-grained structures for important regions with a tractable number of voxels. This ability leads to superior detection performance, especially for the hard instances, i.e., the tiny and occluded ones. This paper's contributions are summarized as:

- We propose the first MR framework for voxel-based S3DOD.
- The new instance-level model V within it is, to our best knowledge, also the first HR voxel representation learning approach tackling the local update problem.
- We study the transferability of V and demonstrate that it can be used as a model-agnostic, plug-and-play refinement module for S3DOD that complements many existing scene-centric approaches.
- SNVC outperforms previously published results on the KITTI benchmark for S3DOD at the date of submission (Sep 8th, 2021).

## Related Work

Our study is relevant to the following research directions while having distinct contributions.

Learning-based 3D object detection aims to learn a mapping from sensor input to 3D bounding box representations (Chen et al. 2015). Depending on the sensor modality, two parallel lines of research are vision-based methods (Roddick, Kendall, and Cipolla 2019; Ke et al. 2020; Weng and Kitani 2019; Wang et al. 2021b; Reading et al. 2021) and LiDAR-based approaches (Chen et al. 2017; Yan, Mao, and Li 2018; Zhou and Tuzel 2018; Qi et al. 2018; Li, Wang, and Wang 2021). Our approach lies in the former, which does not require expensive range sensors. Compared with previous stereo vision-based studies (Chen et al. 2020; Garg et al. 2020) that focus on global scene modeling, we deal with a different local update problem and dedicate a new model to HR instance-level analysis. Our design can complement previous 3D object detectors that do not share the flexibility and high precision offered by our approach.

Instance-level analysis in 3D object detection builds a feature representation for an instance proposal to estimate its high-quality 3D attributes. FQ-Net (Liu et al. 2019) draws a projected cuboid proposal on an instance patch and regresses its 3D Intersection-over-Union (IoU) for location refinement. RAR-Net (Liu et al. 2020) formulates a reinforcement learning framework for iterative instance pose refinement. 3D-RCNN (Kundu, Li, and Rehg 2018) uses instance shape as auxiliary supervision yet requires extra annotations, which are not needed by our approach. Notably, all these methods only consider the monocular case and cannot utilize stereo imagery. ZoomNet (Xu et al. 2020) and Disp R-CNN (Sun et al. 2020) construct point-based representations for each instance proposal and require extra mask annotations during training. Such representations also lose semantic features and are less robust for distant and occluded objects, which have few foreground points. We instead propose to learn a voxel-based representation that encodes both semantic and geometric features.

Voxel-based representation is a classical and simple data structure for encoding 3D features and is widely adopted in image-based rendering (Seitz and Dyer 1999) and multi-view reconstruction (Vogiatzis, Torr, and Cipolla 2005). Early studies utilize hand-crafted features and energy-based models to encode prior knowledge (Snow, Viola, and Zabih 2000), while recent deep learning-based approaches (Choy et al. 2016; Riegler, Osman Ulusoy, and Geiger 2017) directly learn such representations from data. Our approach learns a voxel-based neural representation for the unique local update problem under the stereo perception setting.

High-resolution neural networks were recently proposed (Sun et al. 2019) to model fine-grained spatial structure details and benefit tasks that involve precise localization. However, later studies only focus on monocular and 2D tasks (Wang et al. 2020; Cheng et al. 2020; Li et al. 2020, 2021).
This work instead studies HR representation learning for S3DOD and builds an unprecedentedly fine 3D spatial resolution of 3 centimeters in real self-driving scenarios.

Multi-resolution volumetric representation was previously studied for 3D shape representation (Riegler, Osman Ulusoy, and Geiger 2017) and reconstruction (Blaha et al. 2016), where less important regions (e.g., free space) are represented with a coarse voxel grid and a finer voxel grid is used adaptively for regions close to object surfaces. The octree (Laine and Karras 2010) is a popular choice for generating an MR partition of a 3D region. We argue that there is also a large variation of importance among different regions in S3DOD, where regions near objects have a larger influence on the detection performance. We thus introduce the idea of varying resolution to voxel-based S3DOD for the first time with a new MR system.

## Cascaded 3D Object Detection

SNVC can be formulated as a cascaded model represented as the set {M, V}, consisting of the main scale network M and the Vernier scale network V. Given an RGB image pair (L, R) captured by calibrated stereo cameras, M builds a coarse representation of the global 3D scene and predicts N 3D bounding box proposals $\{p^c_q\}_{q=1}^{N}$ as

$$\mathcal{M}(L, R; \theta_{\mathcal{M}}) = \{p^c_q\}_{q=1}^{N}. \tag{1}$$

Conditioned on each coarse proposal $p^c_q$, V deals with the local update problem by constructing an HR local scene representation and inferring an offset $\delta p_q$ to obtain the refined pose $p^f_q = p^c_q + \delta p_q$:

$$\mathcal{V}(L, R; \theta_{\mathcal{V}} \mid p^c_q) = \delta p_q. \tag{2}$$

This framework is general and does not enforce any assumption on the architecture design of M. As we will show, one design of V can handle coarse predictions from different implementations of M and is thus model-agnostic. In light of this, we only detail our architecture of V in the main text. The M used in our final system is sketched in Fig. 3; its architecture details are in the supplementary material (SM).

## High-resolution Instance-level Update

In this study, we propose to design V based on a three-step procedure (a code sketch of the overall data flow follows below):

- Space partitioning: we build a dense voxel grid in a 3D RoI conditioned on the coarse 3D proposal, where each voxel is a candidate for a part location update.
- Deep voxel coloring: we aggregate high-level features extracted by a deep neural network for each voxel.
- Voting: we extract spatial structural information from the colored voxel grid and score each voxel according to how likely it is to be a location update of an object part. The pose update is then estimated via a vote based on the predicted object part locations and confidences.

The first step builds a very dense local 3D scene representation aiming at high-precision 3D object localization. In step two we study two strategies that correspond to two user scenarios and lead to two architecture variants. Our strategy in step three utilizes several pre-defined cuboid parts for robust inference under occlusion. The following subsections describe each step in detail.

Figure 2: Illustration of the local update problem. A 3D region-of-interest (brown) is defined based on the coarse proposal (red). Each candidate (yellow ball) aggregates high-level stereo visual features. Such high-resolution volumetric features are used to infer the ground truth 3D bounding box (green). Best viewed in color. A coarse grid is drawn for visualization while the real grid is much finer.
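To make the data flow concrete, the following is a minimal sketch of the cascaded inference of Eqs. (1)–(2) wrapped around the three-step procedure of V. It is an illustration only: `main_scale`, `vernier_scale`, and their methods are hypothetical placeholders, not the released implementation.

```python
import numpy as np

def snvc_detect(left_img, right_img, main_scale, vernier_scale):
    """Coarse-to-fine sketch of Eqs. (1)-(2): M proposes, V refines.

    `main_scale` and `vernier_scale` stand in for the two networks; boxes are
    7-vectors (x, y, z, h, w, l, theta) in camera coordinates.
    """
    # Eq. (1): the main scale network M yields N coarse proposals.
    coarse_proposals = main_scale(left_img, right_img)

    refined = []
    for p_c in coarse_proposals:
        # Step 1: partition a 3D RoI around p_c into a fine voxel grid.
        grid = vernier_scale.build_local_grid(p_c)
        # Step 2: "color" every voxel with high-level stereo features.
        feats = vernier_scale.aggregate_features(left_img, right_img, grid)
        # Step 3: score voxels as part candidates and vote for an offset,
        # Eq. (2); the refined pose is p_f = p_c + delta_p.
        delta_p = vernier_scale.vote_update(feats)
        refined.append(np.asarray(p_c) + delta_p)
    return refined
```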
Candidate generation. A coarse 3D bounding box prediction $p^c_q$ is represented as a 7-tuple $p^c_q = (x_q, y_q, z_q, h_q, w_q, l_q, \theta_q)$, where $(x_q, y_q, z_q)$, $(h_q, w_q, l_q)$ and $\theta_q$ denote its translation, size (height, width, length), and orientation, respectively. These quantities are represented in the camera coordinate system (CCS). We define a 3D RoI $r_q$ around $p^c_q$, within which V needs to predict the instance pose update. We use a cuboid RoI for convenience and represent $r_q$ as $(x_q, y_q, z_q, H, W, L, \theta_q)$: $r_q$ has the same 4-D pose $(x_q, y_q, z_q, \theta_q)$ as $p^c_q$ and a pre-defined range $(H, W, L)$. The Vernier scale is a fine partition of $r_q$, represented as a 3D voxel grid $\{d_{i,j,k}\}_{i=1,j=1,k=1}^{N_H, N_W, N_L}$ with $N_H$, $N_W$, and $N_L$ voxels uniformly sampled along the height, width and length directions, respectively. Each voxel $d_{i,j,k}$ is a candidate for a precise location update. Fig. 2 illustrates an $r_q$ and its corresponding candidates. We represent $r_q$ with the camera coordinates of its center and eight corners by applying a homography encoding translation and rotation,

$$H_{4\times4} = \begin{bmatrix} \cos\theta_q & 0 & \sin\theta_q & x_q \\ 0 & 1 & 0 & y_q \\ -\sin\theta_q & 0 & \cos\theta_q & z_q \\ 0 & 0 & 0 & 1 \end{bmatrix}, \tag{3}$$

to $O_{3\times9}$, which stacks the object-coordinate locations of the 9 parts, i.e., the cuboid center and its eight corners, as defined in Eq. (4). Similarly, $d_{i,j,k}$ can be represented in the CCS as

$$d_{i,j,k} = H_{4\times4}\begin{bmatrix} -N_L\,\Delta L/2 + (k-1)\,\Delta L \\ -N_H\,\Delta H/2 + (i-1)\,\Delta H \\ -N_W\,\Delta W/2 + (j-1)\,\Delta W \\ 1 \end{bmatrix}, \tag{5}$$

where $[-N_L\Delta L/2, -N_H\Delta H/2, -N_W\Delta W/2]$ is the left-back-top corner of the 3D grid, and $[\Delta L, \Delta H, \Delta W] = [L/N_L, H/N_H, W/N_W]$ gives the grid resolution. In experiments, we specify $\Delta L$, $\Delta H$, $\Delta W$ as 3, 10, and 3 centimeters, respectively, and $[N_L, N_H, N_W] = [192, 32, 128]$. Such 3-cm resolution is in contrast to the 20-cm or even coarser resolution used in our M and in previous studies (Chen et al. 2020; Li, Su, and Zhao 2021; Garg et al. 2020).

Feature aggregation. We design two types of feature aggregation strategies depending on whether one can reuse features computed by M. These two strategies result in two model variants (V-A and V-S). For the V-A (model-agnostic) models, we assume no knowledge of the architecture of M and do not reuse its features. This type of model is useful for dealing with proposals from different 3DOD models or for handling new frames that have no features computed yet. In this case, we aggregate a feature for each candidate from the left/right visual features $L_F$/$R_F$, which are obtained from a fully convolutional network, $L_F/R_F = \mathcal{N}_{2D}(L/R; \theta_{\mathcal{N}})$. Specifically, the feature vector for $d_{i,j,k}$ is

$$F_{i,j,k} = \mathcal{W}(L_F, K_l d_{i,j,k}) \oplus \mathcal{W}(R_F, K_r d_{i,j,k}), \tag{6}$$

where $K_l$, $K_r$ are the intrinsic parameters of the left/right cameras, $\mathcal{W}$ is a warping function such that $\mathcal{W}(L_F, K_l d_{i,j,k})$ extracts the feature at the projected location $K_l d_{i,j,k}$ on the feature maps $L_F$, and $\oplus$ concatenates the left and right features. $\mathcal{W}$ is implemented as bi-linear interpolation. For the V-S (model-specific) models, we assume that M also constructs a volumetric scene representation (e.g., a global cost volume $C_g$). This case arises when one builds a two-stage voxel-based detector where V has access to pre-computed features. Here we sample $F_{i,j,k}$ from such features as $F_{i,j,k} = \mathcal{W}(C_g, d_{i,j,k})$ to save computation.

After aggregation, the grid is colored with high-level visual features. This is in contrast to the low-level pixel intensities used in the classical voxel coloring study (Seitz and Dyer 1999). In addition, all voxels are colored, instead of coloring only the consistent voxels (Seitz and Dyer 1999), because the Lambertian scene assumption does not hold. Subsequently, a network $\mathcal{N}_{3D}$ processes the aggregated volumetric representation and predicts the outputs detailed below.
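As a concrete illustration of Eqs. (3)–(6), the sketch below builds the candidate grid in camera coordinates and colors it with stereo features. It is a simplified, hypothetical rendition under stated assumptions: nearest-neighbour sampling stands in for the bi-linear interpolation used by $\mathcal{W}$, generic 3x4 projection matrices are assumed for the two views, and feature maps are assumed to be laid out as (channels, height, width).

```python
import numpy as np

def local_grid_camera_coords(pose, grid_shape=(32, 128, 192),
                             voxel_size=(0.10, 0.03, 0.03)):
    """Candidate voxel centers of the Vernier scale, Eqs. (3)-(5).

    pose: coarse proposal (x, y, z, h, w, l, theta); only its 4-DoF pose
    (x, y, z, theta) places the RoI. grid_shape = (N_H, N_W, N_L) and
    voxel_size = (dH, dW, dL) in meters follow the settings in the text.
    """
    x, y, z, _, _, _, theta = pose
    n_h, n_w, n_l = grid_shape
    d_h, d_w, d_l = voxel_size

    # RoI-local coordinates: corner at [-N_L*dL/2, -N_H*dH/2, -N_W*dW/2],
    # uniform spacing along the length, height and width directions.
    ls = -n_l * d_l / 2 + np.arange(n_l) * d_l
    hs = -n_h * d_h / 2 + np.arange(n_h) * d_h
    ws = -n_w * d_w / 2 + np.arange(n_w) * d_w
    L, H, W = np.meshgrid(ls, hs, ws, indexing="ij")        # (N_L, N_H, N_W)
    local = np.stack([L, H, W, np.ones_like(L)], axis=-1)   # homogeneous coords

    # Rigid transform H_{4x4} of Eq. (3): yaw rotation plus RoI translation.
    c, s = np.cos(theta), np.sin(theta)
    T = np.array([[c, 0.0, s, x],
                  [0.0, 1.0, 0.0, y],
                  [-s, 0.0, c, z],
                  [0.0, 0.0, 0.0, 1.0]])
    cam = local @ T.T                                       # Eq. (5)
    return cam[..., :3]                                     # (N_L, N_H, N_W, 3)


def aggregate_stereo_features(cam_pts, feat_l, feat_r, P_l, P_r):
    """Deep voxel coloring, Eq. (6): project every candidate into both views
    and gather image features. feat_*: (C, H_img, W_img); P_*: 3x4 projection
    matrices (intrinsics and, for the right view, the stereo baseline)."""
    pts = cam_pts.reshape(-1, 3)
    pts_h = np.concatenate([pts, np.ones((pts.shape[0], 1))], axis=1)

    def sample(feat, P):
        uvw = pts_h @ P.T                                    # pinhole projection
        uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)  # nearest neighbour
        u = np.clip(uv[:, 0], 0, feat.shape[2] - 1)
        v = np.clip(uv[:, 1], 0, feat.shape[1] - 1)
        return feat[:, v, u]                                 # (C, num_candidates)

    # Left and right features are combined per candidate (concatenation here).
    return np.concatenate([sample(feat_l, P_l), sample(feat_r, P_r)], axis=0)
```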
Output representation. For each $p^c_q$, we encode the ground truth update as dense multi-part confidences, since we implement $\mathcal{N}_{3D}$ as a CNN, which is known to be well suited to dense classification tasks. For part $m$ ($m = 1, \dots, K$), we assign each candidate a ground truth confidence $c^m_{i,j,k}$. In the implementation we adopt the ground plane assumption (Roddick, Kendall, and Cipolla 2019) for autonomous driving datasets and ignore the offset in the height direction, which leads to a Bird's Eye View (BEV) confidence map $c^m_{j,k}$. This does not reduce the generality of our framework, and one can keep the original dimension if the height offset is significant in a different dataset. Denoting the ground truth location for part $m$ as $(j^*, k^*)$, the confidence map is defined as

$$c^m_{j,k} = e^{-\left[(j - j^*)^2 + (k - k^*)^2\right]}. \tag{7}$$

In total, $K = 9$ parts, including the ground truth cuboid center and its 8 corners, are used, defined similarly to Eq. (4). To reduce quantization errors, we transform the predicted confidences $\hat{c}_{j,k}$ into x-z coordinates $\hat{Y}_{K\times2}$ using several convolution layers, as in (Li et al. 2021).

Model instantiation. We have specified the inputs and outputs of V, with detailed parameters and the network components $\mathcal{N}_{2D}$ and $\mathcal{N}_{3D}$ abstracted away. This makes the framework flexible, and one can specify these parameters and sub-networks based on one's computation budget. In our study, $\mathcal{N}_{2D}$ is implemented as HRNet-W32 (Sun et al. 2019). After coloring the voxel grid, we use 3D convolution layers and a 3D hourglass network to extract 3D spatial features. The 3D features are pooled and reshaped into BEV feature maps that are further transformed into confidence maps. The network architecture of V is depicted in Fig. 4 and detailed in the SM.

## Error-statistics-agnostic Training

We train M and V separately. The training process of M can vary for different possible implementations. For the M used in our study, we use a procedure similar to the anchor-based baseline (Chen et al. 2020). The training supervision consists of a depth regression loss, an anchor classification loss, and an offset regression loss. The details are in the SM.

Training V is the key step in our framework, where the inputs and targets for training V are not readily available. We propose a simple error-statistics-agnostic (ESA) strategy to synthesize training data. This strategy does not depend on M and can conveniently be used with any proposal model. Notably, PoseFix (Moon, Chang, and Lee 2019) proposed a refinement model for a different task and assumes certain types of errors in the predictions of the coarse models; such error statistics are themselves collected from extra validation results. In contrast, our strategy assumes a Gaussian prior and does not require more specific knowledge of the error behavior.

Generating training data. For each ground truth 3D bounding box $B_i = (x_i, y_i, z_i, h_i, w_i, l_i, \theta_i)$, we simulate a coarse prediction by adding a noise vector $n_i = (n^x_i, n^y_i, n^z_i, n^h_i, n^w_i, n^l_i, n^\theta_i)$, where $n \sim \mathcal{N}(0, \Sigma)$. We further assume that the noise for each attribute is independent, i.e., the covariance matrix $\Sigma$ is diagonal. In experiments the standard deviations for the above attributes are 0.3 m, 0 m, 0.3 m, 5 cm, 5 cm, 5 cm and 5°, respectively. Larger noise can be used if one assumes a weaker M. The simulated coarse prediction $p^c_i = B_i + n_i$, along with the ground truth confidence maps of $B_i$, forms one training pair for V.
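The data synthesis above is simple enough to sketch directly. The snippet follows the stated noise model; converting the 5° angular noise to radians assumes the yaw angle is stored in radians, which is an assumption of this illustration rather than a detail given in the text.

```python
import numpy as np

# Standard deviations for (x, y, z, h, w, l, theta): 0.3 m, 0 m, 0.3 m,
# 5 cm, 5 cm, 5 cm and 5 degrees, as specified above. The yaw entry assumes
# theta is stored in radians (illustrative assumption).
ESA_STD = np.array([0.3, 0.0, 0.3, 0.05, 0.05, 0.05, np.deg2rad(5.0)])

def synthesize_coarse_proposal(gt_box, rng=np.random):
    """Error-statistics-agnostic simulation of a coarse prediction.

    gt_box: ground-truth 7-vector B = (x, y, z, h, w, l, theta). The returned
    p_c = B + n with n ~ N(0, diag(ESA_STD**2)) is a training input for V,
    paired with the confidence maps of gt_box as the target.
    """
    noise = rng.normal(loc=0.0, scale=ESA_STD)  # zero std => zero noise (y)
    return np.asarray(gt_box, dtype=float) + noise
```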
The Gaussian perturbation is added on-line for each instance in every iteration, which acts like data augmentation so that V does not over-fit to a particular subset of inputs.

Figure 3: Diagram of the main scale network M, following a design similar to DSGN (Chen et al. 2020). A global voxel grid is used to sample image and cost-volume features and construct a volumetric representation of the 3D scene in a pre-defined spatial range. This volumetric representation is converted into Bird's Eye View (BEV) feature maps. Object proposals are obtained by anchor classification and offset regression based on the BEV feature maps. Details are included in our SM.

Figure 4: Architecture of the Vernier scale network V. High-resolution features are extracted from left/right regions of interest (RoIs). A uniform 3D grid conditioned on the current prediction is sampled. Each grid point is projected back to the RoIs to aggregate image features. Pre-computed features from M are used if available. The grid features are processed by a 3D CNN to predict the part confidence maps, which imply the pose update.

Loss function. We penalize the predicted confidence maps of V with an L2 loss, $\mathcal{L}_{conf} = L_2(\hat{c}_{j,k}, c_{j,k})$. The transformed coordinates are penalized with a smooth L1 loss, $\mathcal{L}_{coord} = SL_1(\hat{Y}_{K\times2}, Y_{K\times2})$. For training V-S models, the total loss is $\mathcal{L}^{S}_{total} = \mathcal{L}_{conf} + \mathcal{L}_{coord}$. For V-A models, apart from supervising these regression targets, we add extra intermediate supervision, since we cannot reuse features encoding the scene depth. We add a 3D convolution head that classifies, for each candidate $d_{i,j,k}$, whether it is foreground or not. This serves to add depth cues for training V, similar to (Chen et al. 2020), yet different in that we directly add supervision in the 3D space instead of on the cost volume used in (Chen et al. 2020). For each point of the captured point cloud inside $r_q$, the candidate it occupies after coordinate quantization is treated as foreground. A candidate outside of the ground truth box $B_q$ is treated as background. All other candidates are not assigned labels, since they can be foreground (occupied but not recorded by the LiDAR) as well as background (free space). Since there are many more background candidates than foreground ones, we use the focal loss (Lin et al. 2017) to supervise this classification task:

$$\mathcal{L}_{fg} = \begin{cases} -\alpha\,(1 - \hat{p}_{i,j,k})^{\gamma}\log(\hat{p}_{i,j,k}), & p_{i,j,k} = 1,\\ -(1 - \alpha)\,\hat{p}_{i,j,k}^{\gamma}\log(1 - \hat{p}_{i,j,k}), & p_{i,j,k} = 0,\\ 0, & \text{else},\end{cases}$$

where $\hat{p}_{i,j,k}$ is the predicted foreground probability of candidate $d_{i,j,k}$ and $p_{i,j,k}$ is the ground-truth label, with $p_{i,j,k} = 1$ for foreground. We use the default parameters $\gamma = 2$ and $\alpha = 0.25$, and the total training loss is $\mathcal{L}^{A}_{total} = \mathcal{L}_{conf} + \mathcal{L}_{coord} + \mathcal{L}_{fg}$.
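A minimal sketch of this per-candidate focal loss with ignored labels is given below. The clamping epsilon and the sum reduction are choices of this illustration; the text does not specify them.

```python
import numpy as np

def foreground_focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    """Auxiliary per-candidate foreground classification loss.

    probs:  predicted foreground probabilities, shape (N,).
    labels: 1 = foreground (candidate occupied by a LiDAR point),
            0 = background (candidate outside the ground-truth box),
           -1 = ignored (unlabeled candidates contribute zero loss).
    """
    probs = np.clip(probs, 1e-6, 1.0 - 1e-6)   # numerical safety for log()
    loss = np.zeros_like(probs)
    pos, neg = labels == 1, labels == 0
    loss[pos] = -alpha * (1.0 - probs[pos]) ** gamma * np.log(probs[pos])
    loss[neg] = -(1.0 - alpha) * probs[neg] ** gamma * np.log(1.0 - probs[neg])
    return loss.sum()                           # assumed reduction: sum
```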
## Confidence-aware Robust Inference

During inference, the update prediction $p^f_q$ is derived from the predicted confidence maps $\hat{c}_{j,k}$ and x-z coordinates $\hat{Y}_{K\times2}$, which indicate the tentative updated position of each part. A prior study (Peng et al. 2020) on instance-level S3DOD only predicts the center of the object. This is similar to $K = 1$ in our framework, where the predicted new center is used to refine the instance translation. However, this can lead to sub-optimal results when the center is occluded or hard to estimate. We thus propose an update strategy that utilizes $K = 9$ parts along with their predicted confidences. The predicted coordinates $\hat{Y}_{K\times2}$ may not define parallel and orthogonal edges. We therefore employ a 9-point registration approach, estimating a rigid transformation $\{\hat{R}, \hat{T}\}$ as

$$\hat{R}, \hat{T} = \arg\min_{R, T} \left\| W \left( R\,Y_{K\times2} + T - \hat{Y}_{K\times2} \right) \right\|, \tag{8}$$

where $Y_{K\times2}$ holds the corresponding part coordinates of the current proposal and $W$ is the diagonal matrix with the part confidences as its non-zero elements. The closed-form solution to Eq. (8) is

$$\hat{R} = V^{\top} U^{\top}, \tag{9}$$

$$\hat{T} = -\hat{R}\,Y_{centroid} + \hat{Y}_{centroid}, \tag{10}$$

where $U$, $V$ are given by the singular value decomposition $USV = Y^{\top} W \hat{Y}$ and $Y_{centroid}$ ($\hat{Y}_{centroid}$) is the average location of the $K$ parts of the proposal (prediction). This solution is the global optimum of Eq. (8), as detailed in the SM. The refined 3D box is then obtained by applying the estimated transformation to the current proposal.
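For illustration, a confidence-weighted rigid registration along the lines of Eqs. (8)–(10) can be sketched as follows. This is a generic weighted Procrustes/SVD solution rather than the paper's exact implementation: it centers with confidence-weighted centroids (which reduce to plain averages when all confidences are equal) and adds an optional reflection fix that the text does not discuss.

```python
import numpy as np

def weighted_rigid_update(Y, Y_hat, w):
    """Confidence-weighted rigid registration in the BEV (x-z) plane.

    Y:     (K, 2) part coordinates of the current proposal.
    Y_hat: (K, 2) part coordinates predicted by V.
    w:     (K,) part confidences (the diagonal of W).
    Returns R (2x2) and T (2,) such that R @ y + T approximates y_hat.
    """
    w = np.asarray(w, dtype=float)
    w = w / (w.sum() + 1e-9)
    # Confidence-weighted centroids of the two point sets.
    y_c = (w[:, None] * Y).sum(axis=0)
    yhat_c = (w[:, None] * Y_hat).sum(axis=0)
    # Weighted cross-covariance and its SVD (cf. U S V = Y^T W Y_hat).
    M = (Y - y_c).T @ np.diag(w) @ (Y_hat - yhat_c)
    U, _, Vt = np.linalg.svd(M)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:       # keep R a proper rotation (optional fix)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    T = yhat_c - R @ y_c           # cf. Eq. (10)
    return R, T

# Applying (R, T) to the proposal's center and corners in the BEV plane gives
# the refined box; the height is kept under the ground-plane assumption.
```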
Figure 5: Qualitative results on the KITTI val split. Best viewed in color. More results can be found in the supplementary material.

| Method | Reference | AP3D Easy | AP3D Moderate | AP3D Hard | APBEV Easy | APBEV Moderate | APBEV Hard |
|---|---|---|---|---|---|---|---|
| MLF (Xu and Chen 2018) | CVPR'18 | – | 9.80 | – | – | 19.54 | – |
| TLNet (Qin, Wang, and Lu 2019) | CVPR'19 | 18.15 | 14.26 | 13.72 | 29.22 | 21.88 | 18.83 |
| Stereo R-CNN (Li, Chen, and Shen 2019) | CVPR'19 | 54.1 | 36.7 | 31.1 | 68.5 | 48.3 | 41.5 |
| PL: F-PointNet (Wang et al. 2019) | CVPR'19 | 59.4 | 39.8 | 33.5 | 72.8 | 51.8 | 33.5 |
| PL++: AVOD (You et al. 2019) | ICLR'19 | 63.2 | 46.8 | 39.8 | 77.0 | 63.7 | 56.0 |
| IDA-3D (Peng et al. 2020) | CVPR'20 | 54.97 | 37.45 | 32.23 | 70.68 | 50.21 | 42.93 |
| Disp-RCNN (Sun et al. 2020) | CVPR'20 | 64.29 | 47.73 | 40.11 | 77.63 | 64.38 | 50.68 |
| DSGN (Chen et al. 2020) | CVPR'20 | 72.31 | 54.27 | 47.71 | 83.24 | 63.91 | 57.83 |
| ZoomNet (Xu et al. 2020) | AAAI'20 | 62.96 | 50.47 | 43.63 | 78.68 | 66.19 | 57.60 |
| RTS-3D (Li, Su, and Zhao 2021) | AAAI'21 | 64.76 | 46.70 | 39.27 | 77.50 | 58.65 | 50.14 |
| Disp-RCNN-flb* (Chen et al. 2021) | T-PAMI'21 | 70.11 | 54.43 | 47.40 | 77.47 | 66.06 | 57.76 |
| SNVC (Ours) | AAAI'22 | 77.29 | 63.75 | 56.81 | 87.07 | 72.95 | 66.77 |

Table 1: Quantitative comparison of AP3D and APBEV with SOTA stereo 3D object detection approaches on the KITTI val split. Disp-RCNN-flb* requires extra mask annotations for training. 11 recall values are used to make the comparison consistent.

## Experiments

We first introduce the benchmark and evaluation metrics used, and then compare the overall performance of our MR SNVC with previously published approaches. We then demonstrate how our approach can improve existing 3D object detectors as a plug-and-play module. Finally, we present an ablation study on key design factors of V.

Dataset. We employ the KITTI object detection benchmark (Geiger, Lenz, and Urtasun 2012) for evaluation, which contains outdoor RGB images captured with calibrated stereo cameras. The dataset is split into 7,481 training images and 7,518 testing images. The training images are further split into the train split and the val split, containing 3,712 and 3,769 images respectively. We use the train split for training and conduct hyper-parameter tuning on the val split. When reporting model performance on the testing set, both the train split and the val split are used for training.

Evaluation metrics. We conduct the evaluation for the car category and employ the official average precision metrics to validate our approach. 3D Average Precision (AP3D) averages precision over uniformly sampled recall values, where a true positive is a predicted 3D box that has 3D intersection-over-union (IoU) > 0.7 with the ground truth. Bird's Eye View Average Precision (APBEV) instead uses 2D IoU > 0.7 as the criterion, where the 3D boxes are projected to the ground plane and the object heights are ignored. The KITTI benchmark further defines three sets of ground truth labels with different difficulty levels: easy, moderate, and hard. The difficulty level of a ground truth label is determined according to its 2D bounding box height, its occlusion level, and its truncation level. Evaluation is performed in parallel on these three sets of ground truth labels. The hard set contains all the ground truth labels, while the easy and moderate sets contain a fraction of easier objects. For consistency with previous works, results on the val split are reported using 11 recall values (denoted @R11), while results on the official test set use 40 recall values (denoted @R40). Training details are attached in our SM.
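For reference, the interpolated AP computation described under the evaluation metrics can be sketched as follows, assuming the common 11-point sampling that includes recall 0 and a 40-point sampling that starts at 1/40; IoU thresholding and detection-to-ground-truth matching are left to the official KITTI evaluation code.

```python
import numpy as np

def interpolated_ap(recalls, precisions, num_points=40):
    """Interpolated average precision over sampled recall values (@R11/@R40).

    recalls, precisions: matched arrays obtained by ranking the detections of
    one difficulty set by confidence. For each sampled recall threshold, the
    maximum precision achieved at recall >= threshold is averaged.
    """
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    if num_points == 11:
        thresholds = np.linspace(0.0, 1.0, 11)          # 0, 0.1, ..., 1.0
    else:
        thresholds = np.linspace(1.0 / num_points, 1.0, num_points)
    ap = 0.0
    for t in thresholds:
        above = recalls >= t
        ap += precisions[above].max() if above.any() else 0.0
    return ap / len(thresholds)
```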
Comparison with state-of-the-arts. Tab. 1 and Tab. 2 compare the 3D object detection performance of our approach with previously published methods on the val split and the official testing set, respectively. Fig. 5 shows 3D object detections of our system on the val split. Our system using V-S outperforms previous approaches in all metrics by a clear margin. Compared with (Chen et al. 2020), which only builds a coarse scene-level representation, our system benefits from the proposed HR instance-level model that leads to more precise 3D localization. Compared with approaches that build an instance-level point cloud (Xu et al. 2020; Sun et al. 2020), our voxel-based representation shows superior performance, especially for the hard category. We believe the reason is that the instance point cloud in (Xu et al. 2020; Sun et al. 2020) is sensitive to depth estimation errors and fails to handle distant and occluded objects due to the small number of points and the lack of semantic visual features.

In this camera-ready version, we also refer to a concurrent work (Guo et al. 2021) starting from the same baseline (DSGN) as us, which regularizes voxel representation learning by distilling knowledge from an extra pre-trained network. The contribution of distillation is complementary to ours. None of the methods in Tab. 1/2 needs this process, known as learning from privileged information (Lopez-Paz et al. 2015) or learning from an extra teacher.

Model-agnostic refinement. Tab. 3 shows the results when utilizing our V along with other existing 3D object detectors. We download the pre-trained weights from the official implementations of IDA-3D and RTS-3D to generate proposals and use our V to refine them (different environments led to slight differences compared to the published results in Tab. 1). Note that using V leads to consistent and significant performance gains. This result validates that our V can be used as a model-agnostic refinement module. We show predictions for some distant and partially-occluded objects in Fig. 6, where our approach obtains better pose estimates than RTS-3D. RTS-3D can be used to generate coarse predictions in real-time applications, while our V can complement it to obtain high-quality predictions only when necessary.

Figure 6: Qualitative comparison with RTS-3D (Li, Su, and Zhao 2021) for objects on the KITTI val split. The image patches are RoIs on the left image and the 3D cuboid predictions are shown as bounding boxes in the bird's eye view plots. Note that the objects are distant from the cameras.

Which objects benefit more from V? Tab. 3 shows the overall improvement in 3DOD performance, yet it does not reflect how the improvement relates to different object attributes. Fig. 7 instead shows which objects enjoy larger performance improvements, using the same proposals as in Tab. 3. One can use such knowledge to decide which coarse proposals to refine in practice. Each ground truth (GT) object is assigned to the corresponding bin based on an attribute such as its depth, if there is a matching predicted object. A predicted object matches a ground truth if their 3D IoU > 0.3. The match with the largest 3D IoU is recorded for each GT object, and the average of the matching 3D IoUs is shown as the line plots for GT objects in each bin.

Figure 7: 3D object detection performance with respect to ground truth depth (top), occlusion level (middle) and truncation level (bottom) on the KITTI val split. Each bar indicates the number of ground truth (GT) objects that fall in the corresponding bin and are detected with 3D IoU > 0.3. A red cross shows the average 3D IoU for those detected GT objects in a bin before utilizing our V for refinement. A green dot shows the same result after using V. The occlusion level and truncation level are annotated for each ground truth object in the KITTI benchmark.

The detection quality in terms of AP3D improves for GT objects in all bins after using our V for refinement, for all three attributes that influence the difficulty of detection. Note that even if the objects are heavily occluded or partially truncated, using V still leads to a robust performance boost. RTS-3D also uses a voxel-based representation, and we observe that the improvement over it for middle-range and distant objects is more significant than for nearby objects. This validates our assumption that extracting complementary HR features is more helpful for the tiny objects that only have LR representations in the global feature maps.

Effect of learning multiple object parts. To validate our multi-part registration approach, we re-train another model to predict only the center confidence. This model uses the predicted center to update the proposal translation during inference. The performance comparison is shown in Tab. 4, where our multi-part strategy leads to consistently better performance, since it is more robust to partially-occluded objects whose visible parts can provide a more reliable estimate.

Effect of voxel size. To demonstrate the advantage of learning an HR voxel representation, we train V with varying voxel resolutions; the results of using these variants with RTS-3D are shown in Tab. 5. Using a smaller voxel size significantly improves the 3DOD performance, which justifies our choice of parameters. While using a much coarser voxel grid (second row) leads to worse results, such performance is still better than using RTS-3D alone without our V. However, when the voxel grid is extremely coarse (first row), V cannot learn an effective volumetric representation for S3DOD.
| Method | Reference | AP3D Easy | AP3D Moderate | AP3D Hard | APBEV Easy | APBEV Moderate | APBEV Hard |
|---|---|---|---|---|---|---|---|
| TLNet (Qin, Wang, and Lu 2019) | CVPR'19 | 7.64 | 4.37 | 3.74 | 13.71 | 7.69 | 6.73 |
| Stereo R-CNN (Li, Chen, and Shen 2019) | CVPR'19 | 47.58 | 30.23 | 23.72 | 61.92 | 41.31 | 33.42 |
| ZoomNet (Xu et al. 2020) | AAAI'20 | 55.98 | 38.64 | 30.97 | 72.94 | 54.91 | 44.14 |
| IDA-3D (Peng et al. 2020) | CVPR'20 | 45.09 | 29.32 | 23.13 | 61.87 | 42.47 | 34.59 |
| Disp-RCNN (Sun et al. 2020) | CVPR'20 | 59.58 | 39.34 | 31.99 | 74.07 | 52.34 | 43.77 |
| DSGN (Chen et al. 2020) | CVPR'20 | 73.50 | 52.18 | 45.14 | 82.90 | 65.05 | 56.60 |
| CDN (Garg et al. 2020) | NeurIPS'20 | 74.52 | 54.22 | 46.36 | 83.32 | 66.24 | 57.65 |
| RTS-3D (Li, Su, and Zhao 2021) | AAAI'21 | 58.51 | 37.38 | 31.12 | 72.17 | 51.79 | 43.19 |
| Disp-RCNN-flb* (Chen et al. 2021) | T-PAMI'21 | 68.21 | 45.78 | 37.73 | 79.76 | 58.62 | 47.73 |
| SNVC (Ours) | AAAI'22 | 78.54 | 61.34 | 54.23 | 86.88 | 73.61 | 64.49 |

Table 2: Overall system performance evaluated with AP3D@R40 and APBEV@R40, compared with SOTA stereo 3D object detection approaches on the KITTI test set (official KITTI leaderboard). The official recall values are used.

| Method | AP3D/APBEV Easy | AP3D/APBEV Moderate | AP3D/APBEV Hard |
|---|---|---|---|
| RTS-3D | 59.31/74.49 | 41.61/53.98 | 34.67/46.66 |
| RTS-3D + V | 69.25/82.71 | 52.92/65.75 | 45.82/57.47 |
| IDA-3D | 53.59/69.28 | 36.79/50.11 | 32.34/43.31 |
| IDA-3D + V | 64.13/80.81 | 48.89/62.19 | 43.00/55.03 |

Table 3: AP3D/APBEV@R11 evaluated on the KITTI val split when using V-A as a model-agnostic refinement module with other 3D object detectors.

| Method | AP3D/APBEV Easy | AP3D/APBEV Moderate | AP3D/APBEV Hard |
|---|---|---|---|
| Center-only (K=1) | 64.94/77.89 | 47.22/64.00 | 43.73/55.84 |
| Part-based (K=9) | 69.25/82.71 | 52.92/65.75 | 45.82/57.47 |

Table 4: The same evaluation for RTS-3D + V as in Tab. 3 with a varying number of parts K.

| (NL, NH, NW) | AP3D/APBEV Easy | AP3D/APBEV Moderate | AP3D/APBEV Hard |
|---|---|---|---|
| (24, 16, 16) | 33.65/43.80 | 27.84/37.38 | 24.30/32.64 |
| (48, 16, 32) | 60.10/73.25 | 44.31/56.19 | 37.45/53.30 |
| (192, 32, 128) | 69.25/82.71 | 52.92/65.75 | 45.82/57.47 |

Table 5: The same evaluation for RTS-3D + V as in Tab. 3 with varying voxel size.

Figure 8: A comparison of the needed number of voxels (NoVs) with and without our MR strategy. Vanilla: NoVs used for representing a global scene using a uniform resolution. Vernier: NoVs used to model a local scene in our V. Main scale: NoVs used in our M with a resolution of 0.2 m. SNVC, N=x: total NoVs as the sum of NoVs used in M plus those used in building local scenes for x proposals.

Comparison of space requirement. Building a uniform voxel grid for the global scene, as used in the scene-centric approaches, requires an intimidating amount of memory that does not scale to smaller voxel sizes. A comparison of the number of voxels (NoVs) required in such systems and that needed in our proposed MR system is shown in Fig. 8. The NoVs used in the vanilla approach is $Q(L_g, W_g, H_g, \Delta) = \frac{L_g W_g H_g}{\Delta^3}$, where $[L_g, W_g, H_g] = [60\,\text{m}, 60\,\text{m}, 4\,\text{m}]$ is the global spatial range and $\Delta$ is the voxel size. In contrast, the NoVs used in our approach is $N_L N_H N_W \cdot N + Q(L_g, W_g, H_g, \Delta_g)$, where $\Delta_g$ is the global voxel size used in the main scale network (0.2 m) and $N$ is the number of selected coarse proposals. Note that SNVC uses significantly fewer NoVs than the vanilla approach to achieve a finer representation beyond $\Delta_g$, even for a large number of proposals (N=50). Our framework can thus model the important regions at a high resolution while keeping the total memory usage tractable.

More ablation studies and results can be found in our SM, including how V can be employed for tracking-by-detection, using the predictions from the previous frame as proposals.
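For reference, the voxel-count comparison above works out as follows, using the spatial ranges and grid sizes stated in the text.

```python
# Voxel-count comparison behind Fig. 8, using the ranges stated above.
L_g, W_g, H_g = 60.0, 60.0, 4.0             # global spatial range in meters

def global_novs(delta):                      # Q(L_g, W_g, H_g, delta)
    return L_g * W_g * H_g / delta ** 3

novs_main = global_novs(0.2)                 # coarse main-scale grid, 0.2 m
novs_vernier = 192 * 32 * 128                # one local Vernier grid of V

print(f"vanilla global grid at 0.03 m: {global_novs(0.03):.2e} voxels")
print(f"main scale M (0.2 m):          {novs_main:.2e} voxels")
for n in (1, 10, 50):                        # SNVC refining n proposals
    print(f"SNVC, N={n:>2d}: {novs_main + n * novs_vernier:.2e} voxels")
```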
## Conclusion

We introduce the idea of multi-resolution modeling to voxel-based stereo 3D object detection by modeling different regions with varying resolutions. This approach keeps the detection problem computationally tractable and can model important regions with smaller voxels to achieve high precision. A new instance-level model is designed, which samples candidate 3D locations and uses predicted object part coordinates to estimate a pose update. Our approach is validated to achieve state-of-the-art stereo 3D object detection performance and can perform model-agnostic refinement. For future study, instead of using a sampling grid with a fixed range, information from previous frames could be used to build a motion model that helps predict the future object location and provides a better clue on where to sample.

## Acknowledgments

This work is supported by Hong Kong Research Grants Council (RGC) General Research Fund (GRF) 16203319.

## References

Andriluka, M.; Roth, S.; and Schiele, B. 2008. People-tracking-by-detection and people-detection-by-tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, 1–8. IEEE.

Bellotto, N.; Sommerlade, E.; Benfold, B.; Bibby, C.; Reid, I.; Roth, D.; Fernández, C.; Van Gool, L.; and Gonzalez, J. 2009. A distributed camera system for multi-resolution surveillance. In 2009 Third ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), 1–8. IEEE.

Blaha, M.; Vogel, C.; Richard, A.; Wegner, J. D.; Pock, T.; and Schindler, K. 2016. Large-scale semantic 3D reconstruction: an adaptive multi-resolution model for multi-class volumetric labeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3176–3184.

Brazil, G.; and Liu, X. 2019. M3D-RPN: Monocular 3D region proposal network for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9287–9296.

Chen, L.; Sun, J.; Xie, Y.; Zhang, S.; Shuai, Q.; Jiang, Q.; Zhang, G.; Bao, H.; and Zhou, X. 2021. Shape Prior Guided Instance Disparity Estimation for 3D Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Chen, X.; Kundu, K.; Zhu, Y.; Berneshawi, A. G.; Ma, H.; Fidler, S.; and Urtasun, R. 2015. 3D object proposals for accurate object class detection. In Advances in Neural Information Processing Systems, 424–432.

Chen, X.; Ma, H.; Wan, J.; Li, B.; and Xia, T. 2017. Multi-view 3D object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1907–1915.

Chen, Y.; Liu, S.; Shen, X.; and Jia, J. 2020. DSGN: Deep stereo geometry network for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12536–12545.

Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T. S.; and Zhang, L. 2020. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5386–5395.

Choy, C. B.; Xu, D.; Gwak, J.; Chen, K.; and Savarese, S. 2016. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision, 628–644. Springer.

Garg, D.; Wang, Y.; Hariharan, B.; Campbell, M.; Weinberger, K. Q.; and Chao, W.-L. 2020. Wasserstein Distances for Stereo Disparity Estimation. In Advances in Neural Information Processing Systems, volume 33, 22517–22529. Curran Associates, Inc.
Geiger, A.; Lenz, P.; and Urtasun, R. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3354–3361. IEEE.

Guo, X.; Shi, S.; Wang, X.; and Li, H. 2021. LIGA-Stereo: Learning lidar geometry aware representations for stereo-based 3D detector. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3153–3163.

Ke, L.; Li, S.; Sun, Y.; Tai, Y.-W.; and Tang, C.-K. 2020. GSNet: Joint vehicle pose and shape reconstruction with geometrical and scene-aware supervision. In European Conference on Computer Vision, 515–532. Springer.

Kundu, A.; Li, Y.; and Rehg, J. M. 2018. 3D-RCNN: Instance-level 3D object reconstruction via render-and-compare. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3559–3568.

Laine, S.; and Karras, T. 2010. Efficient sparse voxel octrees. IEEE Transactions on Visualization and Computer Graphics, 17(8): 1048–1059.

Lang, A. H.; Vora, S.; Caesar, H.; Zhou, L.; Yang, J.; and Beijbom, O. 2019. PointPillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12697–12705.

Li, P.; Chen, X.; and Shen, S. 2019. Stereo R-CNN based 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7644–7652.

Li, P.; Su, S.; and Zhao, H. 2021. RTS3D: Real-time Stereo 3D Detection from 4D Feature-Consistency Embedding Space for Autonomous Driving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 1930–1939.

Li, S.; Ke, L.; Pratama, K.; Tai, Y.-W.; Tang, C.-K.; and Cheng, K.-T. 2020. Cascaded deep monocular 3D human pose estimation with evolutionary training data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6173–6183.

Li, S.; Yan, Z.; Li, H.; and Cheng, K.-T. 2021. Exploring intermediate representation for monocular vehicle pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1873–1883.

Li, Z.; Wang, F.; and Wang, N. 2021. LiDAR R-CNN: An Efficient and Universal 3D Object Detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7546–7555.

Liang, M.; Yang, B.; Chen, Y.; Hu, R.; and Urtasun, R. 2019. Multi-task multi-sensor fusion for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7345–7353.

Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2980–2988.

Liu, L.; Lu, J.; Xu, C.; Tian, Q.; and Zhou, J. 2019. Deep fitting degree scoring network for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1057–1066.

Liu, L.; Wu, C.; Lu, J.; Xie, L.; Zhou, J.; and Tian, Q. 2020. Reinforced axial refinement network for monocular 3D object detection. In European Conference on Computer Vision, 540–556. Springer.

Lopez-Paz, D.; Bottou, L.; Schölkopf, B.; and Vapnik, V. 2015. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643.

Lu, Y.; Ma, X.; Yang, L.; Zhang, T.; Liu, Y.; Chu, Q.; Yan, J.; and Ouyang, W. 2021. Geometry uncertainty projection network for monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3111–3121.
Moon, G.; Chang, J. Y.; and Lee, K. M. 2019. PoseFix: Model-agnostic general human pose refinement network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7773–7781.

Peng, W.; Pan, H.; Liu, H.; and Sun, Y. 2020. IDA-3D: Instance-depth-aware 3D object detection from stereo vision for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13015–13024.

Qi, C. R.; Liu, W.; Wu, C.; Su, H.; and Guibas, L. J. 2018. Frustum PointNets for 3D object detection from RGB-D data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 918–927.

Qin, Z.; Wang, J.; and Lu, Y. 2019. Triangulation learning network: from monocular to stereo 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7615–7623.

Reading, C.; Harakeh, A.; Chae, J.; and Waslander, S. L. 2021. Categorical depth distribution network for monocular 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8555–8564.

Riegler, G.; Osman Ulusoy, A.; and Geiger, A. 2017. OctNet: Learning deep 3D representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3577–3586.

Roddick, T.; Kendall, A.; and Cipolla, R. 2019. Orthographic Feature Transform for Monocular 3D Object Detection. In Proceedings of the British Machine Vision Conference (BMVC), 59.1–59.13. BMVA Press.

Seitz, S. M.; and Dyer, C. R. 1999. Photorealistic scene reconstruction by voxel coloring. International Journal of Computer Vision, 35(2): 151–173.

Shi, S.; Wang, X.; and Li, H. 2019. PointRCNN: 3D object proposal generation and detection from point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 770–779.

Snow, D.; Viola, P.; and Zabih, R. 2000. Exact voxel occupancy with graph cuts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000), volume 1, 345–352. IEEE.

Sun, J.; Chen, L.; Xie, Y.; Zhang, S.; Jiang, Q.; Zhou, X.; and Bao, H. 2020. Disp R-CNN: Stereo 3D object detection via shape prior guided instance disparity estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10548–10557.

Sun, K.; Xiao, B.; Liu, D.; and Wang, J. 2019. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5693–5703.

Vogiatzis, G.; Torr, P. H.; and Cipolla, R. 2005. Multi-view stereo via volumetric graph-cuts. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, 391–398. IEEE.

Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. 2020. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Wang, L.; Du, L.; Ye, X.; Fu, Y.; Guo, G.; Xue, X.; Feng, J.; and Zhang, L. 2021a. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 454–463.

Wang, Y.; Chao, W.-L.; Garg, D.; Hariharan, B.; Campbell, M.; and Weinberger, K. Q. 2019. Pseudo-LiDAR from visual depth estimation: Bridging the gap in 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8445–8453.
Wang, Y.; Yang, B.; Hu, R.; Liang, M.; and Urtasun, R. 2021b. PLUME: Efficient 3D Object Detection from Stereo Images. arXiv preprint arXiv:2101.06594.

Weng, X.; and Kitani, K. 2019. Monocular 3D object detection with pseudo-LiDAR point cloud. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.

Xu, B.; and Chen, Z. 2018. Multi-level fusion based 3D object detection from monocular images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2345–2353.

Xu, Z.; Zhang, W.; Ye, X.; Tan, X.; Yang, W.; Wen, S.; Ding, E.; Meng, A.; and Huang, L. 2020. ZoomNet: Part-aware adaptive zooming neural network for 3D object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 12557–12564.

Yan, Y.; Mao, Y.; and Li, B. 2018. SECOND: Sparsely embedded convolutional detection. Sensors, 18(10): 3337.

You, Y.; Wang, Y.; Chao, W.-L.; Garg, D.; Pleiss, G.; Hariharan, B.; Campbell, M.; and Weinberger, K. Q. 2019. Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving. In ICLR.

Zhou, Y.; and Tuzel, O. 2018. VoxelNet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4490–4499.