# DCDet: Dynamic Cross-based 3D Object Detector

Shuai Liu, Boyang Li, Zhiyu Fang and Kai Huang

School of Computer Science and Engineering, Sun Yat-sen University
{liush376@mail2, liby83@mail, fangzhy9@mail2, huangk36@mail}.sysu.edu.cn

Corresponding author. The appendix is presented at https://arxiv.org/abs/2401.07240.

Abstract

Recently, significant progress has been made in the research of 3D object detection. However, most prior studies have focused on the utilization of center-based or anchor-based label assignment schemes, and alternative label assignment strategies remain unexplored in 3D object detection. We find that the center-based label assignment often fails to generate sufficient positive samples for training, while the anchor-based label assignment tends to encounter an imbalance issue when handling objects with different scales. To solve these issues, we introduce a dynamic cross label assignment (DCLA) scheme, which dynamically assigns positive samples for each object from a cross-shaped region, thus providing sufficient and balanced positive samples for training. Furthermore, to address the challenge of accurately regressing objects with varying scales, we put forth a rotation-weighted Intersection over Union (RWIoU) metric to replace the widely used L1 metric in the regression loss. Extensive experiments demonstrate the generality and effectiveness of our DCLA and RWIoU-based regression loss. The code is available at https://github.com/Say2L/DCDet.git.

1 Introduction

3D object detection plays a crucial role in enabling unmanned vehicles to perceive and understand their surroundings, which is fundamental for ensuring safe driving. Label assignment is a key process for training 3D object detectors. The dominant label assignment strategies in 3D object detection are anchor-based [Shi et al., 2020a; Xu et al., 2022; Zheng et al., 2021] and center-based [Ge et al., 2020; Hu et al., 2022; Yin et al., 2021; Wang et al., 2021]. However, both of these label assignment schemes encounter issues that limit the performance of detectors.

The anchor-based label assignment generally encounters an imbalance problem when assigning positive samples to objects with different scales. It employs prior knowledge of the spatial scale of each category to predefine fixed-size anchors on the grid map. By comparing the intersection over union (IoU) between anchors and ground-truth boxes, positive anchors are determined to classify and regress objects. Consequently, the anchor-based label assignment tends to exhibit an uneven distribution of positive anchors across objects of different sizes. For example, car objects typically have a significantly higher number of positive anchors than pedestrian objects. This imbalance poses a challenge during training and leads to slow convergence for small objects. Moreover, the anchor-based label assignment scheme necessitates recalculating the statistical data distribution of each new dataset to obtain optimal anchor sizes. This requirement may reduce the robustness of a trained detector when applied to datasets with distinct data distributions.

The center-based label assignment scheme often faces challenges in providing adequate positive samples for training. This approach has recently been adopted by various 3D object detectors [Ge et al., 2020; Hu et al., 2022; Yin et al., 2021; Wang et al., 2021]. It focuses solely on object centers as positive samples (similar to positive anchors).
As a result, the number of positive samples remains consistent across objects of different scales, solving the issue of imbalanced positive sample distribution encountered in anchor-based label assignment. However, the center-based label assignment overlooks many potential high-quality positive samples, as only one positive sample per object is responsible for regressing object attributes. This leads to an inefficient utilization of training data and sub-optimal network performance.

To simultaneously address the aforementioned challenges, this paper introduces a dynamic cross label assignment (DCLA), which aims to provide balanced and ample high-quality positive samples for objects of different scales. Specifically, DCLA dynamically assigns positive samples for each object within a cross-shaped region. The size of this region is determined by a distance parameter, which represents the Manhattan distance from the object's center point. Given the varying scales and potential missing points in point clouds, a dynamic selection strategy is employed to adaptively choose positive samples from the cross-shaped region. As a result, each object is assigned sufficient positive samples, and objects of different scales receive a similar number of positive samples, effectively mitigating the issue of positive sample imbalance.

Moreover, a rotation-weighted IoU (RWIoU) is introduced to accurately regress objects. In the 2D domain, IoU-based losses [Rezatofighi et al., 2019; Zheng et al., 2020; Zhang et al., 2022a] have been confirmed to be better than the L-norm loss. However, in 3D object detection, the development of the IoU-based loss lags behind its 2D counterpart. This challenge arises due to the increased degrees of freedom in the 3D domain. The proposed RWIoU utilizes the idea of rotation weighting, thus elegantly integrating the rotation and direction attributes of objects into the IoU metric. The RWIoU loss can replace the L-norm and direction losses to help detectors achieve higher accuracy. Finally, a 3D object detection framework dubbed DCDet is proposed, which combines the DCLA and RWIoU.

The contributions of this work are summarized as follows:

- We thoroughly investigate the currently widely used label assignment strategies and analyze their pros and cons. Based on experimental observations, we introduce a new label assignment strategy called dynamic cross label assignment (DCLA).
- We propose a rotation-weighted IoU (RWIoU) to better measure the proximity of two rotated boxes compared to the L1 metric. RWIoU takes the rotations and directions of 3D objects into consideration simultaneously.
- A 3D object detector dubbed DCDet is proposed which combines the DCLA and RWIoU. Extensive experiments on the Waymo Open [Sun et al., 2020] and KITTI [Geiger et al., 2012] datasets demonstrate the effectiveness and generality of our methods.

2 Related Work

2.1 3D Object Detection

VoxelNet [Zhou and Tuzel, 2018] encodes voxel features using PointNet [Qi et al., 2017a], and then extracts features from 3D feature maps through 3D convolutions. SECOND [Yan et al., 2018] efficiently encodes sparse voxel features with its proposed 3D sparse convolution. PointPillars [Lang et al., 2019] divides a point cloud into pillar voxels, avoiding the use of 3D convolution and achieving high inference speed.
3DSSD [Yang et al., 2020] significantly improves inference speed by discarding the upsampling layers and refinement networks commonly used in point-based methods. PointRCNN [Shi et al., 2019] produces proposals from raw points using PointNet++ [Qi et al., 2017b], and then refines bounding boxes in the second stage. PV-RCNN [Shi et al., 2020a] uses features of internal points to refine proposals. Voxel R-CNN [Deng et al., 2021] replaces the features of raw points in the second-stage refinement with 3D voxel features from the 3D backbone.

2.2 Label Assignment

Label assignment, which is fundamental to 2D and 3D object detection, significantly influences the optimization of a network. Its development is more mature in 2D object detection, with RetinaNet [Lin et al., 2017] assigning anchors on the output grid map, FCOS [Tian et al., 2019] designating grid points within the range of ground-truth boxes as positive samples, and CenterNet [Zhou et al., 2019b] identifying center points of ground-truth boxes as positive samples. ATSS [Zhang et al., 2020] and AutoAssign [Zhu et al., 2020] propose adaptive strategies for dynamic threshold selection and dynamic positive/negative confidence adjustment, respectively. YOLOX [Ge et al., 2021] introduces the SimOTA scheme for dynamic positive sample selection. Conversely, label assignment in 3D object detection is less developed, grappling with unique challenges such as maintaining a balance of positive samples across various object sizes. Current methods in 3D object detection typically use either anchor-based [Yan et al., 2018; Lang et al., 2019; Deng et al., 2021] or center-based [Yin et al., 2021; Ge et al., 2020; Hu et al., 2022] label assignment schemes. However, these schemes have drawbacks: the anchor-based label assignment often results in unbalanced assignments, and the center-based label assignment may overlook high-quality samples. To simultaneously overcome the above two drawbacks, we propose the dynamic cross label assignment (DCLA). Details about the DCLA are described in the methodology section.

2.3 IoU-based Loss

IoU-based losses [Rezatofighi et al., 2019; Zheng et al., 2020; Zhang et al., 2022a] without rotation have been well studied in 2D object detection. These methods not only ensure consistency between the training objective and the evaluation metric but also normalize object attributes, leading to enhanced performance compared to the L-norm loss. Due to their success in 2D object detection, some 3D object detection methods [Zhou et al., 2019a; Sheng et al., 2022; Shi et al., 2022] incorporate IoU-based losses. 3D-IoU [Zhou et al., 2019a] extends the IoU calculation from 2D to 3D by considering rotation. However, the optimization direction of the 3D-IoU-based loss can be opposite to the correct direction. To address this, RDIoU [Sheng et al., 2022] decouples rotation from the 3D IoU. It treats rotation as an attribute similar to object location, but it does not consider object direction; a direction loss is still needed for classifying object directions. ODIoU [Shi et al., 2022] combines the L1 metric and the axis-aligned IoU to regress objects. Our proposed RWIoU incorporates both rotation and direction into the IoU metric, eliminating the need for L-norm and direction losses. Details of RWIoU are explained in the next section.
3 Methodology

This section describes the dynamic cross label assignment (DCLA) and the rotation-weighted IoU (RWIoU) in detail. The overall framework is illustrated in Figure 2.

Figure 2: The overall framework of our DCDet. The dynamic cross label assignment scheme is only used in the training phase.

3.1 Dynamic Cross Label Assignment

The label assignment schemes used in existing 3D object detection methods generally rely on prior information, such as spatial ranges or object scales, to manually select positive samples. For example, the anchor-based label assignment uses object scales to set the sizes of anchors and then takes anchors with an IoU greater than a certain threshold as positive samples. The anchor-based label assignment generally produces unbalanced positive samples for objects of different scales, causing the model to prioritize large-scale objects. The center-based label assignment usually takes the center points of ground truths as positive samples. This can result in a large number of good-quality samples being discarded, leading to inefficient utilization of training data. The above label assignment schemes share a common property: they all use static prior information as the selection criterion, and this prior information is determined by human experience.

Dynamic label assignment schemes [Zhang et al., 2020; Zhu et al., 2020; Ge et al., 2021] have shown their advantages in 2D object detection. However, directly transferring these schemes to 3D object detection is not trivial. There are two main challenges: 1) there is no room to dynamically select positive samples for small objects (e.g., pedestrians), because small objects generally cover only one or two grid points on the output map; 2) the coverage of objects with different scales varies greatly, which easily results in an imbalance of positive samples between objects of different scales.

To dynamically select sufficient high-quality positive samples while maintaining the balance between objects of different scales, we propose a dynamic cross label assignment (DCLA) scheme. Specifically, it limits the positive sampling range to a cross-shaped region for each object. Typically, an object's center region on a feature map contains enough features to identify it [Tian et al., 2019], and objects in point clouds have regular shapes. Therefore, we only use the center point and its surrounding points for positive sampling in the DCLA scheme. We refer to this sampling range as the cross region. It can be adjusted by a parameter r to adapt to outputs with different grid cell sizes, as illustrated in Figure 1, where r is the Manhattan distance from the center point. When r = 1, the cross region covers the center and its top, bottom, left, and right neighbors. When r = 0, DCLA degenerates to the center-based label assignment.

Figure 1: Cross-shaped region for different grid cell sizes.

The implementation steps of DCLA are described in detail next. Given a ground truth $b^t$ and the positions $P$ in its cross region, the selection cost is calculated as follows:

$c_j = L^{cls}_j + \lambda_{reg} L^{reg}_j, \quad j \in P$,  (1)

where $L^{cls}_j$ and $L^{reg}_j$ are the classification loss and regression loss between the ground truth $b^t$ and the $j$-th prediction $b^o_j$, respectively, and $\lambda_{reg}$ is the weight of the regression loss. Then, the predictions in the cross region are sorted according to their selection costs.
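To make the cross-shaped region of Figure 1 and the selection cost of Eq. (1) concrete, the following is a minimal PyTorch-style sketch. It is an illustration rather than the released implementation: the function names, the per-candidate loss tensors, and the default value λreg = 3 (taken from the training settings in Section 4.1) are assumptions.

```python
import torch

def cross_region_offsets(r: int):
    """Grid offsets (di, dj) whose Manhattan distance from the center cell is <= r.
    For r = 1 this yields the center plus its top, bottom, left and right neighbors;
    for r = 0 it degenerates to the center cell only (center-based assignment)."""
    return [(di, dj)
            for di in range(-r, r + 1)
            for dj in range(-r, r + 1)
            if abs(di) + abs(dj) <= r]

def selection_costs(cls_loss: torch.Tensor, reg_loss: torch.Tensor, lambda_reg: float = 3.0):
    """Eq. (1): per-candidate cost c_j = L^cls_j + lambda_reg * L^reg_j.

    cls_loss and reg_loss are 1-D tensors over the candidates j in the cross
    region P of one ground truth. Returns the costs and an ascending ranking."""
    costs = cls_loss + lambda_reg * reg_loss
    order = torch.argsort(costs)  # lowest-cost candidates first
    return costs, order
```

The dynamic top-k rule described next is then applied to this cost ranking to pick the final positives.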
Next, the IoUs between the ground truth $b^t$ and the predictions $b^o_j, j \in P$ are summed to determine the number of positive samples:

$k = \max\left(\left\lfloor \sum_{j \in P} \mathrm{IoU}(b^t, b^o_j) \right\rfloor,\ 1\right)$,  (2)

and $k$ is used as the number of positive samples for the ground truth $b^t$. Finally, the top $k$ predictions (those with the lowest selection costs) are selected as positive samples, and the remaining predictions are treated as negative samples.

Specifically, given a point cloud input and the ground-truth boxes $\{b^t_1, b^t_2, \dots, b^t_n\}$, we assume that $f(b^t_i, b^o_{ij})$ represents the regression loss function, where $b^t_i$ and $b^o_{ij}$ denote the $i$-th ground-truth box and its $j$-th predicted box, respectively. The regression loss $\ell$ for the point cloud is then calculated as follows:

$\ell = \dfrac{1}{N} \sum_{i=1}^{n} \sum_{j=1}^{k_i} f(b^t_i, b^o_{ij})$,

where $N$ represents the total number of positive samples in the input point cloud, and $k_i$ denotes the number of positive samples assigned to the ground truth $b^t_i$. Notably, $k_i$ is calculated independently for each ground truth, as in Eq. (2). It is related to the number of high-quality samples in the cross region and does not depend on the ground-truth scale. In the anchor-based label assignment, by contrast, $k_i$ varies significantly with the ground-truth scale, resulting in a bias towards large-scale objects in the loss. For the center-based label assignment, $k_i$ is always equal to 1, leading to inefficient utilization of training data.

We adopt the heatmap target for the classification task. The weights of positive samples are set to 1, and the weights of negative samples inside cross regions are set to the IoU values between the predicted boxes and the ground-truth boxes. The weights of the remaining negative samples are all set to 0.
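Continuing the sketch above, the dynamic-k rule of Eq. (2) and the classification weighting of the previous paragraph could look as follows. Again this is a hedged illustration under the stated assumptions; the tensor layout and the `selection_costs` helper are carried over from the earlier snippet and are not the authors' code.

```python
import torch

def dynamic_k(ious: torch.Tensor) -> int:
    """Eq. (2): the number of positives for one ground truth is the floored sum of
    IoUs between that ground truth and its cross-region candidates, at least 1."""
    return max(int(ious.sum().item()), 1)

def assign_positives(costs: torch.Tensor, ious: torch.Tensor):
    """Pick the k lowest-cost candidates of one cross region as positives.

    costs, ious: 1-D tensors over the candidates j in the cross region P.
    Returns a boolean positive mask and the classification-target weights:
    1 for positives, IoU for the remaining in-region negatives (negatives
    outside any cross region keep weight 0 in the heatmap target)."""
    k = min(dynamic_k(ious), costs.numel())          # never request more than |P|
    pos_idx = torch.topk(costs, k, largest=False).indices
    pos_mask = torch.zeros_like(costs, dtype=torch.bool)
    pos_mask[pos_idx] = True
    cls_weight = torch.where(pos_mask, torch.ones_like(ious), ious)
    return pos_mask, cls_weight
```

Because k is driven by the summed IoUs rather than by object size, large and small objects receive comparable numbers of positives, which is the balancing effect DCLA aims for.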
3.2 Rotation-Weighted IoU

In general, different object categories exhibit significant scale variations, and various attributes such as location, size, and rotation also possess scale differences. Many existing methods employ the L-norm loss as the regression loss. However, this loss function renders the model sensitive to differences in both object and attribute scales. Consequently, large objects and attributes dominate the total loss. The IoU metric normalizes object attributes, making it immune to scale differences. Moreover, the optimization objective of an IoU-based loss aligns with the evaluation metrics of detection models. Therefore, substituting the L-norm loss with an IoU-based loss often yields accuracy improvements [Zheng et al., 2020; Rezatofighi et al., 2019; Sheng et al., 2022].

Utilizing an IoU-based loss in 3D object detection poses several challenges. Firstly, calculating the traditional IoU requires the computation of polyhedron volumes, which is complex and computationally expensive. Secondly, the traditional IoU-based loss, due to its tight coupling with rotation, can lead to misdirection in optimization, resulting in training instability [Sheng et al., 2022]. Lastly, integrating the traditional IoU metric with object directions is not trivial; therefore, an L1 loss or a direction loss becomes necessary to aid models in classifying object directions.

To tackle the aforementioned challenges, we propose a rotation-weighted IoU (RWIoU). It thoroughly decouples rotation from the IoU calculation, making the computation similar to the axis-aligned IoU computation. RWIoU can be implemented with just a few lines of code. By integrating the sine and cosine values of object rotations into a rotation weighting term, our RWIoU can penalize rotation and direction errors simultaneously. The RWIoU calculation process is shown in Figure 3.

Figure 3: The calculation process of RWIoU.

RWIoU first treats two rotated boxes $B_1$ and $B_2$ as axis-aligned boxes, and then calculates the intersecting volume of the two axis-aligned boxes as follows:

$s_L = \max(x_1 - l_1/2,\ x_2 - l_2/2)$, $\quad s_R = \min(x_1 + l_1/2,\ x_2 + l_2/2)$,
$s_B = \max(y_1 - w_1/2,\ y_2 - w_2/2)$, $\quad s_T = \min(y_1 + w_1/2,\ y_2 + w_2/2)$,
$s_D = \max(z_1 - h_1/2,\ z_2 - h_2/2)$, $\quad s_U = \min(z_1 + h_1/2,\ z_2 + h_2/2)$,
$V_{inter} = \max(s_R - s_L, 0) \cdot \max(s_T - s_B, 0) \cdot \max(s_U - s_D, 0)$,  (3)

where $(x_i, y_i, z_i), i \in \{1, 2\}$ denote the locations of the box centers, $(l_i, w_i, h_i), i \in \{1, 2\}$ represent the sizes of the boxes, and $V_{inter}$ denotes the intersecting volume of the two axis-aligned boxes. Then, $V_{inter}$ is updated according to the rotation difference of the two boxes as follows:

$\omega_s = 1 - \alpha\,|\sin\theta_2 - \sin\theta_1|$, $\quad \omega_c = 1 - \alpha\,|\cos\theta_2 - \cos\theta_1|$, $\quad \omega = \omega_s \omega_c$, $\quad V_{weighted} = \omega V_{inter}$,  (4)

where $\theta_1$ and $\theta_2$ represent the rotations of the two boxes, $\omega_s$ and $\omega_c$ denote the sine and cosine rotation error factors, respectively, which are normalized to the range of [0, 1], $\omega$ represents the rotation weighting term, $V_{weighted}$ is the rotation-weighted value of $V_{inter}$, and $\alpha \in [0, 1]$ is a hyper-parameter used to control the contribution of rotation to the RWIoU. If $\alpha = 0$, RWIoU degrades to the axis-aligned IoU. After obtaining $V_{weighted}$, the value of RWIoU is calculated as follows:

$V_{union} = V_1 + V_2 - V_{weighted}$,  (5)

$\mathrm{RWIoU} = \dfrac{V_{weighted}}{V_{union}}$,  (6)

where $V_1$ and $V_2$ represent the volumes of the two boxes, respectively. The gradient analysis of RWIoU is provided in the Appendix.

3.3 Loss Function

Single-stage detectors typically encounter misalignment between classification confidence and localization accuracy. To solve this misalignment problem, we follow Zheng et al. [2021] and introduce an extra IoU prediction branch. The classification loss $L_{cls}$ and IoU prediction loss $L_{iou}$ are the same as those of CIA-SSD [Zheng et al., 2021]. The regression loss $L_{reg}$ is based on the RWIoU and is calculated as follows:

$L_{reg} = \dfrac{1}{N} \sum_{i=1}^{N} \left(1 - \mathrm{RWIoU}_i + \left(\dfrac{D_i}{\mathrm{Diag}_i}\right)^2\right)$,  (7)

where $N$ is the total number of positive samples, and $\mathrm{RWIoU}_i$ and $D_i$ represent the RWIoU value and the L2 distance between centers, respectively. Additionally, $\mathrm{Diag}_i$ denotes the diagonal length of the minimal enclosing box of the $i$-th predicted box and its ground truth. The term $D_i/\mathrm{Diag}_i$ is used to optimize the prediction of center locations. Since our RWIoU incorporates sine and cosine functions to represent the rotation angle of a bounding box, the need for a direction loss is eliminated. The overall loss function is calculated as follows:

$L = \lambda_{cls} L_{cls} + \lambda_{reg} L_{reg} + \lambda_{iou} L_{iou}$,  (8)

where $\lambda_{cls}$, $\lambda_{reg}$, and $\lambda_{iou}$ are the weights of the classification, regression, and IoU prediction losses, respectively.
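As a reading aid, Eqs. (3)-(7) can be sketched in a few lines of PyTorch. This is a hedged illustration rather than the released implementation: the (x, y, z, l, w, h, θ) box layout, the axis-aligned form of the enclosing-box diagonal, and the small clamps for numerical safety are assumptions.

```python
import torch

def rwiou(box1: torch.Tensor, box2: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Rotation-weighted IoU following Eqs. (3)-(6).

    box1, box2: (..., 7) tensors of (x, y, z, l, w, h, theta).
    alpha controls the contribution of rotation; alpha = 0 recovers the
    plain axis-aligned IoU."""
    x1, y1, z1, l1, w1, h1, t1 = box1.unbind(-1)
    x2, y2, z2, l2, w2, h2, t2 = box2.unbind(-1)

    # Eq. (3): intersection of the two boxes treated as axis-aligned.
    inter_l = torch.minimum(x1 + l1 / 2, x2 + l2 / 2) - torch.maximum(x1 - l1 / 2, x2 - l2 / 2)
    inter_w = torch.minimum(y1 + w1 / 2, y2 + w2 / 2) - torch.maximum(y1 - w1 / 2, y2 - w2 / 2)
    inter_h = torch.minimum(z1 + h1 / 2, z2 + h2 / 2) - torch.maximum(z1 - h1 / 2, z2 - h2 / 2)
    v_inter = inter_l.clamp(min=0) * inter_w.clamp(min=0) * inter_h.clamp(min=0)

    # Eq. (4): rotation weighting from sine/cosine differences, penalizing
    # rotation and direction errors at the same time.
    omega = (1 - alpha * (t2.sin() - t1.sin()).abs()) * (1 - alpha * (t2.cos() - t1.cos()).abs())
    v_weighted = omega * v_inter

    # Eqs. (5)-(6): rotation-weighted union and final ratio.
    v_union = l1 * w1 * h1 + l2 * w2 * h2 - v_weighted
    return v_weighted / v_union.clamp(min=1e-6)


def rwiou_reg_loss(pred: torch.Tensor, gt: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Eq. (7): 1 - RWIoU plus a normalized center-distance penalty, averaged
    over positive samples. The exact normalizer Diag is assumed here to be the
    diagonal of the minimal axis-aligned box enclosing both boxes."""
    center_dist = (pred[..., :3] - gt[..., :3]).norm(dim=-1)
    lo = torch.minimum(pred[..., :3] - pred[..., 3:6] / 2, gt[..., :3] - gt[..., 3:6] / 2)
    hi = torch.maximum(pred[..., :3] + pred[..., 3:6] / 2, gt[..., :3] + gt[..., 3:6] / 2)
    diag = (hi - lo).norm(dim=-1).clamp(min=1e-6)
    return (1 - rwiou(pred, gt, alpha) + (center_dist / diag) ** 2).mean()
```

Note that with α = 0 the rotation weight ω equals 1 and the function reduces to the plain axis-aligned IoU, matching the degenerate case described above.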
4 Experiments

In this section, we evaluate models on the widely used 3D object detection benchmark datasets Waymo Open [Sun et al., 2020] and KITTI [Geiger et al., 2012].

4.1 Implementation Setup

Data Preprocessing

For the Waymo Open dataset, the detection range is [-74.88, 74.88] m for the X and Y axes and [-2, 4] m for the Z axis, and the voxel size is set to (0.08, 0.08, 0.15) m. For the KITTI dataset, the detection range is [0, 70.4] m for the X axis, [-40, 40] m for the Y axis, and [-5, 3] m for the Z axis, and the voxel size is set to (0.05, 0.05, 0.1) m.

Training Details

The backbone of our DCDet is the same as that of CenterPoint [Yin et al., 2021]. Following PillarNeXt [Li et al., 2023], we use feature upsampling in the detection head of DCDet, which increases the output resolution with only a little overhead. All models are trained from scratch in an end-to-end manner with the Adam optimizer and a learning rate of 0.003. The parameter α used in Eq. (4) is set to 0.5. The parameters λcls and λiou used in Eq. (8) are both set to 1, and the parameter λreg used in Eq. (1) and Eq. (8) is set to 3. For the Waymo Open and KITTI datasets, the parameter r used in DCLA is set to 1 and 3, respectively. On the Waymo Open and KITTI datasets, models are trained for 30 epochs with a batch size of 24 and for 80 epochs with a batch size of 8, respectively. A hyper-parameter analysis is provided in the Appendix.
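For quick reference, the settings above can be collected into a single configuration sketch. Field names and nesting are assumptions for illustration; the released code builds on OpenPCDet and may organize these values differently.

```python
# Hedged summary of the stated training settings (not the OpenPCDet config schema).
DCDET_CONFIG = {
    "waymo": {
        "point_cloud_range": [-74.88, -74.88, -2.0, 74.88, 74.88, 4.0],  # x/y/z min, x/y/z max
        "voxel_size": [0.08, 0.08, 0.15],
        "dcla_r": 1,          # Manhattan radius of the cross-shaped region
        "epochs": 30,
        "batch_size": 24,
    },
    "kitti": {
        "point_cloud_range": [0.0, -40.0, -5.0, 70.4, 40.0, 3.0],
        "voxel_size": [0.05, 0.05, 0.1],
        "dcla_r": 3,
        "epochs": 80,
        "batch_size": 8,
    },
    "optimizer": {"type": "adam", "lr": 0.003},
    "loss_weights": {"cls": 1.0, "reg": 3.0, "iou": 1.0},  # lambda_cls, lambda_reg, lambda_iou
    "rwiou_alpha": 0.5,
}
```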
4.2 Comparison with State-of-the-Art Methods

The baseline models presented in Table 1 primarily utilize either center-based or anchor-based label assignment. Moreover, they commonly employ the L-norm regression loss. As depicted in Table 1, the center-based label assignment demonstrates a significant advantage over the anchor-based label assignment on the Waymo Open dataset. Nevertheless, our DCDet, featuring a lightweight single-stage network, surpasses the state-of-the-art center-based method DSVT, which employs a heavy backbone network. Notably, even our DCDet model trained on only 20% of the training samples outperforms both the center-based and anchor-based methods trained on the entire dataset. These results demonstrate the superior performance of our DCDet framework, which employs DCLA and the RWIoU-based regression loss.

We also evaluated our DCDet on the Waymo Open test set by submitting the results to the official server. The performance comparisons are presented in Table 2, revealing that our DCDet surpasses previous state-of-the-art methods significantly. Particularly, for small-scale categories such as pedestrians and cyclists, our method demonstrates a substantial advantage due to the balanced and sufficient positive samples provided by DCLA.

| Method | Stages | L2 mAP/mAPH | Veh. (L1) | Ped. (L1) | Cyc. (L1) | Veh. (L2) | Ped. (L2) | Cyc. (L2) |
|---|---|---|---|---|---|---|---|---|
| LiDAR R-CNN (a) [Li et al., 2021] | 2 | 65.8/61.3 | 76.0/75.5 | 71.2/58.7 | 68.6/66.9 | 68.3/67.9 | 63.1/51.7 | 66.1/64.4 |
| Part-A2-Net (a) [Shi et al., 2020b] | 2 | 66.9/63.8 | 77.1/76.5 | 75.2/66.9 | 68.6/67.4 | 68.5/68.0 | 66.2/58.6 | 66.1/64.9 |
| Voxel R-CNN (a) [Deng et al., 2021] | 2 | 68.6/66.2 | 76.1/75.7 | 78.2/72.0 | 70.8/69.7 | 68.2/67.7 | 69.3/63.6 | 68.3/67.2 |
| PV-RCNN (c) [Shi et al., 2020a] | 2 | 69.6/67.2 | 78.0/77.5 | 79.2/73.0 | 71.5/70.3 | 69.4/69.0 | 70.4/64.7 | 69.0/67.8 |
| PV-RCNN++ (c) [Shi et al., 2023] | 2 | 71.7/69.5 | 79.3/78.8 | 81.8/76.3 | 73.7/72.7 | 70.6/70.2 | 73.2/68.0 | 71.2/70.2 |
| FSD [Fan et al., 2022b] | 2 | 72.9/70.8 | 79.2/78.8 | 82.6/77.3 | 77.1/76.0 | 70.5/70.1 | 73.9/69.1 | 74.4/73.3 |
| SECOND* (a) [Yan et al., 2018] | 1 | 61.0/57.2 | 72.3/71.7 | 68.7/58.2 | 60.6/59.3 | 63.9/63.3 | 60.7/51.3 | 58.3/57.0 |
| PointPillars* (a) [Lang et al., 2019] | 1 | 62.8/57.8 | 72.1/71.5 | 70.6/56.7 | 64.4/62.3 | 63.6/63.1 | 62.8/50.3 | 61.9/59.9 |
| IA-SSD (a) [Zhang et al., 2022b] | 1 | 66.8/63.3 | 70.5/69.7 | 69.4/58.5 | 67.7/65.3 | 61.6/61.0 | 60.3/50.7 | 65.0/62.7 |
| SST* (a) [Fan et al., 2022a] | 1 | 67.8/64.6 | 74.2/73.8 | 78.7/69.6 | 70.7/69.6 | 65.5/65.1 | 70.0/61.7 | 68.0/66.9 |
| CenterPoint (c) [Yin et al., 2021] | 1 | 68.2/65.8 | 74.2/73.6 | 76.6/70.5 | 72.3/71.1 | 66.2/65.7 | 68.8/63.2 | 69.7/68.5 |
| VoxSet (c) [He et al., 2022] | 1 | 69.1/66.2 | 74.5/74.0 | 80.0/72.4 | 71.6/70.3 | 66.0/65.6 | 72.5/65.4 | 69.0/67.7 |
| PillarNet (c) [Shi et al., 2022] | 1 | 71.0/68.5 | 79.1/78.6 | 80.6/74.0 | 72.3/66.2 | 70.9/70.5 | 72.3/66.2 | 69.7/68.7 |
| AFDetV2 (c) [Hu et al., 2022] | 1 | 71.0/68.8 | 77.6/77.1 | 80.2/74.6 | 73.7/72.7 | 69.7/69.2 | 72.2/67.0 | 71.0/70.1 |
| CenterFormer (c) [Zhou et al., 2022] | 1 | 71.1/68.9 | 75.0/74.4 | 78.6/73.0 | 72.3/71.3 | 69.9/69.4 | 73.6/68.3 | 69.8/68.8 |
| SWFormer (c) [Sun et al., 2022] | 1 | -/- | 77.8/77.3 | 80.9/72.7 | -/- | 69.2/68.8 | 72.5/64.9 | -/- |
| PillarNeXt (c) [Li et al., 2023] | 1 | 71.9/69.7 | 78.4/77.9 | 82.5/77.1 | 73.2/72.2 | 70.3/69.8 | 74.9/69.8 | 70.6/69.6 |
| DSVT (Pillar) (c) [Wang et al., 2023] | 1 | 73.2/71.0 | 79.3/78.8 | 82.8/77.0 | 76.4/75.4 | 70.9/70.5 | 75.2/69.8 | 73.6/72.7 |
| DCDet (20%) (ours) | 1 | 74.0/71.5 | 79.2/78.7 | 83.8/77.6 | 77.4/76.3 | 71.0/70.6 | 76.2/70.2 | 74.8/73.7 |
| DCDet (ours) | 1 | 75.0/72.7 | 79.5/79.0 | 84.1/78.5 | 79.4/78.3 | 71.6/71.1 | 76.7/71.3 | 76.8/75.7 |

Table 1: Performance comparisons on the Waymo Open validation set. The results of AP/APH are reported. *: reported by [Fan et al., 2022b]; other baseline results are reported by [Shi et al., 2023] and [Wang et al., 2023]. (a) and (c) denote anchor-based and center-based label assignment, respectively. 20% denotes that only 20% of the training samples are used.

| Method | L2 mAP/mAPH | Veh. (L1) | Ped. (L1) | Cyc. (L1) | Veh. (L2) | Ped. (L2) | Cyc. (L2) |
|---|---|---|---|---|---|---|---|
| CenterPoint [Yin et al., 2021] | - | 80.2/79.7 | 78.3/72.1 | - | 72.2/71.8 | 72.2/66.4 | - |
| PV-RCNN [Shi et al., 2020a] | 71.2/68.8 | 80.6/80.2 | 78.2/72.0 | 71.8/70.4 | 72.8/72.4 | 71.8/66.1 | 69.1/67.8 |
| PillarNet-18 [Shi et al., 2022] | 71.3/68.5 | 81.9/81.4 | 80.0/72.7 | 68.0/66.8 | 74.5/74.0 | 74.0/67.1 | 65.5/64.4 |
| AFDetV2 [Hu et al., 2022] | 72.2/70.0 | 80.5/80.0 | 79.8/74.4 | 72.4/71.2 | 73.0/72.6 | 73.7/68.6 | 69.8/68.7 |
| PV-RCNN++ [Shi et al., 2023] | 72.4/70.2 | 81.6/81.2 | 80.4/75.0 | 71.9/70.8 | 73.9/73.5 | 74.1/69.0 | 69.3/68.2 |
| DCDet (ours) | 75.7/73.3 | 82.2/81.7 | 83.4/77.8 | 77.3/76.1 | 74.8/74.4 | 77.5/72.1 | 74.7/73.5 |

Table 2: Performance comparisons on the Waymo Open test set, obtained by submitting to the official test evaluation server. The results are achieved using single point cloud frames. No test-time augmentation is used.

4.3 Effect on Different Backbone Networks

To assess the generality of our DCLA and RWIoU, we conduct experiments by incorporating them into several widely used backbone networks, namely SECOND, PillarNet, and DSVT. All models are reproduced using the OpenPCDet [Team, 2020] codebase. We train these models using both 20% and 100% of the training data from the Waymo Open dataset and present the results in Table 3. As is evident from the table, the integration of our DCLA and RWIoU yields significant improvements across all model groups. This underscores the generality and effectiveness of our proposed DCLA and RWIoU techniques. Notably, DCLA and the RWIoU-based regression loss are learning strategies, so the improvements come at no extra inference cost. Even when trained on only 20% of the training data, the models integrated with our DCLA and RWIoU techniques either surpass or catch up to the performance of models trained on the entire training data without these enhancements. This demonstrates that our learning strategies enhance the utilization of training data,
which is particularly valuable considering the high cost associated with labeling 3D bounding boxes.

4.4 Ablation Study

To further study the influence of each component of DCDet, we perform a comprehensive ablation analysis on the Waymo Open and KITTI datasets. For the Waymo Open dataset, following prior works [Shi et al., 2020a; Wang et al., 2023], models are trained on 20% of the training samples and evaluated on the whole validation set. For the KITTI dataset, models are trained on the train set and evaluated on the val set.

Effect of RWIoU and DCLA

The baseline model adopts center-based label assignment and the L1 regression loss. To evaluate the effectiveness of our proposed methods, we systematically integrate the RWIoU-based regression loss and DCLA into the baseline model. The ablation results are presented in Table 4. We observe a notable performance improvement when incorporating the RWIoU-based regression loss, as demonstrated by the results in the 1st and 2nd rows of Table 4. This suggests that the proposed loss function is better suited to the task of 3D object detection than the traditional L1 loss. Furthermore, models trained with DCLA consistently achieve significantly better performance than the baseline, as illustrated in the 1st and 3rd rows of Table 4. This indicates that DCLA facilitates improved utilization of the available training data, thus enhancing overall model performance. Notably, when both the RWIoU-based regression loss and DCLA are used, the model achieves the highest performance among all evaluated models. These findings validate the effectiveness of our proposed methods and highlight the importance of carefully designing the loss function and label assignment for improving the performance of 3D object detectors.

Comparison with Other Regression Losses

Table 5 provides a comparison of different regression losses. All models utilize the DCLA scheme and the same backbone network. The results in the 1st, 2nd, and 3rd rows of Table 5 reveal marginal differences between the L1, RDIoU-based [Sheng et al., 2022], and ODIoU-based [Shi et al., 2022] regression losses. However, our RWIoU-based loss exhibits a significant performance improvement compared to the other regression losses, as demonstrated in the 4th row of Table 5.
These results highlight the effectiveness of our RWIoU, which decouples rotation from the IoU calculation by introducing rotation weighting. Notably, the RDIoU-based loss necessitates an additional direction classification loss, and the ODIoU-based loss requires an extra L1 loss. In contrast, our RWIoU-based loss is a pure IoU-based loss without any auxiliary losses. This simplification allows our approach to achieve superior performance without introducing additional complexity.

| Method | Training Data | L1 mAP/mAPH | Veh. (L1) | Ped. (L1) | Cyc. (L1) | L2 mAP/mAPH | Veh. (L2) | Ped. (L2) | Cyc. (L2) |
|---|---|---|---|---|---|---|---|---|---|
| SECOND | 20% | 64.8/60.4 | 70.9/70.3 | 65.8/54.8 | 57.8/56.2 | 58.7/54.7 | 62.6/62.0 | 57.8/48.0 | 55.7/54.2 |
| SECOND* | 20% | 73.4/70.0 | 74.0/73.3 | 77.0/69.1 | 69.2/67.7 | 67.1/64.0 | 65.7/65.2 | 68.7/61.3 | 66.9/65.4 |
| Improvement | N/A | +8.6/+9.6 | +3.1/+3.0 | +11.2/+14.3 | +11.4/+11.5 | +8.4/+9.3 | +3.1/+3.2 | +10.9/+13.3 | +11.2/+11.2 |
| PillarNet | 20% | 71.6/68.0 | 72.9/72.3 | 73.0/64.1 | 68.9/67.6 | 65.6/62.3 | 64.9/64.4 | 65.3/57.2 | 66.5/65.2 |
| PillarNet* | 20% | 75.1/70.9 | 75.6/75.0 | 78.1/67.7 | 71.7/70.0 | 69.0/65.1 | 67.8/67.3 | 70.0/60.4 | 69.2/67.6 |
| Improvement | N/A | +3.5/+2.9 | +2.7/+2.7 | +5.1/+3.6 | +2.8/+2.4 | +3.4/+2.8 | +2.9/+2.9 | +4.7/+3.2 | +2.7/+2.4 |
| DSVT | 20% | 78.3/75.3 | 78.1/77.6 | 82.3/74.8 | 74.6/73.5 | 72.2/69.3 | 69.8/69.3 | 74.7/67.7 | 72.0/71.0 |
| DSVT* | 20% | 79.8/76.5 | 79.2/78.7 | 83.6/75.3 | 76.5/75.4 | 73.7/70.6 | 71.1/70.7 | 76.2/68.3 | 73.9/72.8 |
| Improvement | N/A | +1.5/+1.2 | +1.1/+1.1 | +1.3/+0.5 | +1.9/+1.9 | +1.5/+1.3 | +1.3/+1.4 | +1.5/+0.6 | +1.9/+1.8 |
| SECOND | 100% | 67.2/63.1 | 72.3/71.7 | 68.7/58.2 | 60.6/59.3 | 61.0/57.2 | 63.9/63.3 | 60.7/51.3 | 58.3/57.1 |
| SECOND* | 100% | 74.2/71.0 | 74.4/73.8 | 78.4/70.8 | 69.9/68.5 | 68.0/65.1 | 66.3/65.9 | 70.2/63.2 | 67.5/66.1 |
| Improvement | N/A | +7.0/+7.9 | +2.1/+2.1 | +9.7/+12.6 | +9.3/+9.2 | +7.0/+7.9 | +2.4/+2.6 | +9.5/+12.9 | +9.2/+9.0 |
| PillarNet | 100% | 73.4/70.0 | 74.0/73.5 | 75.3/66.9 | 70.8/69.6 | 67.4/64.3 | 66.2/65.7 | 67.7/60.0 | 68.3/67.1 |
| PillarNet* | 100% | 75.7/71.9 | 75.8/75.3 | 79.1/69.7 | 72.2/70.7 | 69.7/66.1 | 68.2/67.6 | 71.1/62.4 | 69.8/68.4 |
| Improvement | N/A | +2.3/+1.9 | +1.8/+1.8 | +3.8/+2.8 | +1.4/+1.1 | +2.3/+1.8 | +2.0/+1.9 | +3.4/+2.4 | +1.5/+1.3 |
| DSVT | 100% | 80.1/77.4 | 79.1/78.6 | 82.7/76.3 | 78.4/77.3 | 73.8/71.3 | 70.9/70.5 | 75.0/68.9 | 75.6/74.6 |
| DSVT* | 100% | 81.5/78.7 | 80.4/79.9 | 84.5/77.4 | 79.7/78.6 | 75.7/72.9 | 72.6/72.1 | 77.2/70.4 | 77.2/76.2 |
| Improvement | N/A | +1.4/+1.3 | +1.3/+1.3 | +1.8/+1.1 | +1.3/+1.3 | +1.9/+1.6 | +1.7/+1.6 | +2.2/+1.5 | +1.6/+1.6 |

Table 3: Effect on different backbone networks. The results of AP/APH on the Waymo Open validation set are reported. * indicates that our DCLA and RWIoU-based regression loss are applied.

| RWIoU | DCLA | Vehicle | Pedestrian | Cyclist |
|---|---|---|---|---|
|  |  | 69.2/68.7 | 73.4/68.5 | 72.6/71.5 |
| ✓ |  | 69.9/69.3 | 74.3/68.5 | 74.1/73.1 |
|  | ✓ | 70.5/70.0 | 75.2/69.7 | 74.4/73.3 |
| ✓ | ✓ | 71.0/70.5 | 75.9/70.1 | 75.1/74.0 |

Table 4: Effect of different components of DCDet. RWIoU and DCLA denote the RWIoU-based regression loss and dynamic cross label assignment, respectively. The LEVEL 2 AP/APH results on the Waymo Open validation set are reported.

Comparison with Other Label Assignment Schemes

Table 6 compares different label assignment schemes, with all models using the RWIoU-based regression loss and the same backbone network. As depicted in the 1st and 3rd rows of Table 6, both anchor-based and box-based label assignment exhibit subpar performance on small objects like pedestrians and cyclists. This is mainly due to the unbalanced assignment of positive samples for objects with different scales.
On the other hand, the center-based label assignment, as shown in the 2nd row of Table 6, achieves good results on the Waymo Open dataset but performs poorly on the KITTI dataset. We argue that this discrepancy arises from overlooking a large number of excellent samples, resulting in an insufficient number of positive samples for training on small-scale datasets like KITTI. Moreover, the poor performance of SimOTA [Ge et al., 2021] in 3D object detection, as demonstrated in the 4th row of Table 6, highlights the challenges of directly transferring methods from the 2D domain to the 3D domain. However, our DCLA outperforms these baseline label assignment schemes on both the Waymo Open and KITTI datasets, as illustrated in the last row of Table 6. This confirms that our DCLA can adapt to datasets of different scales by enabling balanced and adequate positive sampling.

| Regression Loss | Vehicle | Pedestrian | Cyclist |
|---|---|---|---|
| L1 | 70.3/69.8 | 75.0/69.6 | 74.0/73.0 |
| RDIoU-based | 70.2/69.7 | 74.8/69.3 | 74.3/73.2 |
| ODIoU-based | 70.5/70.0 | 75.2/69.7 | 74.4/73.3 |
| RWIoU-based | 71.0/70.5 | 75.9/70.1 | 75.1/74.0 |

Table 5: Comparison results of different regression losses. The LEVEL 2 AP/APH results on the Waymo Open validation set are reported.

| Label Assignment | Vehicle (Waymo) | Pedestrian (Waymo) | Cyclist (Waymo) | Mod. Car (KITTI) |
|---|---|---|---|---|
| Anchor-based | 67.8/67.3 | 63.4/55.5 | 67.7/66.5 | 85.37 |
| Center-based | 69.9/69.3 | 74.3/68.5 | 74.1/73.1 | 75.49 |
| Box-based | 67.8/67.4 | 66.2/61.4 | 69.9/69.0 | 85.32 |
| SimOTA | 68.7/68.3 | 67.8/63.1 | 72.2/71.2 | 85.45 |
| DCLA | 71.0/70.5 | 75.9/70.1 | 75.1/74.0 | 85.82 |

Table 6: Comparison results of different label assignment schemes. The LEVEL 2 AP/APH results on the Waymo Open validation set and the moderate AP_R40 results on the KITTI val set are reported.

5 Conclusion

In this paper, we propose a dynamic cross label assignment (DCLA), which dynamically assigns positive samples from a cross-shaped region for each object. The DCLA scheme mitigates the imbalance issue of the anchor-based assignment and the loss of high-quality samples of the center-based assignment. Thanks to the balanced and adequate positive sampling, DCLA effectively adapts to datasets of different scales. Moreover, a rotation-weighted IoU (RWIoU), which incorporates rotation and direction through rotation weighting, is introduced to measure the proximity of two rotated boxes. Extensive experiments conducted on various datasets demonstrate the generality and effectiveness of our methods.

Acknowledgments

This work is supported by the Project of Guangxi Key R & D Program (No. Guike AB24010324).

References

[Deng et al., 2021] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. Voxel r-cnn: Towards high performance voxel-based 3d object detection. In AAAI, 2021.

[Fan et al., 2022a] Lue Fan, Ziqi Pang, Tianyuan Zhang, Yu-Xiong Wang, Hang Zhao, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang. Embracing single stride 3d object detector with sparse transformer. In CVPR, 2022.

[Fan et al., 2022b] Lue Fan, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang. Fully sparse 3d object detection. In NeurIPS, 2022.

[Ge et al., 2020] Runzhou Ge, Zhuangzhuang Ding, Yihan Hu, Yu Wang, Sijia Chen, Li Huang, and Yuan Li. Afdet: Anchor free one stage 3d object detection. arXiv preprint arXiv:2006.12671, 2020.

[Ge et al., 2021] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
[Geiger et al., 2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In CVPR, 2012.

[He et al., 2022] Chenhang He, Ruihuang Li, Shuai Li, and Lei Zhang. Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. In CVPR, 2022.

[Hu et al., 2022] Yihan Hu, Zhuangzhuang Ding, Runzhou Ge, Wenxin Shao, Li Huang, Kun Li, and Qiang Liu. Afdetv2: Rethinking the necessity of the second stage for object detection from point clouds. In AAAI, 2022.

[Lang et al., 2019] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019.

[Li et al., 2021] Zhichao Li, Feng Wang, and Naiyan Wang. Lidar r-cnn: An efficient and universal 3d object detector. In CVPR, 2021.

[Li et al., 2023] Jinyu Li, Chenxu Luo, and Xiaodong Yang. Pillarnext: Rethinking network designs for 3d object detection in lidar point clouds. In CVPR, 2023.

[Lin et al., 2017] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.

[Qi et al., 2017a] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017.

[Qi et al., 2017b] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.

[Rezatofighi et al., 2019] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.

[Sheng et al., 2022] Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Jianqiang Huang, Xian-Sheng Hua, Min-Jian Zhao, and Gim Hee Lee. Rethinking iou-based optimization for single-stage 3d object detection. In ECCV, 2022.

[Shi et al., 2019] Shaoshuai Shi, Xiaogang Wang, and Hongsheng Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, 2019.

[Shi et al., 2020a] Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In CVPR, 2020.

[Shi et al., 2020b] Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation network. TPAMI, 2020.

[Shi et al., 2022] Guangsheng Shi, Ruifeng Li, and Chao Ma. Pillarnet: Real-time and high-performance pillar-based 3d object detection. In ECCV, 2022.

[Shi et al., 2023] Shaoshuai Shi, Li Jiang, Jiajun Deng, Zhe Wang, Chaoxu Guo, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. IJCV, 2023.

[Sun et al., 2020] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.

[Sun et al., 2022] Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, and Dragomir Anguelov. Swformer: Sparse window transformer for 3d object detection in point clouds. In ECCV, 2022.

[Team, 2020] OpenPCDet Development Team. Openpcdet: An open-source toolbox for 3d object detection from point clouds. https://github.com/open-mmlab/OpenPCDet, 2020.
[Tian et al., 2019] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.

[Wang et al., 2021] Qi Wang, Jian Chen, Jianqiang Deng, and Xinfang Zhang. 3d-centernet: 3d object detection network for point clouds with center estimation priority. Pattern Recognition, 2021.

[Wang et al., 2023] Haiyang Wang, Chen Shi, Shaoshuai Shi, Meng Lei, Sen Wang, Di He, Bernt Schiele, and Liwei Wang. Dsvt: Dynamic sparse voxel transformer with rotated sets. In CVPR, 2023.

[Xu et al., 2022] Qiangeng Xu, Yiqi Zhong, and Ulrich Neumann. Behind the curtain: Learning occluded shapes for 3d object detection. In AAAI, 2022.

[Yan et al., 2018] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 2018.

[Yang et al., 2020] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In CVPR, 2020.

[Yin et al., 2021] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center-based 3d object detection and tracking. In CVPR, 2021.

[Zhang et al., 2020] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, 2020.

[Zhang et al., 2022a] Yi-Fan Zhang, Weiqiang Ren, Zhang Zhang, Zhen Jia, Liang Wang, and Tieniu Tan. Focal and efficient iou loss for accurate bounding box regression. Neurocomputing, 2022.

[Zhang et al., 2022b] Yifan Zhang, Qingyong Hu, Guoquan Xu, Yanxin Ma, Jianwei Wan, and Yulan Guo. Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds. In CVPR, 2022.

[Zheng et al., 2020] Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. Distance-iou loss: Faster and better learning for bounding box regression. In AAAI, 2020.

[Zheng et al., 2021] Wu Zheng, Weiliang Tang, Sijin Chen, Li Jiang, and Chi-Wing Fu. Cia-ssd: Confident iou-aware single-stage object detector from point cloud. In AAAI, 2021.

[Zhou and Tuzel, 2018] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In CVPR, 2018.

[Zhou et al., 2019a] Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. Iou loss for 2d/3d object detection. In 3DV, 2019.

[Zhou et al., 2019b] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.

[Zhou et al., 2022] Zixiang Zhou, Xiangchen Zhao, Yu Wang, Panqu Wang, and Hassan Foroosh. Centerformer: Center-based transformer for 3d object detection. In ECCV, 2022.

[Zhu et al., 2020] Benjin Zhu, Jianfeng Wang, Zhengkai Jiang, Fuhang Zong, Songtao Liu, Zeming Li, and Jian Sun. Autoassign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020.