# Rethinking Rotation Invariance with Point Cloud Registration

Jianhui Yu, Chaoyi Zhang, Weidong Cai
School of Computer Science, University of Sydney, Australia
{jianhui.yu, chaoyi.zhang, tom.cai}@sydney.edu.au

Recent investigations on rotation invariance for 3D point clouds have been devoted to devising rotation-invariant feature descriptors or learning canonical spaces where objects are semantically aligned. Learning frameworks for invariance, however, have seldom been examined. In this work, we review rotation invariance in terms of point cloud registration and propose an effective framework for rotation invariance learning via three sequential stages, namely rotation-invariant shape encoding, aligned feature integration, and deep feature registration. We first encode shape descriptors constructed with respect to reference frames defined over different scales, e.g., local patches and global topology, to generate rotation-invariant latent shape codes. Within the integration stage, we propose Aligned Integration Transformer to produce a discriminative feature representation by integrating point-wise self- and cross-relations established within the shape codes. Meanwhile, we adopt rigid transformations between reference frames to align the shape codes for feature consistency across different scales. Finally, the deep integrated feature is registered to both rotation-invariant shape codes to maximize feature similarities, such that rotation invariance of the integrated feature is preserved and shared semantic information is implicitly extracted from the shape codes. Experimental results on 3D shape classification, part segmentation, and retrieval tasks prove the feasibility of our work. Our project page is released at: https://rotation3d.github.io/.

Introduction

Point cloud analysis has recently drawn much interest from researchers. As point clouds are a common form of 3D representation, their growing presence is encouraging the development of many deep learning methods (Qi et al. 2017a; Guo et al. 2021; Zhang et al. 2021), which show great success on different tasks for well-aligned point clouds. However, it is difficult to directly apply 3D models to real data, since raw 3D objects are normally captured at different viewing angles, resulting in unaligned samples that inevitably impact deep learning models sensitive to rotations. Therefore, rotation invariance becomes an important research topic in the 3D domain.

Figure 1: Frameworks of our design (left) and robust point cloud registration (right), where TI and RI are transformation invariance and rotation invariance, and T is the rigid transformation. The dotted line indicates the computation of T between reference frames.

To achieve rotation invariance, a straightforward way is to augment the training data with massive rotations, which, however, requires a large memory capacity and exhibits limited generalization ability to unseen data (Kim, Park, and Han 2020). There are attempts to align 3D inputs to a canonical pose (Jaderberg et al.
2015; Cohen et al. 2018), or to learn rotation-robust features via equivariance (Deng et al. 2021; Luo et al. 2022), but these methods are not rigorously rotation-invariant and show non-competitive performance on 3D shape analysis. To maintain consistent model behavior under random rotations, some methods (Zhang et al. 2019; Chen et al. 2019; Xu et al. 2021) follow Drost et al. (2010) to handcraft rotation-invariant point-pair features. Others (Zhang et al. 2020; Li et al. 2021a; Zhao et al. 2022) design robust features from equivariant orthonormal bases. Most of the mentioned works either manipulate model inputs or generate canonical spaces to achieve rotation invariance (RI).

In this work, we review the problem of RI from a different aspect: robust point cloud registration (PCR). We find that PCR and RI share the same goal: PCR aligns low-dimensional point cloud features (e.g., xyz) from the source domain to the target domain regardless of transformations, while RI can be considered as aligning high-dimensional latent features to rotation-invariant features. Specifically, the goal of PCR is to explicitly align the source point cloud to the target, both representing the same 3D object, while for RI learning we implicitly align the final feature representation of a 3D shape to a hidden feature of the same shape that is universally invariant to any rotation.

Motivated by this finding, we propose our learning framework in Fig. 1 with three sequential stages, namely rotation-invariant shape encoding, aligned feature integration, and deep feature registration. Firstly, we (a) construct and feed point pairs of different scales as model inputs, where we consider local patches P^ℓ with a small number of points and the global shape P^g with all 3D points. Hence, the final feature representation can be enriched by information from different scales. Low-level rotation-invariant descriptors are thus built on reference frames and encoded to generate latent shape codes F^ℓ and F^g, following recent PCR work (Pan, Cai, and Liu 2022). Secondly, we (b) introduce a variant of the transformer (Vaswani et al. 2017), Aligned Integration Transformer (AIT), to implicitly integrate information from both self- and cross-attention branches for effective feature integration. In this way, information encoded from different point scales is aggregated to represent the same 3D object. Moreover, we consider F^ℓ and F^g as unaligned since they are encoded from unaligned reference frames. To address this problem, we follow the evaluation technique proposed in PCR (Pan, Cai, and Liu 2022), where we use relative rotation information (T) with learnable layers to align F^ℓ and F^g for feature consistency. Finally, to ensure RI of the integrated feature U, we follow PCR and (c) examine the correspondence maps of (F^g, U) and (F^ℓ, U), such that the mutual information between a local patch of a 3D object and the whole 3D object is maximized and RI is further ensured in the final geometric feature.

The contributions of our work are summarized in the following three folds: (1) To our knowledge, we are the first to develop a PCR-cored representation learning framework towards effective RI studies on 3D point clouds. (2) We introduce Aligned Integration Transformer (AIT), a transformer-based architecture that conducts aligned feature integration for a comprehensive geometry study from both local and global scales.
(3) We propose a registration loss to maintain rotation invariance and discover semantic knowledge shared between different parts of the input object. Moreover, the feasibility of our proposed framework is successfully demonstrated on various 3D tasks.

Related Work

Rotation Robust Feature Learning. Networks that are robust to rotations can be equivariant to rotations. Esteves et al. (2018) and Cohen et al. (2018) project 3D data into a spherical space for rotation equivariance and perform convolutions in terms of spherical harmonic bases. Some works (Spezialetti et al. 2020; Sun et al. 2021) learn canonical spaces to unify the pose of point clouds. Recent works (Luo et al. 2022; Deng et al. 2021; Jing et al. 2020) vectorize the scalar activations and map SO(3) actions to a latent space for easy manipulation. Although these works present competitive results, they cannot be strictly rotation-invariant. Another way to achieve rotation robustness is to learn rotation-invariant features. Handcrafted features can be rotation-invariant (Zhang et al. 2019; Chen et al. 2019; Chen and Cong 2022; Xu et al. 2021), but they normally ignore the global overview of 3D objects. Others use rotation-equivariant local reference frames (LRFs) (Zhang et al. 2020; Thomas 2020; Kim, Park, and Han 2020) or global reference frames (GRFs) (Li et al. 2021a) as model inputs based on principal component analysis (PCA). However, they may produce inconsistent features across different reference frames, which limits their representational power. In contrast to the abovementioned methods with rotation-robust model inputs or modules, we examine the relation between RI and PCR and propose an effective framework.

3D Robust Point Cloud Registration. Given a pair of LiDAR scans, 3D PCR seeks an optimal rigid transformation that best aligns the two scans. Despite the recent emergence of ICP-based methods (Besl and McKay 1992; Wang and Solomon 2019b), we follow robust correspondence-based approaches in our work (Deng, Birdal, and Ilic 2018; Yuan et al. 2020; Qin et al. 2022; Pan, Cai, and Liu 2022), where RI is widely used to mitigate the impact of geometric transformations during feature learning. Specifically, both Pan, Cai, and Liu (2022) and Qin et al. (2022) analyze the encoding of transformation-robust information and introduce a rotation-invariant module with contextual information into their registration pipelines. All these methods, which show impressive results, are closely related to rotation invariance. We hypothesize that the learning framework of RI can be similar to that of PCR, and we further show in experiments that our network is feasible and able to achieve competitive performance on rotated point clouds.

Transformers in 3D Point Clouds. Transformers (Dosovitskiy et al. 2021; Liu et al. 2021) applied to 2D vision have shown great success, and they are gaining prominence in 3D point clouds. For example, Zhao et al. (2021) use vectorized self-attention (Vaswani et al. 2017) and positional embeddings for 3D modeling. Guo et al. (2021) propose offset attention for noise-robust geometric representation learning. Cross-attention is widely employed for semantic information exchange (Qin et al. 2022; Yu et al. 2021a), where feature relations between the source and target domains are explored. Taking advantage of both, we design a simple yet effective feature integration module with self- and cross-relations. In addition, transformation-related embeddings are introduced for consistent feature learning.
Contrastive Learning with 3D Visual Correspondence. Based on visual correspondence, contrastive learning aims to train an embedding space where positive samples are pulled together and negative samples are pushed apart (He et al. 2020). The definition of positivity and negativity follows the visual correspondence maps, where pairs with high confidence scores are positive and the rest are negative. Visual correspondence is important in 3D tasks, where semantic information extracted from matched point pairs improves the network's understanding of 3D geometric structures. For example, PointContrast (Xie et al. 2020) explores feature correspondence across multiple views of one 3D point cloud with an InfoNCE loss (Van den Oord, Li, and Vinyals 2018), improving model performance on downstream tasks. Info3D (Sanghi 2020) and CrossPoint (Afham et al. 2022) minimize the semantic difference of point features under different poses. We follow the same idea by registering the deep features to rotation-invariant features at intermediate levels, increasing feature similarities in the embedding space to ensure rotation invariance.

Method

Given a 3D point cloud of N_in points with xyz coordinates, P = {p_i ∈ R^3}_{i=1}^{N_in}, we aim to learn a shape encoder f that is invariant to 3D rotations: f(P) = f(RP), where R ∈ SO(3) and SO(3) is the rotation group. RI is investigated and achieved through three stages, namely rotation-invariant shape encoding, aligned feature integration, and deep feature registration, detailed in the following subsections.

Rotation-Invariant Shape Encoding

In this section, we first construct the input point pairs from local and global scales based on reference frames, following the idea of Pan, Cai, and Liu (2022) to obtain low-level rotation-invariant shape descriptors directly from the LRFs and GRF. We then obtain latent shape codes via two set abstraction layers as in PointNet++ (Qi et al. 2017b).

Rotation Invariance for Local Patches. To construct rotation-invariant features on LRFs, we build an orthonormal basis M_i^ℓ = [→x_i^ℓ, →y_i^ℓ, →z_i^ℓ] ∈ R^{3×3} for each LRF. Given a point p_i and its neighbors p_j ∈ N(p_i), we choose $\overrightarrow{x}_i^{\ell} = \overrightarrow{p_m p_i} / \|\overrightarrow{p_m p_i}\|_2$, where p_m is the barycenter of the local geometry and ‖·‖_2 is the L2-norm. We then define →z_i^ℓ, following Tombari, Salti, and Stefano (2010), to have the same direction as the eigenvector corresponding to the smallest eigenvalue of the weighted covariance matrix obtained via eigenvalue decomposition (EVD):

$$ \sum_{j=1}^{|N(p_i)|} \alpha_j \, (\overrightarrow{p_i p_j})(\overrightarrow{p_i p_j})^{\top}, \qquad \alpha_j = \frac{d - \|\overrightarrow{p_i p_j}\|_2}{\sum_{j=1}^{|N(p_i)|} \left(d - \|\overrightarrow{p_i p_j}\|_2\right)}, \quad (1) $$

where α_j is a weight that lets nearby p_j contribute more to the covariance matrix, and d is the maximum distance between p_i and its neighbors. Finally, we define →y_i^ℓ as →z_i^ℓ × →x_i^ℓ. RI is introduced to p_i with respect to its neighbor p_j as $p_{ij}^{\ell} = \overrightarrow{p_i p_j} \, M_i^{\ell}$. Proofs of the equivariance of M_i^ℓ and the invariance of p_{ij}^ℓ are given in the supplementary material. The latent shape code F^ℓ ∈ R^{N×C} is obtained via PointNet++ and max-pooling.

Rotation Invariance for Global Shape. We apply PCA as a practical tool to obtain RI at the global scale. Similar to Eq. 1, PCA is performed on $\frac{1}{N_0}\sum_{i=1}^{N_0}(\overrightarrow{p_m p_i})(\overrightarrow{p_m p_i})^{\top} = U^g \Lambda^g (U^g)^{\top}$, where p_m is the barycenter of P, and U^g = [→u_1^g, →u_2^g, →u_3^g] and Λ^g = diag(λ_1^g, λ_2^g, λ_3^g) are the eigenvector and eigenvalue matrices. We take U^g as the orthonormal basis M^g = [→x^g, →y^g, →z^g] of the GRF. By transforming each point p_i with U^g, the shape pose is canonicalized as p_i^g = p_i M^g. The proof of the RI of p_i^g is omitted for its simplicity, and F^g ∈ R^{N×C} is obtained following PointNet++.
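To make the descriptor construction concrete, below is a minimal PyTorch sketch of the PCA-based global reference frame and the canonicalization p_i^g = p_i M^g, including a simplified form of the sign disambiguation described next (Eq. 2). The function name, tensor layout, and the tie-breaking rule for non-rotational bases are our assumptions, not the authors' released implementation.

```python
import torch

def global_reference_frame(points: torch.Tensor) -> torch.Tensor:
    """PCA-based GRF sketch: `points` is an (N0, 3) tensor of xyz coordinates.

    Returns the canonicalized points p_i^g = p_i M^g, with a simplified version
    of the sign disambiguation (Eq. 2) and the proper-rotation check.
    """
    barycenter = points.mean(dim=0, keepdim=True)        # p_m
    centered = points - barycenter                       # vectors p_m -> p_i
    cov = centered.T @ centered / points.shape[0]        # 3x3 covariance matrix
    _, eigvecs = torch.linalg.eigh(cov)                  # columns are eigenvectors
    basis = eigvecs                                      # M^g = [x^g, y^g, z^g]

    # Sign disambiguation: each axis should agree with the majority of p_m -> p_i.
    votes = (centered @ basis > 0).sum(dim=0)            # S_x, S_y, S_z
    flip = (votes >= points.shape[0] / 2).to(points.dtype) * 2 - 1
    basis = basis * flip                                 # flip minority-voted axes

    # If M^g is a reflection, reverse the axis with the weakest support
    # (simplified version of the rule described in the text).
    if torch.linalg.det(basis) < 0:
        weakest = torch.argmin(votes)
        basis[:, weakest] = -basis[:, weakest]

    return points @ basis                                # p_i^g = p_i M^g
```

The per-point LRF follows the same recipe, except that the covariance is weighted by α_j over the neighborhood of p_i as in Eq. 1.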
Sign Ambiguity. EVD introduces sign ambiguity for eigenvectors, which negatively impacts the model performance (Bro, Acar, and Kolda 2008). Sign ambiguity means that, for an eigenvector →u, both →u and −→u, the latter pointing in the opposite direction, are acceptable solutions of EVD. To tackle this issue, we simply force →z_i^ℓ of each LRF to follow the direction of $\overrightarrow{o p_i}$, with o being the origin of the world coordinate system. We disambiguate the basis vectors of M^g by computing inner products with $\overrightarrow{p_m p_i}$, i = 1, ..., N_0. Taking →x^g as an example, its direction is conditioned on the following term:

$$ \overrightarrow{x}^g = \begin{cases} \overrightarrow{x}^g, & \text{if } S_x \geq \frac{N_0}{2} \\ -\overrightarrow{x}^g, & \text{otherwise} \end{cases}, \qquad S_x = \sum_{i=1}^{N_0} \mathbb{1}\left[\langle \overrightarrow{x}^g, \overrightarrow{p_m p_i} \rangle\right], \quad (2) $$

where ⟨·,·⟩ is the inner product and 1[·] is a binary indicator that returns 1 if the input argument is positive and 0 otherwise. S_x denotes the number of points for which →x^g and $\overrightarrow{p_m p_i}$ point in the same direction. The same rule is applied to disambiguate →y^g and →z^g via S_y and S_z. Besides, as mentioned in Li et al. (2021a), M^g might be non-rotational (e.g., a reflection). To ensure that M^g is a valid rotation, we simply reverse the direction of the basis vector whose S value is the smallest. More analyses on sign ambiguity are provided in the supplementary material.

Aligned Feature Integration

The transformer has been widely used in the 3D domain to capture long-range dependencies (Yu et al. 2021b). In this section, we introduce Aligned Integration Transformer (AIT), an effective transformer that aligns the latent shape codes with relative rotation angles and integrates information via attention-based integration (Cheng et al. 2021). Within each AIT module, we first apply Intra-frame Aligned Self-attention on F^ℓ; we do not encode F^g in this way, since it is treated as supplementary information that assists local geometry learning with a global shape overview. We argue that encoding F^g via self-attention can increase model overfitting and thus lower the model performance, which we validate in the ablation study. Inter-frame Aligned Cross-attention is then applied on both F^ℓ and F^g, and an Attention-based Feature Integration module is used for information aggregation.

Preliminary: Offset Attention. AIT utilizes offset attention (Guo et al. 2021) for noise robustness. In the following, we use subscripts sa and ca to denote implementations related to self- and cross-attention, respectively. We first review offset attention as follows:

$$ F = \phi(F_{oa}) + F_{in}, \qquad F_{oa} = F_{in} - \frac{\mathrm{SM}(A)}{\|\mathrm{SM}(A)\|_1}\, v, \qquad A = q k^{\top}, \quad (3) $$

where q = F_in W^q, k = F_in W^k ∈ R^{N×d}, and v = F_in W^v ∈ R^{N×C} are the query, key, and value embeddings, and W^q, W^k ∈ R^{C×d}, W^v ∈ R^{C×C} are the corresponding projection matrices. ‖·‖_1 denotes the L1-norm, φ denotes a multi-layer perceptron (MLP), and SM(·) is the softmax operation. F_oa is the offset attention feature and A ∈ R^{N×N} contains the attention logits.
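For reference, a compact PyTorch sketch of the offset-attention block in Eq. 3 is given below. The module name, hidden dimensions, and the exact placement of the softmax/L1 normalization follow our reading of Guo et al. (2021) and are assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    """Sketch of offset attention (Eq. 3): F = phi(F_oa) + F_in,
    with F_oa = F_in - normalize(SM(A)) v and A = q k^T."""

    def __init__(self, channels: int = 256, dim: int = 64):
        super().__init__()
        self.w_q = nn.Linear(channels, dim, bias=False)       # W^q in R^{C x d}
        self.w_k = nn.Linear(channels, dim, bias=False)       # W^k in R^{C x d}
        self.w_v = nn.Linear(channels, channels, bias=False)  # W^v in R^{C x C}
        self.phi = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                 nn.Linear(channels, channels))  # MLP phi

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        # f_in: (N, C) point-wise features
        q, k, v = self.w_q(f_in), self.w_k(f_in), self.w_v(f_in)
        attn = torch.softmax(q @ k.T, dim=0)                  # SM(A), over the first dim
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9) # L1-normalize each row
        f_oa = f_in - attn @ v                                # offset from the input
        return self.phi(f_oa) + f_in                          # residual connection
```

A call such as `OffsetAttention(256)(torch.randn(1024, 256))` returns the refined (N, C) feature.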
๐‘พ๐’„๐’‚ ๐’— ๐‘พ๐’„๐’‚ ๐’Œ ๐‘พ๐’”๐’‚ ๐’’ ๐ž๐’„๐’‚ ๐œถ๐‘พ๐’„๐’‚ ๐œถ ๐‘จ๐’„๐’‚ ๐’‚๐’•๐’•๐’: ๐‘ ๐‘ ๐‘จ๐’„๐’‚ ๐’“๐’๐’•: ๐‘ ๐‘ ๐‘พ๐’”๐’‚ ๐’— ๐‘พ๐’”๐’‚ ๐’Œ ๐‘พ๐’”๐’‚ ๐’’ ๐ค๐’”๐’‚: ๐‘ ๐‘‘ ๐ช๐’”๐’‚: ๐‘ ๐‘‘ ๐ฏ๐’”๐’‚: ๐‘ ๐ถ ๐‘จ๐’”๐’‚ ๐’‚๐’•๐’•๐’: ๐‘ ๐‘ ๐‘จ๐’”๐’‚ ๐’“๐’๐’•: ๐‘ ๐‘ ๐‘ ๐‘ ๐‘‘ ๐‘ ๐‘‘ ๐ฏ๐’„๐’‚: ๐‘ ๐ถ ๐ค๐’”๐’‚: ๐‘ ๐‘‘ ๐ช๐’„๐’‚: ๐‘ ๐‘‘ (a) Intra-frame Aligned Self-attention (b) Inter-frame Aligned Cross-attention ๐ฆ๐š๐ญ๐ซ๐ข๐ฑ ๐ฉ๐ซ๐จ๐๐ฎ๐œ๐ญ ๐ฌ๐ฎ๐›๐ญ๐ซ๐š๐œ๐ญ๐ข๐จ๐ง ๐ฆ๐š๐ญ๐ซ๐ข๐ฑ ๐ฉ๐ซ๐จ๐๐ฎ๐œ๐ญ ๐ฌ๐ฎ๐›๐ญ๐ซ๐š๐œ๐ญ๐ข๐จ๐ง Figure 2: Illustrations of (a) Intra-frame Aligned Self-attention and (b) Inter-frame Aligned Cross-attention modules. Note that we only present processes for computing Foa in both modules. Intra-frame Aligned Self-attention. Point-wise features of Fโ„“are encoded from unaligned LRFs, so direct implementation of self-attention on Fโ„“can cause feature inconsistency during integration. To solve this problem, rigid transformations between distinct LRFs are considered, which are explicitly encoded and injected into point-wise relation learning process. We begin by understanding the transformation between two LRFs. For any pair of local orthonormal bases Mโ„“ i and Mโ„“ j, a rotation can be easily derived Rji = Mโ„“ i Mโ„“ j and translation is defined as tji = oโ„“ i oโ„“ j, where oโ„“ i/j indicates the origin. In our work, the translation part is intentionally ignored, where we show in the supplementary material that by keeping both rotation and translation information, the model performance decreases. Although Rji is invariant to rotations, we do not directly project it into the embedding space, as it is sensitive to the order of matrix product: Rji = Rij, giving inconsistent rotation information when the product order is not maintained. To address this issue, we construct our embedding via the relative rotation angle ฮฑji between Mโ„“ i and Mโ„“ j, which is normally used in most PCR works (Yew and Lee 2020; Pan, Cai, and Liu 2022) for evaluations. The relative rotation angle ฮฑji is computed as: ฮฑji = arccos Trace ( Rji) 1 ฯ€ [0, ฯ€], (4) where it is easy to see that ฮฑji = ฮฑij. We further apply sinusoidal functions on ฮฑji to generate N 2 pairs of angular embeddings eฮฑ RN N d for all N points as: eฮฑ i,j,2k = sin ฮฑji/tฮฑ , eฮฑ i,j,2k+1 = cos ฮฑji/tฮฑ where tฮฑ controls the sensitivity to angle variations. Finally, we inject eฮฑ into offset attention and learn intraframe aligned feature Fโ„“ IAS via self-attention as follows: Fโ„“ IAS = ฯ• Fโ„“ oa + Fโ„“, Fโ„“ oa = Fโ„“ SM(Asa) 1vsa, Asa = Aattn sa + Arot sa , Aattn sa = qsak sa, Arot sa = qsa(eฮฑ sa Wฮฑ sa) , where qsa/ksa/vsa = Fโ„“Wq sa/Fl Wk sa/Fl Wv sa, Wฮฑ sa Rd d is a linear projection to refine the learning of eฮฑ sa, and Asa is the attention logits. The same process can be performed for Fg by swapping the index โ„“and g. Detailed illustrations are shown in Fig. 2 (a). Inter-frame Aligned Cross-attention. Semantic information exchange between Fโ„“and Fg in the feature space is implemented efficiently by cross-attention (Chen, Fan, and Panda 2021). Since Fโ„“and Fg are learned from different coordinate systems, inter-frame transformations should be considered for cross-consistency between Fโ„“and Fg. An illustration of the cross-attention module is shown in Fig. 
Inter-frame Aligned Cross-attention. Semantic information exchange between F^ℓ and F^g in the feature space is implemented efficiently by cross-attention (Chen, Fan, and Panda 2021). Since F^ℓ and F^g are learned from different coordinate systems, inter-frame transformations should be considered for cross-consistency between F^ℓ and F^g. An illustration of the cross-attention module is shown in Fig. 2 (b): the inter-frame aligned feature F^ℓ_IAC is computed via cross-attention in a similar way to Eq. 6 by replacing all subscripts sa with ca. As illustrated in Fig. 2 (b), A_{ca} contains the cross-attention logits capturing point-wise cross-relations over point features defined across local and global scales. e^α_{ca} ∈ R^{N×d} is computed via Eq. 4 and Eq. 5 in terms of the transformation between M_i^ℓ and M^g. To this end, the geometric features learned from local and global reference frames can be aligned given e^α_{ca}, leading to a consistent feature representation.

Attention-based Feature Integration. Instead of simply adding the information from F^ℓ and F^g, we integrate information by incrementing the attention logits. Specifically, we apply self-attention on F^ℓ with attention logits A_{sa} and cross-attention between F^ℓ and F^g with attention logits A_{ca}. We combine A_{sa} and A_{ca} via addition, so that the encoded information of all point pairs from a local domain can be enriched by the global context of the whole shape. An illustration is provided in the supplementary material. The whole process is formulated as follows:

$$ U = \phi(F_{oa}) + F^{\ell}, \qquad F_{oa} = F^{\ell} - \frac{\mathrm{SM}(A_{sa} + A_{ca})}{\|\mathrm{SM}(A_{sa} + A_{ca})\|_1}\,(v_{sa} + v_{ca}). \quad (7) $$

Hence, intra-frame point relations can be compensated by inter-frame information communication in a local-to-global manner, which enriches the geometric representations.

Deep Feature Registration

Correspondence mapping (Wang and Solomon 2019a; Pan, Cai, and Liu 2022) plays an important role in PCR, and we argue that it is also critical for achieving RI in our design. Specifically, although F^ℓ and F^g are both rotation-invariant in theory, different point sampling methods and the sign ambiguity can leave the final feature not strictly rotation-invariant. To solve this issue, we first examine the correspondence map:

$$ m(X, Y) = \frac{\exp\left(\Phi_1(Y)\Phi_2(X)^{\top}/t\right)}{\sum_{j=1}^{N} \exp\left(\Phi_1(Y)\Phi_2(x_j)^{\top}/t\right)}, \quad (8) $$

where Φ_1 and Φ_2 are MLPs that project the latent embeddings X and Y into a shared space, and t controls the variation sensitivity. It can be seen from Eq. 8 that the mapping function m reveals feature similarities in the latent space; it is also an essential ingredient of 3D point-level contrastive learning in PointContrast (Xie et al. 2020) through the design of InfoNCE losses (Van den Oord, Li, and Vinyals 2018), which have been proven to be equivalent to maximizing mutual information. Based on this observation, we propose a registration loss L_r = L^ℓ_r + L^g_r, where L^ℓ_r and L^g_r are the registration losses of (F^ℓ, U) and (F^g, U). Mathematically, L^ℓ_r is defined as

$$ L^{\ell}_r = -\sum_{(i,j)\in\mathcal{M}} \log \frac{\exp\left(\Phi_1(U_j)\Phi_2(f^{\ell}_i)^{\top}/t\right)}{\sum_{(\cdot,k)\in\mathcal{M}} \exp\left(\Phi_1(U_k)\Phi_2(f^{\ell}_i)^{\top}/t\right)}. \quad (9) $$

The same rule is followed to compute L^g_r. Although we follow the core idea of PointContrast, we differ from it in that PointContrast defines positive samples based on feature correspondences computed at the same layer level, while our positive samples are defined across layers. The intuition for this loss design is that the 3D shape is forced to learn about its local region, as that region has to be distinguished from parts of other objects. Moreover, we would like to maximize the mutual information between different poses of the 3D shape, as features encoded from different poses should represent the same object, which is very useful in achieving RI in SO(3). In addition, the mutual information between F^ℓ and F^g is implicitly maximized, such that shared semantic information about geometric structures can be learned, leading to a more geometrically accurate and discriminative representation. More details about L^ℓ_r can be found in the supplementary material.
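A minimal sketch of the registration loss of Eqs. 8–9 is shown below (PyTorch), assuming the correspondence set M pairs features that share the same point index; Φ_1, Φ_2, and the temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def registration_loss(u: torch.Tensor, f_scale: torch.Tensor,
                      phi1: torch.nn.Module, phi2: torch.nn.Module,
                      t: float = 0.07) -> torch.Tensor:
    """InfoNCE-style sketch of Eq. 9 for one scale, e.g. L^l_r over (F^l, U).

    u: (N, C) integrated features; f_scale: (N, C) rotation-invariant shape code.
    Positives are assumed to be identity correspondences (same point index).
    """
    z_u = phi1(u)                      # Phi_1(U)
    z_f = phi2(f_scale)                # Phi_2(F^l) or Phi_2(F^g)
    logits = (z_f @ z_u.T) / t         # row i: similarities of f_i to every U_k (Eq. 8)
    targets = torch.arange(u.shape[0], device=u.device)
    return F.cross_entropy(logits, targets)

# Full objective: L_r = registration_loss(U, F_l, phi1, phi2) + registration_loss(U, F_g, phi1, phi2)
```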
Experiments

We evaluate our model on 3D shape classification, part segmentation, and retrieval tasks under rotations, and extensive experiments are conducted to analyze the network design. Detailed model architectures for the three tasks are shown in the supplementary material. We follow (Esteves et al. 2018) for evaluation: training and testing the network with rotations about the z-axis only (z/z); training with z-axis rotations and testing under arbitrary rotations (z/SO(3)); and training and testing under arbitrary rotations (SO(3)/SO(3)).

3D Object Classification

Synthetic Dataset. We first examine the model performance on the synthetic ModelNet40 (Wu et al. 2015) dataset. We sample 1024 points from each object with only xyz coordinates as input features. Hyper-parameters for training follow (Guo et al. 2021), except that points are downsampled in the order of (1024, 512, 128) with feature dimensions of (3, 128, 256). We report and compare our model performance with state-of-the-art (SoTA) methods in Table 1. Both rotation-sensitive and rotation-robust methods achieve great performance under z/z. However, the former cannot generalize well to unseen rotations. Rotation-robust methods like SFCNN (Rao, Lu, and Zhou 2019) achieve competitive results under z/z, but their performance is not consistent on z/SO(3) and SO(3)/SO(3) due to the imperfect projection from points to voxels when using spherical solutions. We outperform the recently proposed methods (Luo et al. 2022; Xu et al. 2021; Deng et al. 2021) and achieve an accuracy of 91.0%, proving the superiority of our framework on classification.

| Method | z/z | z/SO(3) | SO(3)/SO(3) |
|---|---|---|---|
| Rotation sensitive: | | | |
| PointNet (Qi et al. 2017a) | 89.2 | 16.2 | 75.5 |
| PointNet++ (Qi et al. 2017b) | 89.3 | 28.6 | 85.0 |
| PCT (Guo et al. 2021) | 90.3 | 37.2 | 88.5 |
| Rotation robust: | | | |
| SFCNN (Rao, Lu, and Zhou 2019) | 91.4 | 84.8 | 90.1 |
| RIConv (Zhang et al. 2019) | 86.5 | 86.4 | 86.4 |
| ClusterNet (Chen et al. 2019) | 87.1 | 87.1 | 87.1 |
| PR-InvNet (Yu et al. 2020) | 89.2 | 89.2 | 89.2 |
| RI-GCN (Kim, Park, and Han 2020) | 89.5 | 89.5 | 89.5 |
| GCAConv (Zhang et al. 2020) | 89.0 | 89.1 | 89.2 |
| RI-Framework (Li et al. 2021b) | 89.4 | 89.4 | 89.3 |
| VN-DGCNN (Deng et al. 2021) | 89.5 | 89.5 | 90.2 |
| SGMNet (Xu et al. 2021) | 90.0 | 90.0 | 90.0 |
| Li et al. (2021a) | 90.2 | 90.2 | 90.2 |
| OrientedMP (Luo et al. 2022) | 88.4 | 88.4 | 88.9 |
| ELGANet (Gu et al. 2022) | 90.3 | 90.3 | 90.3 |
| Ours | 91.0 | 91.0 | 91.0 |

Table 1: Classification results on ModelNet40. All methods take raw points of 1024 × 3 as inputs.

Real Dataset. Experiments are also conducted on a real-scanned dataset. ScanObjectNN (Uy et al. 2019) is a commonly used benchmark for exploring robustness to noisy and deformed 3D objects with non-uniform surface density, and it includes 2,902 incomplete point clouds in 15 classes. We use the OBJ_BG subset with background noise and sample 1,024 points under z/SO(3) and SO(3)/SO(3). Table 2 shows that our model achieves the highest results with excellent consistency under random rotations.

| Method | z/SO(3) | SO(3)/SO(3) |
|---|---|---|
| PointNet (Qi et al. 2017a) | 16.7 | 54.7 |
| PointNet++ (Qi et al. 2017b) | 15.0 | 47.4 |
| PCT (Guo et al. 2021) | 28.5 | 45.8 |
| RIConv (Zhang et al. 2019) | 78.4 | 78.1 |
| RI-GCN (Kim, Park, and Han 2020) | 80.5 | 80.6 |
| GCAConv (Zhang et al. 2020) | 80.1 | 80.3 |
| RI-Framework (Li et al. 2021b) | 79.8 | 79.9 |
| LGR-Net (Zhao et al. 2022) | 81.2 | 81.4 |
| VN-DGCNN (Deng et al. 2021) | 79.8 | 80.3 |
| OrientedMP (Luo et al. 2022) | 76.7 | 77.2 |
| Ours | 86.6 | 86.3 |

Table 2: Classification results on ScanObjectNN OBJ_BG.
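For clarity, the rotation protocols used above (z/z, z/SO(3), SO(3)/SO(3)) can be implemented with a small helper like the following; the uniform SO(3) sampling via SciPy and the choice of z as the vertical axis are standard recipes and our assumptions, not the paper's stated implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotate_z(points: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """'z' protocol: random rotation about the (assumed) vertical z-axis only."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    rot = Rotation.from_euler("z", theta).as_matrix()
    return points @ rot.T

def rotate_so3(points: np.ndarray) -> np.ndarray:
    """'SO(3)' protocol: uniformly sampled arbitrary 3D rotation."""
    rot = Rotation.random().as_matrix()
    return points @ rot.T
```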
3D Part Segmentation

Shape part segmentation is a more challenging task than object classification. We use ShapeNetPart (Yi et al. 2016) for evaluation, sampling 2048 points with xyz coordinates as model inputs. The training strategy is the same as for the classification task, except that the number of training epochs is 300. Part-averaged IoU (mIoU) is reported in Table 3, and detailed per-class mIoU values are shown in the supplementary material. Representative methods such as PointNet++ and PCT are vulnerable to rotations. Rotation-robust methods present competitive results under z/SO(3), where we achieve the second-best result of 80.3%. We give a more detailed comparison between VN-DGCNN (Deng et al. 2021) and our work in the supplementary material, where our method performs better than VN-DGCNN for several classes. Moreover, the qualitative results in Fig. 3 show that we achieve visually better results than VN-DGCNN on certain classes such as the airplane and car. More qualitative results are shown in the supplementary material.

Figure 3: Segmentation comparisons on ShapeNetPart, where ground truth (GT) samples are shown for reference. Red dotted circles indicate obvious failures on certain classes, and purple circles denote the slight difference between our design and VN-DGCNN.

| Method | z/SO(3) | SO(3)/SO(3) |
|---|---|---|
| PointNet (Qi et al. 2017a) | 38.0 | 62.3 |
| PointNet++ (Qi et al. 2017b) | 48.3 | 76.7 |
| PCT (Guo et al. 2021) | 38.5 | 75.2 |
| RIConv (Zhang et al. 2019) | 75.3 | 75.5 |
| RI-GCN (Kim, Park, and Han 2020) | 77.2 | 77.3 |
| RI-Framework (Li et al. 2021b) | 79.2 | 79.4 |
| LGR-Net (Zhao et al. 2022) | 80.0 | 80.1 |
| VN-DGCNN (Deng et al. 2021) | 81.4 | 81.4 |
| OrientedMP (Luo et al. 2022) | 80.1 | 80.9 |
| Ours | 80.3 | 80.4 |

Table 3: Segmentation results on ShapeNetPart. The second best results are underlined.

3D Shape Retrieval

We further conduct 3D shape retrieval experiments on ShapeNetCore55 (Chang et al. 2015), which contains two categories of data: normal and perturbed. We only use the perturbed part to validate our model performance under rotations. We combine the training and validation sets and evaluate our method on the testing set, following the training policy of (Esteves et al. 2018). Experimental results are reported in Table 4, where the final score is the average of the micro and macro mean average precision (mAP), as in (Savva et al. 2017). Similar to the classification task, our method achieves SoTA performance.

| Method | micro mAP | macro mAP | Score |
|---|---|---|---|
| Spherical CNN (Esteves et al. 2018) | 0.685 | 0.444 | 0.565 |
| SFCNN (Rao, Lu, and Zhou 2019) | 0.705 | 0.483 | 0.594 |
| GCAConv (Zhang et al. 2020) | 0.708 | 0.490 | 0.599 |
| RI-Framework (Li et al. 2021b) | 0.707 | 0.510 | 0.609 |
| Ours | 0.715 | 0.510 | 0.613 |

Table 4: Comparisons of SoTA methods on the 3D shape retrieval task.

Ablation Study

Effectiveness of Transformer Designs. We examine the effectiveness of our transformer design by conducting classification experiments under z/SO(3). We first ablate one or both of the angular embeddings and report the results in Table 5 (models A, B, and C). Model B performs better than model C by 0.4%, which validates our design of feature integration where M_i^ℓ is used as the main source of information. When both angular embeddings are applied, the best result is achieved (i.e., 91.0%).
Moreover, we validate our discussion in the Aligned Feature Integration section by comparing models D and E. Model D demonstrates that when F^g is encoded in the same way as F^ℓ, the model performance decreases, which indicates that encoding F^g via self-attention increases model overfitting. More analyses can be found in the supplementary material. Finally, we examine the effectiveness of our attention logits-based integration scheme by comparing our model with the conventional method (model E), which applies self- and cross-attention sequentially and repeatedly. Our result is better than model E by 0.6%, indicating that our design is more effective.

| Model | Acc. |
|---|---|
| A | 90.0 |
| B | 90.6 |
| C | 90.2 |
| D | 90.2 |
| E | 90.4 |
| F | 90.0 |
| G | 90.2 |
| H | 90.6 |
| Ours | 91.0 |

Table 5: Module analysis of AIT and the loss functions, ablating e^α_sa, e^α_ca, F^g, A_sa + A_ca, L^ℓ_r, and L^g_r, where F^g means encoding F^g via Intra-frame Aligned Self-attention.

Registration Loss. We sequentially ablate L^g_r and L^ℓ_r (models F, G, and H) to check the effectiveness of our registration loss design. Results in Table 5 demonstrate that we can still achieve a satisfactory result of 90.0% without feature registration. Individual application of L^g_r and L^ℓ_r shows improvements when forcing the final representation to be close to the rotation-invariant features. Moreover, model H performs better than model G, which indicates that intermediate features learned at the global scale are important for shape classification. The best model performance is achieved by applying both losses.

Noise Robustness. In real-world applications, raw point clouds contain noisy signals. We conduct experiments to evaluate the model robustness to noise under z/SO(3). Two experiments are conducted: (1) we sample and add Gaussian noise with zero mean and varying standard deviations, N(0, σ²), to the input data; (2) we add outliers sampled from a unit sphere to each object. As shown in Fig. 4 (left), we achieve results on par with RI-Framework when the standard deviation is low, while we perform better as it increases, indicating that our model is robust against high levels of noise. Besides, as the number of noisy points increases, most methods are heavily affected, while we can still achieve good results.

Figure 4: Left: results with Gaussian noise of zero mean and varying standard deviation. Right: results with different numbers of noisy points. Compared methods: RIConv, RI-GCN, RI-Framework, SRINet, and ours.

Visualization of Rotation Invariance. We further examine the RI of the learned features. Specifically, we use Grad-CAM (Selvaraju et al. 2017) to check how the model pays attention to different parts of data samples under different rotations. Results are reported in Fig. 5, with the correspondence between gradients and colors shown on the right. RI-GCN presents a good result, but its behavior is not consistent over some classes (e.g., vase and plant) and it does not attend to regions that are critical for classification (see toilet), showing inferior performance to ours. PointNet++ shows no resistance to rotations, while our method exhibits a consistent gradient distribution over different parts under random rotations, indicating that our network is not affected by rotations.

Figure 5: Network attention of PointNet++, RI-GCN, and our model (classes shown: airplane, guitar, vase, plant, and toilet).
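The two noise perturbations used in the robustness study above can be reproduced with a short helper like the following sketch; whether outliers are drawn on or inside the unit sphere is our assumption, as the paper only states that they are sampled from a unit sphere.

```python
import numpy as np

def add_gaussian_noise(points: np.ndarray, std: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Experiment (1): add zero-mean Gaussian jitter N(0, std^2) per coordinate."""
    return points + rng.normal(0.0, std, size=points.shape)

def add_outliers(points: np.ndarray, num_outliers: int,
                 rng: np.random.Generator) -> np.ndarray:
    """Experiment (2): append outliers sampled uniformly inside the unit sphere."""
    direction = rng.normal(size=(num_outliers, 3))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    radius = rng.uniform(0.0, 1.0, size=(num_outliers, 1)) ** (1.0 / 3.0)
    return np.concatenate([points, direction * radius], axis=0)
```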
Conclusion

In this work, we rethink and investigate the close relation between rotation invariance and point cloud registration, based on which we propose a PCR-cored learning framework with three stages. With a pair of rotation-invariant shape descriptors constructed from local and global scales, we propose a comprehensive feature learning and integration module, Aligned Integration Transformer, to effectively align and integrate the shape codes via self- and cross-attention. To further preserve rotation invariance in the final feature representation, a registration loss is proposed to align it with intermediate features, from which shared semantic knowledge of geometric parts is also extracted. Extensive experiments demonstrate the superiority and robustness of our design. In future work, we will examine efficient methods for invariance learning on large-scale point clouds.

References

Afham, M.; Dissanayake, I.; Dissanayake, D.; Dharmasiri, A.; Thilakarathna, K.; and Rodrigo, R. 2022. CrossPoint: Self-supervised cross-modal contrastive learning for 3D point cloud understanding. In CVPR.
Besl, P. J.; and McKay, N. D. 1992. Method for registration of 3-D shapes. In Sensor Fusion.
Bro, R.; Acar, E.; and Kolda, T. G. 2008. Resolving the sign ambiguity in the singular value decomposition. In Journal of Chemometrics.
Chang, A. X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. 2015. ShapeNet: An information-rich 3D model repository. In arXiv:1512.03012.
Chen, C.; Li, G.; Xu, R.; Chen, T.; Wang, M.; and Lin, L. 2019. ClusterNet: Deep hierarchical cluster network with rigorously rotation-invariant representation for point cloud analysis. In CVPR.
Chen, C.-F. R.; Fan, Q.; and Panda, R. 2021. CrossViT: Cross-attention multi-scale vision transformer for image classification. In ICCV.
Chen, R.; and Cong, Y. 2022. The Devil is in the Pose: Ambiguity-free 3D Rotation-invariant Learning via Pose-aware Convolution. In CVPR.
Cheng, R.; Razani, R.; Taghavi, E.; Li, E.; and Liu, B. 2021. (AF)2-S3Net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. In CVPR.
Cohen, T. S.; Geiger, M.; Köhler, J.; and Welling, M. 2018. Spherical CNNs. In ICLR.
Deng, C.; Litany, O.; Duan, Y.; Poulenard, A.; Tagliasacchi, A.; and Guibas, L. J. 2021. Vector neurons: A general framework for SO(3)-equivariant networks. In ICCV.
Deng, H.; Birdal, T.; and Ilic, S. 2018. PPFNet: Global context aware local features for robust 3D point matching. In CVPR.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR.
Drost, B.; Ulrich, M.; Navab, N.; and Ilic, S. 2010. Model globally, match locally: Efficient and robust 3D object recognition. In CVPR.
Esteves, C.; Allen-Blanchette, C.; Makadia, A.; and Daniilidis, K. 2018. Learning SO(3) equivariant representations with spherical CNNs. In ECCV.
Gu, R.; Wu, Q.; Li, Y.; Kang, W.; Ng, W.; and Wang, Z. 2022. Enhanced local and global learning for rotation-invariant point cloud representation. In MultiMedia.
Guo, M.-H.; Cai, J.-X.; Liu, Z.-N.; Mu, T.-J.; Martin, R. R.; and Hu, S.-M. 2021. PCT: Point cloud transformer. In CVM.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR.
Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. In NeurIPS.
Jing, B.; Eismann, S.; Suriana, P.; Townshend, R. J.; and Dror, R. 2020. Learning from protein structure with geometric vector perceptrons. In ICLR.
Kim, S.; Park, J.; and Han, B. 2020. Rotation-Invariant Local-to-Global Representation Learning for 3D Point Cloud. In NeurIPS.
Li, F.; Fujiwara, K.; Okura, F.; and Matsushita, Y. 2021a. A Closer Look at Rotation-Invariant Deep Point Cloud Analysis. In ICCV.
Li, X.; Li, R.; Chen, G.; Fu, C.-W.; Cohen-Or, D.; and Heng, P.-A. 2021b. A rotation-invariant framework for deep point cloud analysis. In TVCG.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV.
Luo, S.; Li, J.; Guan, J.; Su, Y.; Cheng, C.; Peng, J.; and Ma, J. 2022. Equivariant Point Cloud Analysis via Learning Orientations for Message Passing. In CVPR.
Pan, L.; Cai, Z.; and Liu, Z. 2022. Robust Partial-to-Partial Point Cloud Registration in a Full Range. In IJCV.
Qi, C. R.; Su, H.; Mo, K.; and Guibas, L. J. 2017a. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR.
Qi, C. R.; Yi, L.; Su, H.; and Guibas, L. J. 2017b. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS.
Qin, Z.; Yu, H.; Wang, C.; Guo, Y.; Peng, Y.; and Xu, K. 2022. Geometric Transformer for Fast and Robust Point Cloud Registration. In CVPR.
Rao, Y.; Lu, J.; and Zhou, J. 2019. Spherical fractal convolutional neural networks for point cloud recognition. In CVPR.
Sanghi, A. 2020. Info3D: Representation learning on 3D objects using mutual information maximization and contrastive learning. In ECCV.
Savva, M.; Yu, F.; Su, H.; Kanezaki, A.; Furuya, T.; Ohbuchi, R.; Zhou, Z.; Yu, R.; Bai, S.; Bai, X.; et al. 2017. Large-scale 3D shape retrieval from ShapeNetCore55: SHREC'17 track. In Workshop on 3DOR.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV.
Spezialetti, R.; Stella, F.; Marcon, M.; Silva, L.; Salti, S.; and Di Stefano, L. 2020. Learning to orient surfaces by self-supervised spherical CNNs. In NeurIPS.
Sun, W.; Tagliasacchi, A.; Deng, B.; Sabour, S.; Yazdani, S.; Hinton, G. E.; and Yi, K. M. 2021. Canonical Capsules: Self-Supervised Capsules in Canonical Pose. In NeurIPS.
Thomas, H. 2020. Rotation-Invariant Point Convolution With Multiple Equivariant Alignments. In 3DV.
Tombari, F.; Salti, S.; and Stefano, L. D. 2010. Unique signatures of histograms for local surface description. In ECCV.
Uy, M. A.; Pham, Q.-H.; Hua, B.-S.; Nguyen, D. T.; and Yeung, S.-K. 2019. Revisiting Point Cloud Classification: A New Benchmark Dataset and Classification Model on Real World Data. In ICCV.
Van den Oord, A.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. In arXiv:1807.03748.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NeurIPS.
Wang, Y.; and Solomon, J. M. 2019a. Deep closest point: Learning representations for point cloud registration. In ICCV.
Wang, Y.; and Solomon, J. M. 2019b. PRNet: Self-supervised learning for partial-to-partial registration. In NeurIPS.
Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; and Xiao, J. 2015.
3D ShapeNets: A deep representation for volumetric shapes. In CVPR.
Xie, S.; Gu, J.; Guo, D.; Qi, C. R.; Guibas, L.; and Litany, O. 2020. PointContrast: Unsupervised pre-training for 3D point cloud understanding. In ECCV.
Xu, J.; Tang, X.; Zhu, Y.; Sun, J.; and Pu, S. 2021. SGMNet: Learning Rotation-Invariant Point Cloud Representations via Sorted Gram Matrix. In ICCV.
Yew, Z. J.; and Lee, G. H. 2020. RPM-Net: Robust point matching using learned features. In CVPR.
Yi, L.; Kim, V. G.; Ceylan, D.; Shen, I.-C.; Yan, M.; Su, H.; Lu, C.; Huang, Q.; Sheffer, A.; and Guibas, L. 2016. A scalable active framework for region annotation in 3D shape collections. In ACM ToG.
Yu, H.; Li, F.; Saleh, M.; Busam, B.; and Ilic, S. 2021a. CoFiNet: Reliable Coarse-to-fine Correspondences for Robust Point Cloud Registration. In NeurIPS.
Yu, J.; Zhang, C.; Wang, H.; Zhang, D.; Song, Y.; Xiang, T.; Liu, D.; and Cai, W. 2021b. 3D Medical Point Transformer: Introducing Convolution to Attention Networks for Medical Point Cloud Analysis. In arXiv:2112.04863.
Yu, R.; Wei, X.; Tombari, F.; and Sun, J. 2020. Deep Positional and Relational Feature Learning for Rotation-Invariant Point Cloud Analysis. In ECCV.
Yuan, W.; Eckart, B.; Kim, K.; Jampani, V.; Fox, D.; and Kautz, J. 2020. DeepGMR: Learning latent Gaussian mixture models for registration. In ECCV.
Zhang, C.; Yu, J.; Song, Y.; and Cai, W. 2021. Exploiting Edge-Oriented Reasoning for 3D Point-Based Scene Graph Analysis. In CVPR.
Zhang, Z.; Hua, B.-S.; Chen, W.; Tian, Y.; and Yeung, S.-K. 2020. Global context aware convolutions for 3D point cloud understanding. In 3DV.
Zhang, Z.; Hua, B.-S.; Rosen, D. W.; and Yeung, S.-K. 2019. Rotation invariant convolutions for 3D point clouds deep learning. In 3DV.
Zhao, C.; Yang, J.; Xiong, X.; Zhu, A.; Cao, Z.; and Li, X. 2022. Rotation invariant point cloud analysis: Where local geometry meets global topology. In Pattern Recognition.
Zhao, H.; Jiang, L.; Jia, J.; Torr, P. H.; and Koltun, V. 2021. Point transformer. In ICCV.