Segment Any 3D Gaussians Jiazhong Cen1, Jiemin Fang2, Chen Yang1, Lingxi Xie2, Xiaopeng Zhang2, Wei Shen1*, Qi Tian2 1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University 2Huawei Technologies Co., Ltd. {jiazhongcen, ycyangchen, wei.shen}@sjtu.edu.cn, {jaminfong, 198808xc, zxphistory}@gmail.com, tian.qi1@huawei.com This paper presents SAGA (Segment Any 3D GAussians), a highly efficient 3D promptable segmentation method based on 3D Gaussian Splatting (3D-GS). Given 2D visual prompts as input, SAGA can segment the corresponding 3D target represented by 3D Gaussians within 4 ms. This is achieved by attaching a scale-gated affinity feature to each 3D Gaussian to endow it with a new property for multi-granularity segmentation. Specifically, a scale-aware contrastive training strategy is proposed for learning the scale-gated affinity features. It 1) distills the segmentation capability of the Segment Anything Model (SAM) from 2D masks into the affinity features and 2) employs a soft scale-gate mechanism to deal with multi-granularity ambiguity in 3D segmentation by adjusting the magnitude of each feature channel according to a specified 3D physical scale. Evaluations demonstrate that SAGA achieves real-time multi-granularity segmentation with quality comparable to state-of-the-art methods. As one of the first methods addressing promptable segmentation in 3D-GS, the simplicity and effectiveness of SAGA pave the way for future advancements in this field. Code: https://github.com/Jumpat/SegAnyGAussians 1 Introduction Promptable segmentation has attracted increasing attention and has seen significant advancements, particularly with the development of 2D segmentation foundation models such as the Segment Anything Model (SAM) (Kirillov et al. 2023). However, 3D promptable segmentation remains relatively unexplored due to the scarcity of 3D data and the high cost of annotation.
To address these challenges, many studies (Cen et al. 2023; Chen et al. 2023; Ying et al. 2024; Kim et al. 2024; Fan et al. 2023; Liu et al. 2024) have proposed to extend SAM's 2D segmentation capabilities to 3D using radiance fields, achieving notable success. In this paper, we focus on promptable segmentation in 3D Gaussian Splatting (3D-GS) (Kerbl et al. 2023), which represents a significant milestone in radiance fields research due to its superior rendering quality and efficiency compared to its predecessors. We highlight that, in contrast to previous radiance fields, the explicit 3D Gaussian structure is an ideal carrier for 3D segmentation, as segmentation capabilities can be integrated into 3D-GS as an intrinsic attribute, without necessitating an additional bulky segmentation module. Accordingly, we propose SAGA (Segment Any 3D GAussians), a 3D promptable segmentation method that integrates the segmentation capabilities of SAM into 3D-GS seamlessly. SAGA takes 2D visual prompts as input and outputs the corresponding 3D target represented by 3D Gaussians. Achieving this involves two primary challenges. First, SAGA must endow each 3D Gaussian with 3D segmentation ability in an efficient way, so that the high efficiency of 3D-GS can be preserved. Second, as a robust promptable segmentation method, SAGA must effectively address multi-granularity ambiguity, where a single 3D Gaussian may belong to different parts or objects at varying levels of granularity. SAGA introduces two corresponding solutions. First, SAGA attaches an affinity feature to each 3D Gaussian in a scene to endow it with a new property for segmentation. The similarity between two affinity features indicates whether the corresponding 3D Gaussians belong to the same 3D target. (*Corresponding author. Copyright 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.)
Second, inspired by GARField (Kim et al. 2024), SAGA employs a soft scale-gate mechanism to handle multi-granularity ambiguity. Depending on a specified 3D physical scale, the scale gate adjusts the magnitude of each feature channel. This mechanism maps the Gaussian affinity features into different sub-spaces for various scales, thereby preserving the multi-granularity information while mitigating the distraction in feature learning brought by multi-granularity ambiguity. To realize the two solutions, SAGA proposes a scale-aware contrastive training strategy, which distills the segmentation capability of SAM from 2D masks into scale-gated affinity features. This strategy determines the correlations between pairs of pixels within an image based on 3D scales. These correlations are then used to supervise the rendered affinity features through a correspondence distillation loss. The correlation information is transmitted to the Gaussian affinity features via backpropagation facilitated by the differentiable rasterization algorithm. After training, SAGA achieves precise real-time multi-granularity segmentation. The Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25) Figure 1: SAGA performs promptable multi-granularity segmentation within milliseconds. Prompts are marked by points. 2 Related Work 2D Promptable Segmentation The task of 2D promptable segmentation was proposed by Kirillov et al. (2023); it aims to return segmentation masks given input prompts that specify the segmentation target in an image. To address this problem, they introduce the Segment Anything Model (SAM), a groundbreaking segmentation foundation model. A similar model to SAM is SEEM (Zou et al. 2023), which also achieves competitive performance. Prior to these models, the most closely related task to promptable 2D segmentation is interactive image segmentation (Boykov and Jolly 2001; Grady 2006; Gulshan et al. 2010; Rother, Kolmogorov, and Blake 2004; Chen et al.
2022b; Sofiiuk, Petrov, and Konushin 2022; Liu et al. 2023b). Inspired by the success of SAM, many studies (Yang et al. 2023; Xu et al. 2023; Guo et al. 2024; Yin et al. 2024) proposed to use SAM for 3D segmentation. Different from them, we focus on lifting the ability of SAM to 3D via 3D-GS. 3D Segmentation in Radiance Fields With the success of radiance fields (Mildenhall et al. 2020; Sun, Sun, and Chen 2022; Chen et al. 2022a; Barron et al. 2022; Müller et al. 2022; Hedman et al. 2024; Fridovich-Keil et al. 2022; Wizadwongsa et al. 2021; Lindell, Martel, and Wetzstein 2021; Fang et al. 2022), numerous studies have explored 3D segmentation within them. Zhi et al. (2021) proposed Semantic-NeRF, demonstrating the potential of Neural Radiance Fields (NeRF) in semantic propagation and refinement. NVOS (Ren et al. 2022) introduced an interactive approach to select 3D objects from NeRF by training a lightweight MLP using custom-designed 3D features. By using 2D self-supervised models, approaches like N3F (Tschernezki et al. 2022), DFF (Kobayashi, Matsumoto, and Sitzmann 2022), and ISRF (Goel et al. 2023) aim to elevate 2D visual features to 3D by training additional feature fields that can output 2D feature maps, imitating the original 2D features from different views. NeRF-SOS (Fan et al. 2023) and Contrastive Lift (Bhalgat et al. 2023) distill the 2D feature similarities into 3D features. There are also some other instance segmentation and semantic segmentation approaches (Stelzner, Kersting, and Kosiorek 2021; Niemeyer and Geiger 2021; Yu, Guibas, and Wu 2022; Liu et al. 2022, 2023c; Bing, Chen, and Yang 2023; Fu et al. 2022; Vora et al. 2022; Siddiqui et al. 2023) for radiance fields. Combined with CLIP (Radford et al. 2021), some approaches (Kerr et al. 2023; Liu et al. 2023a; Bhalgat et al. 2024; Qin et al. 2024) proposed to conduct open-vocabulary 3D segmentation in radiance fields. With the popularity of SAM, a stream of studies (Kim et al.
2024; Ying et al. 2024; Ye et al. 2024; Cen et al. 2023; Lyu et al. 2024) proposed lifting the segmentation ability of SAM to 3D with radiance fields. SA3D (Cen et al. 2023) adopts an iterative pipeline to refine the 3D mask grids with SAM. Gaussian Grouping (Ye et al. 2024) uses video tracking technology to align the inconsistent 2D masks extracted by SAM across different views and assigns labels to 3D Gaussians in a 3D-GS model with the aligned masks. OmniSeg3D (Ying et al. 2024) employs a hierarchical contrastive learning method to automatically learn segmentation from multi-view 2D masks extracted by SAM. The approach most closely related to SAGA is GARField (Kim et al. 2024), which addresses multi-granularity ambiguity in 3D segmentation using 3D physical scale, inspiring SAGA's scale-gate mechanism. However, GARField's reliance on implicit feature fields for outputting 3D features requires repeated queries for segmentation at different scales, reducing efficiency. In contrast, the scale-gate mechanism of SAGA enhances efficiency by integrating directly with 3D-GS without additional computation. 3 Method In this section, we first give a brief review of 3D Gaussian Splatting (3D-GS) (Kerbl et al. 2023) and scale-conditioned 3D features (Kerr et al. 2023; Kim et al. 2024). Then we introduce the overall pipeline of SAGA, followed by an explanation of the scale-gated Gaussian affinity features and the scale-aware contrastive learning. 3.1 Preliminary 3D Gaussian Splatting (3D-GS) Given a training dataset I of multi-view 2D images with camera poses, 3D-GS learns a set of 3D colored Gaussians G = {g_1, g_2, ..., g_N}, where N denotes the number of 3D Gaussians in the scene. The mean of a Gaussian represents its position and the covariance indicates its scale. 3D-GS introduces a novel differentiable rasterization technique for efficient training and rendering.
Given a specific camera pose, 3D-GS projects the 3D Gaussians to 2D and computes the color C(p) of a pixel p by blending a set of ordered Gaussians G_p overlapping the pixel. Let g_i^p denote the i-th Gaussian in G_p; this process is formulated as:

C(p) = \sum_{i=1}^{|G_p|} c_{g_i^p} \alpha_{g_i^p} \prod_{j=1}^{i-1} (1 - \alpha_{g_j^p}),   (1)

where c_{g_i^p} is the color of g_i^p and \alpha_{g_i^p} is given by evaluating the corresponding 2D Gaussian with covariance Σ, multiplied by a learned per-Gaussian opacity. Scale-Conditioned 3D Feature LERF (Kerr et al. 2023) first proposes the concept of a scale-conditioned feature field for learning from global image embeddings obtained from CLIP. GARField (Kim et al. 2024) then introduces it into the area of radiance field segmentation to tackle the multi-granularity ambiguity. To compute the 3D mask scale s_M of a 2D mask M, GARField projects M into 3D space with the camera intrinsic parameters and depth information predicted by a pre-trained radiance field. Let P denote the obtained point cloud and X(P), Y(P), Z(P) denote the sets of 3D coordinate components of P; the mask scale s_M is:

s_M = \sqrt{\mathrm{std}(X(P))^2 + \mathrm{std}(Y(P))^2 + \mathrm{std}(Z(P))^2},   (2)

where std(·) denotes the standard deviation of a set of scalars. Since these scales are computed in 3D space, they are generally consistent across different views. SAGA uses the 3D scales for multi-granularity segmentation but realizes it in a more efficient way. 3.2 Overall Pipeline The main components of SAGA are shown in Figure 2. Given a pre-trained 3D-GS model G, SAGA attaches a Gaussian affinity feature f_g ∈ R^D to each 3D Gaussian g in G, where D denotes the feature dimension. To handle the inherent multi-granularity ambiguity of 3D promptable segmentation, SAGA employs a soft scale-gate mechanism to project these features into different scale-gated feature subspaces for various scales s. To train the affinity features, SAGA extracts a set of multi-granularity masks M_I = {M_I^i ∈ {0, 1}^{H×W} | i = 1, ..., N_I} for each image I in the training set I with SAM.
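The two preliminaries above, alpha compositing (Equation (1)) and the 3D mask scale (Equation (2)), can be sketched numerically. The following is an illustrative NumPy mock-up under our own naming, not the CUDA rasterizer used by 3D-GS:

```python
import numpy as np

def alpha_blend(values, alphas):
    """Eq. (1): front-to-back compositing of ordered Gaussians over a pixel.
    values: [N, C] per-Gaussian colors (or features); alphas: [N]."""
    # Transmittance before each Gaussian: prod_{j<i} (1 - alpha_j)
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    return (values * (alphas * trans)[:, None]).sum(axis=0)

def mask_scale(points):
    """Eq. (2): 3D scale of a mask, from the per-axis standard deviation
    of its back-projected point cloud. points: [N, 3]."""
    return float(np.sqrt((points.std(axis=0) ** 2).sum()))
```

For instance, blending a fully opaque green Gaussian behind a half-transparent red one (alphas 0.5 and 1.0) yields the color [0.5, 0.5, 0].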
H and W denote the height and width of I, respectively. N_I is the number of extracted masks. For each mask M_I^i, its 3D physical scale s_{M_I^i} is calculated using the depth predicted by G with the camera pose, as shown in Equation (2). Subsequently, SAGA employs a scale-aware contrastive learning strategy (Section 3.4) to distill the multi-granularity segmentation ability embedded in the multi-view 2D masks into the scale-gated affinity features. After training, at given scales, the affinity feature similarities between two Gaussians indicate whether they belong to the same 3D target. During inference (Section 3.5), given a specific viewpoint, SAGA converts 2D visual prompts (points with scales) into corresponding scale-gated 3D query features to segment the 3D target by evaluating feature similarities with the 3D affinity features. Additionally, with well-trained affinity features, 3D scene decomposition is achievable through simple clustering. Moreover, by integrating with CLIP, SAGA can perform open-vocabulary segmentation (see Section A.3 of the supplement) without requiring language fields. 3.3 Gaussian Affinity Feature At the core of SAGA are the Gaussian affinity features F = {f_g | g ∈ G}, which are learned from the multi-view 2D masks extracted by SAM. To tackle the inherent multi-granularity ambiguity in promptable segmentation, inspired by GARField (Kim et al. 2024), we introduce a scale-gate mechanism to split the feature space into different subspaces for various 3D physical scales. Then, a 3D Gaussian can belong to different segmentation targets at different granularities without conflict. Scale-Gated Affinity Features Given a Gaussian affinity feature f_g and a specific scale s, SAGA adopts a scale gate to adapt the magnitude of different feature channels accordingly. The scale gate is defined as a mapping S : [0, 1] → [0, 1]^D, which projects a scale scalar s ∈ [0, 1] to its corresponding soft gate vector S(s).
To maximize segmentation efficiency, the scale gate adopts an extremely streamlined design: a single linear layer followed by a sigmoid function. At scale s, the scale-gated affinity feature is:

f_g^s = S(s) ⊙ f_g,   (3)

where ⊙ denotes the Hadamard product. Thanks to the simplicity of the scale-gate mechanism, the time overhead caused by scale changes is negligible. Since all Gaussian affinity features share a common scale gate at scale s, during training we can first render the affinity features to 2D and then apply the scale gate to the 2D rendered features, i.e.,

F(p) = \sum_{i=1}^{|G_p|} f_{g_i^p} \alpha_{g_i^p} \prod_{j=1}^{i-1} (1 - \alpha_{g_j^p}),   (4)

F^s(p) = S(s) ⊙ F(p).   (5)

During inference, the scale gate is directly applied to the 3D Gaussian affinity features for conducting 3D segmentation. Figure 2: The architecture of SAGA. Left: SAGA attaches a Gaussian affinity feature to each 3D Gaussian. The magnitudes of different affinity feature channels are adjusted by a soft scale gate to handle multi-granularity ambiguity. Right: SAGA distills the segmentation ability of SAM into affinity features attached to 3D Gaussians in the 3D-GS model through scale-aware contrastive learning. Local Feature Smoothing In practice, we find that there are many noisy Gaussians in 3D space that exhibit unexpectedly high feature similarities with the segmentation target. This may occur for various reasons, such as insufficient training due to small weights in rasterization or incorrect geometry learned by 3D-GS. To tackle this problem, we adopt the spatial locality prior of 3D Gaussians. During training, SAGA replaces the original feature f_g of a Gaussian g with its smoothed affinity feature, i.e., f_g ← (1/K) \sum_{g' ∈ KNN(g)} f_{g'}, where KNN(g) denotes the K-nearest neighbors of g. After training, the affinity feature of each 3D Gaussian is saved as its smoothed feature. 3.4 Scale-Aware Contrastive Learning As introduced in Section 3.3, each 3D Gaussian g is assigned an affinity feature f_g.
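The scale gate (Equation (3)) and the local feature smoothing described above admit a compact sketch. The following NumPy mock-up stands in for the learned linear layer with explicit weights W and b (names and shapes are our own, for illustration only):

```python
import numpy as np

def scale_gate(s, W, b):
    """Eq. (3): soft gate S(s) = sigmoid(W*s + b) in [0, 1]^D,
    a single linear layer (W, b of shape [D]) plus a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(W * s + b)))

def gated_features(s, W, b, feats):
    """Eq. (3)/(5): Hadamard product of the gate with features [N, D]."""
    return scale_gate(s, W, b) * feats

def smooth_features(feats, knn_idx):
    """Local feature smoothing: replace each Gaussian's feature by the
    mean over its K nearest neighbors. knn_idx: [N, K] neighbor indices."""
    return feats[knn_idx].mean(axis=1)
```

Because the gate is shared by all Gaussians, switching scale costs one tiny forward pass plus an element-wise product, which is why scale changes are nearly free at inference time.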
To train these features, we employ a scale-aware contrastive learning strategy to distill the pixel-wise correlation information from 2D masks into the 3D Gaussians via differentiable rasterization. Scale-Aware Pixel Identity Vector To conduct scale-aware contrastive learning, for an image I, we first convert the automatically extracted 2D masks M_I into a scale-aware supervision signal. For this purpose, we assign a scale-aware pixel identity vector V(s, p) ∈ {0, 1}^{N_I} to each pixel p in I. The identity vectors reflect the 2D masks that a pixel belongs to at specific scales. If two pixels p_1, p_2 share at least one mask at a given scale (i.e., V(s, p_1)ᵀV(s, p_2) > 0), they should have similar features at scale s. To obtain V(s, p), we first sort the mask set M_I in descending order according to their mask scales and get an ordered mask list O_I = (M_I^{(1)}, ..., M_I^{(N_I)}), where s_{M_I^{(1)}} > ... > s_{M_I^{(N_I)}}. Then, for a pixel p, when s_{M_I^{(i)}} < s, the i-th entry of V(s, p) is set to M_I^{(i)}(p). When s_{M_I^{(i)}} ≥ s, the i-th entry of V(s, p) equals 1 only if M_I^{(i)}(p) = 1 and all smaller masks in {M_I^{(j)} | s ≤ s_{M_I^{(j)}} < s_{M_I^{(i)}}} equal 0 at pixel p. Formally, we have:

V_i(s, p) = M_I^{(i)}(p) if s_{M_I^{(i)}} < s or C(p) holds, and 0 otherwise,   (6)

where C(p) denotes the condition: ∀ M ∈ {M_I^{(j)} | s ≤ s_{M_I^{(j)}} < s_{M_I^{(i)}}}, M(p) = 0. This assignment of pixel identity vectors is based on the fact that if a pixel belongs to a specific mask at a given scale, it will continue to belong to that mask at larger scales. Loss Function We adapt the correspondence distillation loss (Hamilton et al. 2022) for training the scale-gated Gaussian affinity features. Concretely, for two pixels p_1, p_2 at a given scale s, their mask correspondence is given by:

Corr_m(s, p_1, p_2) = 1(V(s, p_1)ᵀV(s, p_2) > 0),   (7)

where 1(·) is the indicator function, which equals 1 when its condition holds. The feature correspondence between two pixels is defined as the cosine similarity between their scale-gated features: Corr_f(s, p_1, p_2) = ⟨F^s(p_1), F^s(p_2)⟩.
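The identity vector of Equation (6) and the mask correspondence of Equation (7) can be prototyped directly from their definitions. This sketch assumes the masks are already sorted by descending 3D scale; the function names and the brute-force loop are our own simplifications:

```python
import numpy as np

def identity_vector(s, masks, scales, p):
    """Eq. (6): scale-aware identity vector of pixel p at query scale s.
    masks: binary [H, W] arrays sorted by descending scale; scales: the
    matching mask scales; p: (row, col)."""
    n = len(masks)
    v = np.zeros(n, dtype=np.int64)
    for i in range(n):
        if scales[i] < s:
            v[i] = masks[i][p]
        else:
            # Active only if no mask with scale in [s, scales[i]) covers p
            blocked = any(masks[j][p] == 1 for j in range(n)
                          if s <= scales[j] < scales[i])
            v[i] = int(masks[i][p] == 1 and not blocked)
    return v

def corr_m(s, masks, scales, p1, p2):
    """Eq. (7): 1 iff the two pixels share at least one mask at scale s."""
    v1 = identity_vector(s, masks, scales, p1)
    v2 = identity_vector(s, masks, scales, p2)
    return int(v1 @ v2 > 0)
```

For a large mask covering two pixels and a small mask covering only the first, the pair is positive at a coarse query scale but negative at a fine one, which is exactly the scale-dependent supervision the contrastive loss consumes.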
(8)

The correspondence distillation loss L_corr(s, p_1, p_2) between two pixels is given by¹:

L_corr(s, p_1, p_2) = (1 − 2 Corr_m(s, p_1, p_2)) · max(Corr_f(s, p_1, p_2), 0).   (9)

¹The feature correspondence is clipped at 0 to stabilize training. Please refer to Hamilton et al. (2022) for more details.

Feature Norm Regularization During training, the 2D features are obtained by rendering the 3D affinity features, which introduces a misalignment between 2D and 3D features. As revealed in Equation (4), a 2D feature is a linear combination of multiple 3D features, each with a distinct direction. In such situations, SAGA may show good segmentation ability on the rendered feature map but perform poorly in 3D space. This motivates us to introduce a feature norm regularization. Concretely, when rendering the 2D feature map, the 3D features are first normalized to unit vectors, i.e.,

F(p) = \sum_{i=1}^{|G_p|} \frac{f_{g_i^p}}{||f_{g_i^p}||_2} \alpha_{g_i^p} \prod_{j=1}^{i-1} (1 − \alpha_{g_j^p}).   (10)

Accordingly, ||F(p)||_2 ranges in [0, 1]. When the 3D features along a ray are perfectly aligned, ||F(p)||_2 = 1. Thus, we impose a regularization on the rendered feature norm:

L_norm(p) = 1 − ||F(p)||_2.   (11)

With the feature norm regularization term, the loss of SAGA for one training iteration is defined as:

L = \sum_{(p_1, p_2) ∈ δ(I)×δ(I)} L_corr(s, p_1, p_2) + \sum_{p ∈ δ(I)} L_norm(p),   (12)

where δ(I) denotes the set of pixels within the image I. Additional Training Strategy During training, an unavoidable issue is data imbalance, reflected in three ways: 1) most pixel pairs remain positive or negative regardless of scale variations, making the learned features insensitive to scales; 2) the majority of pixel pairs show negative correspondence, resulting in feature collapse; 3) large targets that occupy more pixels in images have more effect on the optimization, leading to poor performance when segmenting small targets. We tackle this problem by resampling the pixel pairs and re-weighting the loss function for different samples. Please refer to Section A.1 of the appendix.
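The two loss terms, Equations (9) and (11), can be sketched as follows. This is a NumPy mock-up under our own naming; in practice the rendered features would come from the differentiable rasterizer so that gradients reach the per-Gaussian features:

```python
import numpy as np

def corr_loss(corr_m, f1, f2):
    """Eq. (9): attract pairs with corr_m = 1, repel pairs with corr_m = 0;
    the cosine similarity is clipped at 0 for stability."""
    cos = float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2)))
    return (1.0 - 2.0 * corr_m) * max(cos, 0.0)

def norm_loss(rendered_feat):
    """Eq. (11): penalize rendered feature norms below 1, pushing the
    unit-normalized 3D features along a ray toward a common direction."""
    return 1.0 - float(np.linalg.norm(rendered_feat))
```

Note the sign flip in Equation (9): for a positive pair the coefficient (1 − 2·1) = −1, so minimizing the loss pushes the similarity up; for a negative pair the coefficient is +1 and the clipped similarity is pushed down to 0.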
3.5 Inference With well-trained Gaussian affinity features, SAGA can conduct various segmentation tasks in 3D space. For promptable segmentation, SAGA takes 2D point prompts at a specific view and a scale as input. Then, SAGA segments the 3D target by matching the scale-gated 3D Gaussian affinity features with the 2D query features selected from the rendered feature map according to the prompt points. For automatic scene decomposition, SAGA employs HDBSCAN to cluster the affinity features directly in 3D space². Additionally, we design a vote-based segmentation mechanism to integrate SAGA with CLIP for conducting open-vocabulary segmentation (see Section A.3 of the supplement). 4 Experiments In this section, we demonstrate the effectiveness of SAGA both quantitatively and qualitatively. For implementation details and experiment settings, please refer to Section A.2. ²For efficiency, SAGA uniformly selects 1% of the Gaussians from the 3D-GS model for clustering.

Table 1: Results on the NVOS dataset.
Method | mIoU (%) | mAcc (%)
NVOS (Ren et al. 2022) | 70.1 | 92.0
ISRF (Goel et al. 2023) | 83.8 | 96.4
SA3D (Cen et al. 2023) | 90.3 | 98.2
OmniSeg3D (Ying et al. 2024) | 91.7 | 98.4
GauGroup (Ye et al. 2024) | 85.6 | 97.3
SA3D-GS (Cen et al. 2024) | 92.2 | 98.5
SAGA (ours) | 92.6 | 98.6

4.1 Datasets For promptable segmentation experiments, we utilize two datasets: NVOS (Ren et al. 2022) and SPIn-NeRF (Mirzaei et al. 2023). The former is derived from the LLFF dataset (Mildenhall et al. 2019) and the latter is a combination of subsets of data from established NeRF-related datasets (Mildenhall et al. 2019, 2020; Lin et al. 2022; Knapitsch et al. 2017; Fridovich-Keil et al. 2022). For open-vocabulary segmentation experiments, we adopt the 3D-OVS dataset (Liu et al. 2023a). For qualitative analysis (Figures 3 and 4), we employ various datasets including LLFF (Mildenhall et al. 2019), MIP-360 (Barron et al. 2022), Tanks&Temples (Knapitsch et al. 2017), and Replica (Straub et al. 2019).
These datasets encompass indoor and outdoor scenes, forward-facing and 360-degree scenes, as well as synthetic and real scenes. 4.2 Quantitative Results NVOS Dataset As shown in Table 1, SAGA outperforms previous segmentation approaches for both 3D-GS and other radiance fields, i.e., +0.4 mIoU over the previous SOTA SA3D-GS and +0.9 mIoU over OmniSeg3D. SPIn-NeRF Dataset Results on the SPIn-NeRF dataset can be found in Table 2. SAGA performs on par with the previous SOTA OmniSeg3D. The minor performance degradation is attributed to sub-optimal geometry learned by 3D-GS. For example, 3D-GS models reflection effects with numerous outlier Gaussians that are not aligned with the exact geometry of the object. Excluding these Gaussians results in empty holes in the segmentation mask for certain views, while including them introduces noise in other views. Nevertheless, we believe the segmentation accuracy of SAGA can meet most requirements. Open-Vocabulary Semantic Segmentation As shown in Table 3, SAGA demonstrates superior results across all scenes in the 3D-OVS dataset. For more details about open-vocabulary segmentation, please refer to Section A.3. Time Consumption Analysis In Table 4, we report the time consumption of SAGA and compare it with existing promptable segmentation methods in radiance fields. ISRF trains a feature field by mimicking the 2D multi-view visual features extracted by DINO (Caron et al. 2021), and thus enjoys faster convergence. However, this leads to less accurate segmentation, necessitating extensive post-processing. Figure 3: Qualitative results of SAGA across different scenes. We provide both the targets segmented via 2D point prompts and the segment-everything results.

Table 2: Results on the SPIn-NeRF dataset.
Method | mIoU (%) | mAcc (%)
MVSeg (Mirzaei et al. 2023) | 90.9 | 98.9
SA3D (Cen et al. 2023) | 92.4 | 98.9
OmniSeg3D (Ying et al. 2024) | 94.3 | 99.3
GauGroup (Ye et al. 2024) | 86.5 | 98.9
SA-GS (Hu et al. 2024) | 89.9 | 98.7
SA3D-GS (Cen et al. 2024) | 93.2 | 99.1
SAGA (ours) | 93.4 | 99.2

Table 3: Results on the 3D-OVS dataset (mIoU).
Method | bed | bench | room | sofa | lawn | mean
LERF | 73.5 | 53.2 | 46.6 | 27.0 | 73.7 | 54.8
3D-OVS | 89.5 | 89.3 | 92.8 | 74.0 | 88.2 | 86.8
LangSplat | 92.5 | 94.2 | 94.1 | 90.0 | 96.1 | 93.4
N2F2 | 93.8 | 92.6 | 93.5 | 92.1 | 96.3 | 93.9
SAGA | 97.4 | 95.4 | 96.8 | 93.5 | 96.6 | 96.0

Table 4: Time consumption comparison.
Method | Training | Inference
SA3D (Cen et al. 2023) | - | 45 s
SA-GS (Hu et al. 2024) | - | 15 s
ISRF (Goel et al. 2023) | 2.5 mins | 3.3 s
OmniSeg3D (Ying et al. 2024) | 15-40 mins | 50-100 ms
GARField (Kim et al. 2024) | 20-60 mins | 30-70 ms
SAGA (ours) | 10-40 mins | 2-5 ms

Figure 4: SAGA can maintain the high-frequency texture details captured by 3D-GS. We reveal the inherent structure of these details by shrinking the Gaussians by 60%. SA3D and SA-GS employ an iterative mask refinement pipeline, which eliminates the need for training but incurs significant inference time. Compared to methods that distill segmentation capabilities from SAM masks, such as OmniSeg3D and GARField, SAGA demonstrates much faster inference speed and comparable training speed. 4.3 Qualitative Results Figure 3 shows that SAGA achieves fine-grained segmentation at various scales across different scenes. The compact Gaussian affinity features enable scene decomposition through simple clustering in the 3D-GS model. The needle-like artifacts on the edges of segmented targets are due to 3D-GS overfitting multi-view RGB without object awareness. Building upon 3D-GS, which can capture high-frequency texture details, SAGA can effectively segment thin, fine-grained structures, as shown in Figure 4. By shrinking the 3D Gaussians, we reveal the underlying structural modeling capabilities of 3D-GS and demonstrate the completeness of the segmentation results.
Since GARField lacks quantitative results on the NVOS and SPIn-NeRF datasets, we provide qualitative comparisons in Section A.6 to highlight the effectiveness of SAGA's learned features. 4.4 Ablation Study Local Feature Smoothing (LFS) & Feature Norm Regularization (FNR) Both LFS and FNR impose constraints on the Gaussian affinity features. We present visualization results to illustrate their roles. In Figure 5, when segmenting a 3D object with a cosine similarity threshold of 0.75, the result without feature smoothing shows many false positives, which reveals that outliers are primarily eliminated by the feature smoothing operation. Figure 5: Ablation study on the effects of local feature smoothing (Smooth) and feature norm regularization (Feature Norm). Outliers are primarily eliminated through local feature smoothing. Feature norm regularization helps features of inner Gaussians align better with those of the surface. Figure 6: Failure cases of SAGA. The targets of interest are outlined in red. Unlike local feature smoothing, feature norm regularization primarily impacts the Gaussians within objects. When raising the similarity threshold to 0.95, the apple segmented by SAGA with feature norm regularization remains intact, while the one without it quickly becomes translucent. This phenomenon supports our assumption that 3D features are not perfectly aligned with 2D features, as introduced in Section 3.4. Imposing feature norm regularization helps align the affinity features of 3D Gaussians by pulling the features along a ray in the same direction. 5 Limitation SAGA learns the affinity features from multi-view 2D masks extracted by SAM, which makes it hard for SAGA to segment objects that do not appear in these masks. As shown in Figure 6, this limitation is particularly evident when the target of interest is small. Enhancing the generalization ability of SAGA to targets unrecognized during the automatic extraction stage is a promising direction.
6 Conclusion In this paper, we propose SAGA, a 3D promptable segmentation method for 3D Gaussian Splatting (3D-GS). SAGA injects the segmentation capability of SAM into Gaussian affinity features for all 3D Gaussians in a 3D-GS model, endowing them with a new property towards segmentation. To preserve the multi-granularity segmentation ability of SAM and the efficiency of 3D-GS, SAGA introduces a lightweight scale-gate mechanism, which adapts the affinity features according to different 3D physical scales with minimal computation overhead. After training, SAGA can achieve real-time fine-grained 3D segmentation. Comprehensive experiments are conducted to demonstrate the effectiveness of SAGA. As one of the first methods addressing promptable segmentation in 3D-GS, the simplicity and effectiveness of SAGA pave the way for future advancements in this field. Acknowledgments This work was supported by NSFC 62322604, NSFC 62176159, Shanghai Municipal Science and Technology Major Project 2021SHZDZX0102. References Barron, J. T.; Mildenhall, B.; Verbin, D.; Srinivasan, P. P.; and Hedman, P. 2022. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. In CVPR. Bhalgat, Y.; Laina, I.; Henriques, J. F.; Vedaldi, A.; and Zisserman, A. 2023. Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion. In NeurIPS. Bhalgat, Y.; Laina, I.; Henriques, J. F.; Zisserman, A.; and Vedaldi, A. 2024. N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields. arXiv:2403.10997. Bing, W.; Chen, L.; and Yang, B. 2023. DM-NeRF: 3D Scene Geometry Decomposition and Manipulation from 2D Images. In ICLR. Boykov, Y. Y.; and Jolly, M.-P. 2001. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In ICCV. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; and Joulin, A. 2021. Emerging Properties in Self-Supervised Vision Transformers. In ICCV.
Cen, J.; Fang, J.; Zhou, Z.; Yang, C.; Xie, L.; Zhang, X.; Shen, W.; and Tian, Q. 2024. Segment Anything in 3D with Radiance Fields. arXiv:2304.12308. Cen, J.; Zhou, Z.; Fang, J.; Yang, C.; Shen, W.; Xie, L.; Jiang, D.; Zhang, X.; and Tian, Q. 2023. Segment Anything in 3D with NeRFs. In NeurIPS. Chen, A.; Xu, Z.; Geiger, A.; Yu, J.; and Su, H. 2022a. TensoRF: Tensorial Radiance Fields. In ECCV. Chen, X.; Tang, J.; Wan, D.; Wang, J.; and Zeng, G. 2023. Interactive Segment Anything NeRF with Feature Imitation. arXiv:2305.16233. Chen, X.; Zhao, Z.; Zhang, Y.; Duan, M.; Qi, D.; and Zhao, H. 2022b. FocalClick: Towards practical interactive image segmentation. In CVPR. Fan, Z.; Wang, P.; Jiang, Y.; Gong, X.; Xu, D.; and Wang, Z. 2023. NeRF-SOS: Any-View Self-supervised Object Segmentation on Complex Scenes. In ICLR. Fang, J.; Yi, T.; Wang, X.; Xie, L.; Zhang, X.; Liu, W.; Nießner, M.; and Tian, Q. 2022. Fast Dynamic Radiance Fields with Time-Aware Neural Voxels. In SIGGRAPH Asia 2022 Conference Papers. Fridovich-Keil, S.; Yu, A.; Tancik, M.; Chen, Q.; Recht, B.; and Kanazawa, A. 2022. Plenoxels: Radiance Fields without Neural Networks. In CVPR. Fu, X.; Zhang, S.; Chen, T.; Lu, Y.; Zhu, L.; Zhou, X.; Geiger, A.; and Liao, Y. 2022. Panoptic NeRF: 3D-to-2D Label Transfer for Panoptic Urban Scene Segmentation. In 3DV. Goel, R.; Sirikonda, D.; Saini, S.; and Narayanan, P. 2023. Interactive segmentation of radiance fields. In CVPR. Grady, L. 2006. Random walks for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. Gulshan, V.; Rother, C.; Criminisi, A.; Blake, A.; and Zisserman, A. 2010. Geodesic star convexity for interactive image segmentation. In CVPR. Guo, H.; Zhu, H.; Peng, S.; Wang, Y.; Shen, Y.; Hu, R.; and Zhou, X. 2024. SAM-guided Graph Cut for 3D Instance Segmentation. arXiv:2312.08372. Hamilton, M.; Zhang, Z.; Hariharan, B.; Snavely, N.; and Freeman, W. T. 2022. Unsupervised Semantic Segmentation by Distilling Feature Correspondences. In ICLR.
Hedman, P.; Srinivasan, P. P.; Mildenhall, B.; Reiser, C.; Barron, J. T.; and Debevec, P. 2024. Baking Neural Radiance Fields for Real-Time View Synthesis. IEEE TPAMI. Hu, X.; Wang, Y.; Fan, L.; Fan, J.; Peng, J.; Lei, Z.; Li, Q.; and Zhang, Z. 2024. SAGD: Boundary-Enhanced Segment Anything in 3D Gaussian via Gaussian Decomposition. arXiv:2401.17857. Kerbl, B.; Kopanas, G.; Leimkühler, T.; and Drettakis, G. 2023. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM TOG. Kerr, J.; Kim, C. M.; Goldberg, K.; Kanazawa, A.; and Tancik, M. 2023. LERF: Language embedded radiance fields. In ICCV. Kim, C. M.; Wu, M.; Kerr, J.; Tancik, M.; Goldberg, K.; and Kanazawa, A. 2024. GARField: Group Anything with Radiance Fields. In CVPR. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A. C.; Lo, W.-Y.; et al. 2023. Segment anything. In ICCV. Knapitsch, A.; Park, J.; Zhou, Q.-Y.; and Koltun, V. 2017. Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction. ACM TOG. Kobayashi, S.; Matsumoto, E.; and Sitzmann, V. 2022. Decomposing NeRF for Editing via Feature Field Distillation. In NeurIPS. Lin, Y.; Florence, P.; Barron, J. T.; Lin, T.; Rodriguez, A.; and Isola, P. 2022. NeRF-Supervision: Learning Dense Object Descriptors from Neural Radiance Fields. In ICRA. Lindell, D. B.; Martel, J. N. P.; and Wetzstein, G. 2021. AutoInt: Automatic Integration for Fast Neural Volume Rendering. In CVPR. Liu, K.; Zhan, F.; Zhang, J.; Xu, M.; Yu, Y.; Saddik, A. E.; Theobalt, C.; Xing, E.; and Lu, S. 2023a. Weakly Supervised 3D Open-vocabulary Segmentation. In NeurIPS. Liu, Q.; Xu, Z.; Bertasius, G.; and Niethammer, M. 2023b. SimpleClick: Interactive image segmentation with simple vision transformers. In ICCV. Liu, X.; Chen, J.; Yu, H.; Tai, Y.; and Tang, C. 2022. Unsupervised Multi-View Object Segmentation Using Radiance Field Propagation. In NeurIPS. Liu, Y.; Hu, B.; Huang, J.; Tai, Y.-W.; and Tang, C.-K. 2023c.
Instance neural radiance field. In ICCV. Liu, Y.; Hu, B.; Tang, C.-K.; and Tai, Y.-W. 2024. SANe RFHQ: Segment Anything for Ne RF in High Quality. In CVPR. Lyu, W.; Li, X.; Kundu, A.; Tsai, Y.-H.; and Yang, M.-H. 2024. Gaga: Group Any Gaussians via 3D-aware Memory Bank. ar Xiv:2404.07977. Mildenhall, B.; Srinivasan, P. P.; Cayon, R. O.; Kalantari, N. K.; Ramamoorthi, R.; Ng, R.; and Kar, A. 2019. Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM TOG. Mildenhall, B.; Srinivasan, P. P.; Tancik, M.; Barron, J. T.; Ramamoorthi, R.; and Ng, R. 2020. Ne RF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV. Mirzaei, A.; Aumentado-Armstrong, T.; Derpanis, K. G.; Kelly, J.; Brubaker, M. A.; Gilitschenski, I.; and Levinshtein, A. 2023. SPIn-Ne RF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields. In CVPR. M uller, T.; Evans, A.; Schied, C.; and Keller, A. 2022. Instant neural graphics primitives with a multiresolution hash encoding. ACM TOG. Niemeyer, M.; and Geiger, A. 2021. GIRAFFE: Representing Scenes As Compositional Generative Neural Feature Fields. In CVPR. Qin, M.; Li, W.; Zhou, J.; Wang, H.; and Pfister, H. 2024. Lang Splat: 3D Language Gaussian Splatting. In CVPR. Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML. Ren, Z.; Agarwala, A.; Russell, B. C.; Schwing, A. G.; and Wang, O. 2022. Neural Volumetric Object Selection. In CVPR. Rother, C.; Kolmogorov, V.; and Blake, A. 2004. Grab Cut : interactive foreground extraction using iterated graph cuts. ACM TOG. Siddiqui, Y.; Porzi, L.; Bul o, S. R.; M uller, N.; Nießner, M.; Dai, A.; and Kontschieder, P. 2023. Panoptic lifting for 3d scene understanding with neural fields. In CVPR. Sofiiuk, K.; Petrov, I. A.; and Konushin, A. 2022. 
Reviving iterative training with mask guidance for interactive segmentation. In ICIP. Stelzner, K.; Kersting, K.; and Kosiorek, A. R. 2021. Decomposing 3D Scenes into Objects via Unsupervised Volume Segmentation. ar Xiv:2104.01148. Straub, J.; Whelan, T.; Ma, L.; Chen, Y.; Wijmans, E.; Green, S.; Engel, J. J.; Mur-Artal, R.; Ren, C.; Verma, S.; Clarkson, A.; Yan, M.; Budge, B.; Yan, Y.; Pan, X.; Yon, J.; Zou, Y.; Leon, K.; Carter, N.; Briales, J.; Gillingham, T.; Mueggler, E.; Pesqueira, L.; Savva, M.; Batra, D.; Strasdat, H. M.; Nardi, R. D.; Goesele, M.; Lovegrove, S.; and Newcombe, R. 2019. The Replica Dataset: A Digital Replica of Indoor Spaces. ar Xiv:1906.05797. Sun, C.; Sun, M.; and Chen, H. 2022. Direct Voxel Grid Optimization: Super-fast Convergence for Radiance Fields Reconstruction. In CVPR. Tschernezki, V.; Laina, I.; Larlus, D.; and Vedaldi, A. 2022. Neural Feature Fusion Fields: 3D Distillation of Self Supervised 2D Image Representations. In 3DV. Vora, S.; Radwan, N.; Greff, K.; Meyer, H.; Genova, K.; Sajjadi, M. S.; Pot, E.; Tagliasacchi, A.; and Duckworth, D. 2022. Nesf: Neural semantic fields for generalizable semantic segmentation of 3d scenes. TMLR. Wizadwongsa, S.; Phongthawee, P.; Yenphraphai, J.; and Suwajanakorn, S. 2021. Ne X: Real-Time View Synthesis With Neural Basis Expansion. In CVPR. Xu, M.; Yin, X.; Qiu, L.; Liu, Y.; Tong, X.; and Han, X. 2023. SAMPro3D: Locating SAM Prompts in 3D for Zero Shot Scene Segmentation. ar Xiv:2311.17707. Yang, Y.; Wu, X.; He, T.; Zhao, H.; and Liu, X. 2023. SAM3D: Segment Anything in 3D Scenes. ar Xiv:2306.03908. Ye, M.; Danelljan, M.; Yu, F.; and Ke, L. 2024. Gaussian Grouping: Segment and Edit Anything in 3D Scenes. ar Xiv:2312.00732. Yin, Y.; Liu, Y.; Xiao, Y.; Cohen-Or, D.; Huang, J.; and Chen, B. 2024. Sai3d: Segment any instance in 3d scenes. In CVPR. Ying, H.; Yin, Y.; Zhang, J.; Wang, F.; Yu, T.; Huang, R.; and Fang, L. 2024. 
Omni Seg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning. In CVPR. Yu, H.; Guibas, L. J.; and Wu, J. 2022. Unsupervised Discovery of Object Radiance Fields. In ICLR. Zhi, S.; Laidlow, T.; Leutenegger, S.; and Davison, A. J. 2021. In-Place Scene Labelling and Understanding with Implicit Scene Representation. In ICCV. Zou, X.; Yang, J.; Zhang, H.; Li, F.; Li, L.; Wang, J.; Wang, L.; Gao, J.; and Lee, Y. J. 2023. Segment everything everywhere all at once. In Neur IPS.