# SE(3)-Equivariant Diffusion Policy in Spherical Fourier Space

Xupeng Zhu 1 2, Fan Wang 3, Robin Walters 2, Jane Shi 3

1 Work done as an intern at Amazon Robotics. 2 Khoury College of Computer Sciences, Boston, Massachusetts, USA. 3 Amazon Robotics, Boston, Massachusetts, USA. Correspondence to: Xupeng Zhu.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

## Abstract

Diffusion Policies are effective at learning closed-loop manipulation policies from human demonstrations but generalize poorly to novel arrangements of objects in 3D space, hurting real-world performance. To address this issue, we propose Spherical Diffusion Policy (SDP), an SE(3)-equivariant diffusion policy that adapts trajectories according to 3D transformations of the scene. Such equivariance is achieved by embedding the states, actions, and the denoising process in spherical Fourier space. Additionally, we employ novel spherical FiLM layers to condition the action denoising process equivariantly on the scene embeddings. Lastly, we propose a spherical denoising temporal U-net that achieves spatiotemporal equivariance with computational efficiency. As a result, SDP is end-to-end SE(3) equivariant, allowing robust generalization across transformed 3D scenes. SDP demonstrates a large performance improvement over strong baselines in 20 simulation tasks and 5 physical robot tasks, including single-arm and bi-manual embodiments. Code is available at https://github.com/amazon-science/Spherical_Diffusion_Policy.

## 1. Introduction

Diffusion Policy (Chi et al., 2023) has emerged as an effective method for learning closed-loop policies from human demonstration. This success is based on the ability of diffusion models (Ho et al., 2020) to approximate multi-modal human demonstrations (Mandlekar et al., 2021). A particularly challenging aspect of real-world robotic manipulation, which is often underrepresented in synthetic benchmarks, is that objects may be found in a wide range of 3D poses.

Figure 1. SDP enforces that the policy is SO(3) equivariant. Specifically, in the second row, an SO(3) rotation applied to the scene leads to an equivalent rotation of the latent spherical Fourier features in the neural networks enc and $\epsilon_\theta$, and of the generated trajectory (blue dots). Fourier features are visualized as spherical signals.

Consider, for example, grasping a dish that is randomly placed in the sink, threading a nut onto a bolt with random orientation, or wiping the curved surface of a car. Diffusion Policy may struggle to attain robust 3D generalization without training on a large number of costly human demonstrations that exhaust the possible 3D arrangements of the scene.

We propose Spherical Diffusion Policy (SDP), a Fourier-space SE(3)-equivariant method that automatically adapts to changes in the scene. SDP improves on recent works in equivariant diffusion policy learning, which are limited to SO(2) equivariance (Wang et al., 2024b), equivariant only to single-object transformations (Yang et al., 2024a; Tie et al., 2024), or computationally heavy (Tie et al., 2024). In contrast, our method is lightweight and SE(3) equivariant across multiple objects, allowing it to perform more complicated tasks with less engineering. SDP achieves translational invariance by formulating states and actions in the gripper frame (Chi et al., 2024).
Figure 1 illustrates the SO(3) equivariance of the proposed method. If the scene is transformed by a 3D rotation, then the action trajectory denoised by Spherical Diffusion Policy is rotated by the same rotation. Since this equivariance is embedded in the neural network, it does not rely on additional data to train and thus achieves high sample efficiency. The equivariance constraints lead to provable SE(3) generalization to transformed scenes.

The contributions of this work are: 1. a novel method, Spherical Diffusion Policy, which is equivariant to 3D rotations and invariant to 3D translations, enabling generalization to unseen scenes; 2. a novel spherical FiLM layer for SO(3)-equivariant conditioning; 3. a novel spherical denoising temporal U-net for denoising trajectories with spatiotemporal equivariance; 4. theoretical validation that SDP is equivariant; 5. empirical validation of SDP through extensive experiments that include 20 simulation and 5 physical tasks spanning single-arm and bi-manual embodiments.

## 2. Background

**Diffusion Policy** is a closed-loop imitation learning method that learns, from expert demonstrations, a policy $\pi(s) = a$ mapping states to action trajectories. The states $s$ consist of a camera observation $o$ (e.g., images, voxels, or point clouds) and the end-effector's 6-DoF pose (3D translation and 3D rotation) $e_T, e_R$ and aperture $e_{grip}$. The action $a$ specifies the 6-DoF pose $a_T, a_R$ and gripper aperture $a_{grip}$. The policy takes as input a history of $h$ states $S_t = [s_t, s_{t-1}, \ldots, s_{t-h+1}]$. The output is a sequence of $r$ actions $A_t = [a_t, a_{t+1}, \ldots, a_{t+r-1}]$.

Diffusion Policy (Chi et al., 2023) leverages diffusion models (Ho et al., 2020; Song et al., 2021) to learn from multi-modal human demonstrations (Mandlekar et al., 2021). Diffusion Policy infers actions by sampling $A_t^K$ from Gaussian noise, then performing $K$ iterations of denoising, producing $A_t^K, A_t^{K-1}, \ldots, A_t^0$. The final iterate $A_t^0$ is the output action. The denoising process is defined by:

$$A_t^{k-1} = \alpha \left( A_t^k - \gamma\, \epsilon_\theta(S_t, A_t^k, k) + z \right), \quad z \sim \mathcal{N}(0, \sigma^2 I), \tag{1}$$

where $\epsilon_\theta(S_t, A_t^k, k)$ is a learnable denoising function, parameterized by $\theta$, that estimates the noise $\epsilon^k$ based on the state $S_t$, the noisy action $A_t^k$, and the step $k$. The parameters $\alpha, \gamma, \sigma$ define the noise schedule and are functions of the denoising step $k$. Finally, the denoising function is trained to predict the noise added to the expert action:

$$\mathcal{L} = \left\| \epsilon_\theta(S_t, A_t^0 + \epsilon, k) - \epsilon \right\|^2. \tag{2}$$
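To make the denoising process concrete, the sketch below renders Equations 1 and 2 in PyTorch. It is a minimal illustration under stated assumptions, not the paper's released code: `eps_theta` stands for any noise-prediction network, and the schedule arrays `alpha`, `gamma`, `sigma` are assumed precomputed from a DDPM noise schedule. The loss uses the simplified form of Equation 2, which omits the forward-process scaling of $A_t^0$ used in full DDPM implementations.

```python
import torch

def denoise_trajectory(eps_theta, S_t, A_K, alpha, gamma, sigma, K):
    """Equation (1): iteratively denoise an action trajectory.

    eps_theta: callable (S, A_k, k) -> estimated noise, same shape as A_k.
    alpha, gamma, sigma: length-K noise-schedule coefficients (assumed given).
    """
    A_k = A_K  # A_t^K, drawn from Gaussian noise by the caller
    for k in reversed(range(K)):
        # z ~ N(0, sigma^2 I); no noise is injected on the final step.
        z = torch.randn_like(A_k) if k > 0 else torch.zeros_like(A_k)
        A_k = alpha[k] * (A_k - gamma[k] * eps_theta(S_t, A_k, k) + sigma[k] * z)
    return A_k  # A_t^0, the action trajectory to execute

def denoising_loss(eps_theta, S_t, A_0, K):
    """Equation (2): train eps_theta to recover the injected noise."""
    k = torch.randint(0, K, ())
    eps = torch.randn_like(A_0)
    return ((eps_theta(S_t, A_0 + eps, k) - eps) ** 2).mean()
```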
**Equivariance** describes the property of a function which commutes with the transformations of a symmetry group $G$: $\rho_{out}(g) f(x) = f\left(\rho_{in}(g)\, x\right)$ for all $g \in G$. Here, the $\rho$'s denote group representations, mapping each group element to an invertible matrix (Serre et al., 1977). The 2D special orthogonal group SO(2) describes planar rotations, and its subgroup $C_n$ discretizes SO(2) into $n$ rotations. Similarly, SO(3) describes 3D rotations. We denote the group of 3D translations by $T(3)$. The special Euclidean group $SE(3) = SO(3) \ltimes T(3)$ includes both 3D rotations and translations. For any group, the trivial representation $\rho_0$ assigns the identity matrix $\rho_0(g) = I$ to each group element. This makes invariance a special case of equivariance where the output representation $\rho_{out} = \rho_0$. For SO(3), there are higher-dimensional representations $\rho_1, \rho_2, \ldots$ that will be introduced later. Representations can be combined by direct sum $\rho(g) = \rho'(g) \oplus \rho''(g)$, where $\rho'(g)$ and $\rho''(g)$ are diagonal blocks of $\rho(g)$.

**Equivariant Policy Learning** assumes the policy is equivariant: $\pi(g \cdot S) = g \cdot a$ for all $g \in G$, where $G$ could be the SO(2) group or the SE(3) group. One way to achieve equivariance is by recognizing and modeling an equivariant function using equivariant neural networks. Ryu et al. (2024) show that for Brownian diffusion on the SE(3) manifold, if the target function $\pi(S) = a$ is equivariant, then the denoising function $\epsilon_\theta$ is also equivariant: $\epsilon_\theta(g \cdot S, g \cdot A, k) = g \cdot \epsilon_\theta(S, A, k)$. EquiDiff (Wang et al., 2024b) extends this open-loop equivariance (Ryu et al., 2024) to closed-loop setups, but it is limited to SO(2) equivariance. For an additional introduction, see Appendix C.1.

Another way to achieve equivariance is by canonicalizing the input $S$ and output $a$ of a neural network (Zeng et al., 2022; Wang et al., 2021; Jia et al., 2023; Chi et al., 2024). For example, if $a$ is a 3D translation, then canonicalizing $S$ involves translating it inversely so that the action is at the origin: $S_{can} = S - a$, $a_{can} = a - a$, for $a \in T(3)$. Intuitively, canonicalization eliminates the transformation applied to the state and action by always evaluating the state in the canonicalized view. Refer to Appendix C.2 for proof.

**Spherical Harmonics (SH)** are functions on the sphere $Y_l^m : S^2 \to \mathbb{R}$ which give an orthonormal basis for the function space $L^2(S^2, \mathbb{R})$. They are indexed by degree $l \in \mathbb{Z}_{\geq 0}$ and order $m \in \mathbb{Z}$ with $-l \leq m \leq l$. A spherical function in spatial space can be transformed into the frequency domain by a spherical Fourier transform $\mathcal{F} : f \mapsto \{c_l^m\}$, where the $c_l^m$ are Fourier coefficients. Inversely, the inverse spherical Fourier transform $\mathcal{F}^{-1}$ converts the Fourier coefficients back to spatial values: $f(u) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} c_l^m Y_l^m(u)$. Spherical functions and SH are SO(3) steerable and thus suitable for SO(3)-equivariant networks. Essentially, a rotation of $f$ in spatial space is equivalent to a rotation of the $c_l^m$ in the frequency domain by the Wigner D-matrices $D^l$, which are orthogonal. That is, $f' = g \cdot f$ for $g \in SO(3)$ is equivalent to $c_l'^{\,n} = \sum_m D_{mn}^l(g)\, c_l^m$, where the $c_l'^{\,n}$ are the Fourier coefficients of $f'$. For example, degree-0 ($\rho_0$) Fourier coefficients $c_0 \in \mathbb{R}$ are scalars that are invariant to rotation, and degree-1 ($\rho_1$) Fourier coefficients $c_1 \in \mathbb{R}^3$ are 3D vectors whose Wigner D-matrix is a standard 3D rotation matrix. A spherical Fourier signal up to degree $L$ has $(L+1)^2$ coefficients (Cohen et al., 2018; Bonev et al., 2023). SDP leverages this compact representation. Convolving two sets of spherical Fourier signals (Cohen et al., 2018; Klee et al., 2023) leads to a signal over SO(3), which has $\sum_{l=0}^{L} (2l+1)^2$ coefficients, as adopted in ET-SEED (Tie et al., 2024).
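The steerability property can be checked numerically. The sketch below is an illustrative test (not from the paper's code) assuming the conventions of the e3nn library (Geiger & Smidt, 2022), where spherical harmonics satisfy $Y_l(Rx) = D^l(R)\, Y_l(x)$: rotating the Fourier coefficients of a band-limited function by the block-diagonal Wigner-D matrix matches rotating its argument on the sphere.

```python
import torch
from e3nn import o3

# A band-limited spherical signal with one degree-0 and one degree-1 channel.
irreps = o3.Irreps("1x0e + 1x1e")
c = irreps.randn(-1)            # Fourier coefficients: 1 + 3 = 4 numbers
R = o3.rand_matrix()            # a random rotation g in SO(3)
D = irreps.D_from_matrix(R)     # block-diagonal Wigner-D matrix for these irreps

u = torch.nn.functional.normalize(torch.randn(3), dim=0)  # a point on S^2

def evaluate(coeffs, point):
    """f(u) = sum_l c_l . Y_l(u), truncated at L = 1."""
    y0 = o3.spherical_harmonics(0, point, normalize=True)
    y1 = o3.spherical_harmonics(1, point, normalize=True)
    return coeffs[:1] @ y0 + coeffs[1:] @ y1

# Rotating the coefficients equals rotating the sphere: f'(u) = f(g^{-1} u).
lhs = evaluate(D @ c, u)
rhs = evaluate(c, R.T @ u)
assert torch.allclose(lhs, rhs, atol=1e-4)
```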
**Equiformer** (Liao & Smidt, 2023) and **EquiformerV2** (Liao et al., 2024) are SE(3)-equivariant graph neural networks (GNNs) (Passaro & Zitnick, 2023). In contrast to conventional GNNs that treat each node in the graph as a scalar, Equiformer attaches spherical features to each node. These features are compactly approximated by truncated Fourier coefficients, up to degree $l \leq L$. Messages are aggregated from neighbor nodes in the graph through the edges by equivariant graph attention. This is followed by equivariant spherical linear and activation layers. The spherical linear layer treats degree-$l$ Fourier coefficients as high-dimensional vectors and performs a linear mapping in each degree separately. The spherical activation layer (Geiger & Smidt, 2022) performs an inverse Fourier transform, applies a conventional activation point-wise on the sphere, and then converts the result back to Fourier coefficients.

## 3. Related Works

**Closed-loop Robot Policy Imitation Learning** learns robot skills from human demonstrations through machine learning. Though it is a straightforward and general framework, it faces multiple challenges. One challenge is the error-compounding effect, where action prediction error causes future states to diverge from the training states and further exacerbates the next action prediction (Ke et al., 2021). To combat this, action chunking (Lai et al., 2022; Mandlekar et al., 2021; Chi et al., 2023; Zhao et al., 2023b) predicts and executes a trajectory of actions instead of a single action step. Another challenge is learning from multi-modal human demonstrations. Multiple methods have been proposed to fit a multi-modal policy, including Gaussian Mixture Models (Mandlekar et al., 2021; Zhu et al., 2022b), Variational Autoencoders (Zhao et al., 2023b; Mousavian et al., 2019), Energy-Based Models (Implicit Models) (Florence et al., 2022), and Diffusion Models (Janner et al., 2022; Pearce et al., 2023; Chi et al., 2023). Building on (Chi et al., 2023), this work leverages an additional inductive bias, equivariance, to achieve significantly better performance.

**Equivariance in Robot Learning** Robotic policies operate in the 3D world, which carries rich symmetries. (Zhu et al., 2022a; Huang et al., 2022; Zhu et al., 2023; Hu et al., 2024) investigated equivariance in grasp learning. (Wang et al., 2021; Huang et al., 2024c; Simeonov et al., 2022; Zhao et al., 2023a; Ryu et al., 2024; Huang et al., 2024a; Gao et al., 2024; Zhu et al., 2025b) developed equivariant open-loop policies. (Van der Pol et al., 2020; Wang et al., 2022b;a; Jia et al., 2023; Liu et al., 2023; Wang et al., 2024c;b; Yang et al., 2024b;a) verified the effectiveness of equivariance in closed-loop agents. Among these works, (Zhu et al., 2022a; Zhao et al., 2023a; Jia et al., 2023; Liu et al., 2023; Huang et al., 2024c; Wang et al., 2024b; Zhu et al., 2025a) utilize discrete equivariance, which suffers from discretization error. On the other hand, (Ryu et al., 2024; Huang et al., 2024a; Gao et al., 2024; Hu et al., 2024; Zhu et al., 2025b) leverage continuous equivariance but are limited to open-loop settings. Moreover, EquiBot (Yang et al., 2024a) is limited to degree $l = 1$ representations that suppress rich information, and ET-SEED (Tie et al., 2024) uses heavy SO(3) irreducible representations that require two-stage inference to alleviate the computational burden. Furthermore, (Yang et al., 2024a;b; Tie et al., 2024) require a segmentation pipeline engineered for each task to exclude everything but one object in the workspace. In contrast, our work is the first to leverage continuous and compact spherical Fourier features to achieve an SE(3)-equivariant, end-to-end, and computationally efficient closed-loop policy.

**Diffusion Models and Equivariant Diffusion Models** Diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021) are probabilistic generative models that have demonstrated a strong capability for modeling multi-modal distributions. Such capability is achieved by iteratively removing noise from an initial sample randomly drawn from an underlying distribution. Equivariance was introduced to diffusion models in (Xu et al., 2022; Hoogeboom et al., 2022; Yim et al., 2023) in the context of molecule generation.
Diffusion models have been applied to robotics in open-loop settings (Ke et al., 2024; Ryu et al., 2024; Jiang et al., 2023; Urain et al., 2023; Huang et al., 2024b) and closed-loop settings (Janner et al., 2022; Pearce et al., 2023; Chi et al., 2023; 2024; Ze et al., 2024; Wang et al., 2024a; Liu et al., 2024; Brehmer et al., 2024). The most relevant works on equivariant diffusion policies include (Wang et al., 2024b; Zhao et al., 2025; Yang et al., 2024a; Tie et al., 2024), where (Wang et al., 2021; Zhao et al., 2025; Hu et al., 2025) are limited to discretized SO(2) equivariance. (Yang et al., 2024a; Tie et al., 2024) are SO(3) equivariant but require engineering effort to segment out everything but the target object; even so, these methods are designed to handle a single object in the scene. Moreover, (Tie et al., 2024) is based on heavy SO(3) irreducible representations and needs a two-stage diffusion process. In contrast, our method is SE(3) equivariant and based on compact yet expressive spherical representations; it can be learned end-to-end without task-specific engineering effort and generalizes to multi-object tasks.

## 4. Method

Figure 2. Method overview. During inference, SDP first embeds the state $S_t$ into a spherical scene feature $C_t$ with the encoder enc. Then, the SDTU $\epsilon_\theta$ estimates the noise $\epsilon$ based on the noisy actions $A_t^k$, step $k$, and the scene feature $C_t$. The noise is subtracted from the noisy actions, generating cleaner actions $A_t^{k-1}$. This denoising process is performed for $K$ iterations, generating a clean trajectory $A_t^0$.

### 4.1. Method Overview

The Spherical Diffusion Policy model maps observations to actions, $\pi(S) = A$. We assume the optimal policy is SE(3) equivariant and enforce this assumption in the model. Specifically, we enforce rotation equivariance $\pi(g \cdot S) = g \cdot A$ for $g \in SO(3)$, and translation invariance $\pi(t \cdot S) = A$ for $t \in T(3)$. The model is thus SE(3) equivariant, where $T(3)$ acts trivially on the actions.

The rotational equivariance of $\pi$ is enforced by an equivariant denoising function $\epsilon_\theta$, as proven in (Ryu et al., 2024; Wang et al., 2024b). Specifically, we use an equivariant conditional denoising function $\epsilon_\theta(S, A + \epsilon^k, k)$ to estimate the noise from a noisy action $A + \epsilon^k$, the step $k$, and the state $S$. We model $\epsilon_\theta$ using three components, as shown in Figure 2: i) a spherical encoder embeds the state into a multi-channel spherical scene feature $enc(S) = C$; ii) a spherical denoising temporal U-net (SDTU) estimates the noise from the noisy action and step, conditioned on the scene feature, $\epsilon_\theta(C, A^k + \epsilon^k, k)$; and iii) spherical FiLM (SFiLM) layers achieve this equivariant conditioning. Since these three components are equivariant, the denoising function is equivariant by composition.

Translational invariance of $\pi$ is achieved using a relative action formulation (Chi et al., 2024), which canonicalizes the state-action pair (Zeng et al., 2022) with respect to translations by centering the observation on the gripper and defining action positions relative to the gripper:

$$S_{can,i} = \left(O - e_T^i,\ e_T^i - e_T^i,\ e_R^i,\ e_{grip}^i\right), \quad A_{can,i} = \left(A_T^i - e_T^i,\ A_R^i,\ A_{grip}^i\right).$$

See Appendix C.2 for proof. For the single-arm setting, $i \in \{0\}$ denotes the gripper. Additionally, we propose a bi-manual relative action representation. In this case, $i \in \{0, 1\}$, and we canonicalize the state and action to the left ($i = 0$) and the right ($i = 1$) gripper's position.
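A minimal sketch of the single-arm relative (gripper-frame) canonicalization follows. The tuple layout and function name are hypothetical conveniences, not SDP's actual data structures; the point is only that subtracting the end-effector position $e_T$ everywhere cancels any global translation.

```python
import torch

def canonicalize(points, ee_pos, ee_rot, ee_grip, act_pos, act_rot, act_grip):
    """Gripper-frame (relative) formulation for one arm (i = 0): subtract the
    end-effector position e_T from the observation and the action positions,
    making both invariant to 3D translations of the scene. Rotations and
    gripper apertures are left untouched."""
    S_can = (points - ee_pos,    # O - e_T
             ee_pos - ee_pos,    # e_T - e_T = 0: the gripper sits at the origin
             ee_rot, ee_grip)
    A_can = (act_pos - ee_pos,   # A_T - e_T
             act_rot, act_grip)
    return S_can, A_can

# A translation t applied to the whole scene cancels out:
# canonicalize(points + t, ee_pos + t, ..., act_pos + t, ...) equals
# canonicalize(points, ee_pos, ..., act_pos, ...) element-wise.
```

For the bi-manual formulation described above, the same subtraction would simply be repeated once per gripper $i \in \{0, 1\}$.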
### 4.2. Representing State and Action by Spherical Signals

In this section, we propose a spherical representation of the state and action for the policy. There are several advantages to using spherical Fourier features as latent features. First, the truncated spherical Fourier coefficients provide a compact approximation of spherical features and are compatible with SO(3) rotations, in contrast to the computationally heavy SO(3) irreps used in ET-SEED (Tie et al., 2024). Furthermore, higher-degree coefficients can represent finer details than EquiBot (Yang et al., 2024a), which adopted Vector Neurons (Deng et al., 2021) that only support 3D vectors (analogous to type $l = 1$), suppressing rich higher-degree information in the latent features. For example, vector representations cannot capture spherical distributions with two distinct modes. Lastly, spherical features support equivariance to the continuous group SO(3) (continuous rotations), in contrast to the discretized group $C_8$ (discretized rotations) in EquiDiff (Wang et al., 2024b), which suffers from discretization error.

The end-effector state $e$, the action $a_t$, and the noise $\epsilon$ share the same geometric structure, consisting of a 3D position, a 3D rotation, and 1D gripper aperture information. We decompose the end-effector data into a 3D position vector, a $3 \times 3$ rotation matrix, and a 1D scalar. The rotation matrix can be viewed as 3 column vectors. We represent the position vector as a degree-1 vector, the rotation matrix as 3 degree-1 vectors, and the aperture as a scalar in the trivial representation. That is, $e, a_t, \epsilon \in \rho_{ee} = \rho_1^{\oplus 4} \oplus \rho_0$. Intuitively, the position and rotation matrix are rotated by the rotation matrix corresponding to the rotation of the state, while the gripper aperture stays unchanged.

We adopt point clouds as the observation $o$ and treat color information as degree-0 spherical coefficients (as in (Ryu et al., 2024)), since color is invariant to point cloud rotation. The point cloud is encoded into a latent vector by a 5-layer ResNet (He et al., 2016) encoder enc(·). The encoder is implemented with EquiformerV2 (Liao et al., 2024), which extracts a high-degree spherical signal from the point cloud; see Appendix E for details. The robot state $e$ is concatenated to the output of the encoder, yielding $C$.
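The $\rho_{ee} = \rho_1^{\oplus 4} \oplus \rho_0$ packing can be sketched as follows. This is an illustrative decomposition with hypothetical function names, showing how the four degree-1 vectors rotate together while the aperture scalar stays fixed.

```python
import torch

def random_rotation():
    """Sample a proper rotation: exp of a skew-symmetric matrix lies in SO(3)."""
    A = torch.randn(3, 3)
    return torch.linalg.matrix_exp(A - A.T)

def pose_to_irreps(pos, rot, grip):
    """Pack (position, rotation matrix, aperture) into rho_ee: four degree-1
    vectors (the position plus the 3 rotation-matrix columns) and one
    degree-0 scalar (the gripper aperture)."""
    vecs = torch.cat([pos.unsqueeze(0), rot.T], dim=0)  # (4, 3)
    return vecs, grip

def act(R, vecs, grip):
    """Action of g in SO(3) on rho_ee: every degree-1 vector rotates by R,
    while the degree-0 scalar is left unchanged."""
    return vecs @ R.T, grip

# Check: rotating the pose and then packing matches packing and then acting.
pos, grip, rot = torch.randn(3), torch.rand(1), random_rotation()
R = random_rotation()
vecs, s = pose_to_irreps(pos, rot, grip)
rotated_vecs, _ = pose_to_irreps(R @ pos, R @ rot, grip)
assert torch.allclose(rotated_vecs, act(R, vecs, grip)[0], atol=1e-5)
```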
### 4.3. Spherical Denoising Temporal U-net

Figure 3. Spherical denoising temporal U-net (SDTU). Left: the SDTU $\epsilon_\theta$ estimates the noise $\epsilon$ based on the noisy actions $A_t^k$, the denoising step index $k$, and the encoded scene $C$. The SDTU has a U-net architecture with 4 spherical down- or up-convolution blocks. Right: details of a spherical down- or up-convolution block.

The spherical denoising temporal U-net $\epsilon_\theta$ infers the noise from the noisy action $A^k + \epsilon^k$, the denoising step index $k$, and the state embedding $C$ as $\epsilon_\theta(C, A^k + \epsilon^k, k)$. The vector $C$ encodes the state in spherical Fourier space up to degree $L$. The input $A^k + \epsilon^k$ and the output are spherical signals in $\rho_{ee}$, as introduced in Section 4.2. The denoising step index is encoded using sinusoidal embeddings, treated as degree-0 features.

The SDTU is a 1D U-net with spherical Fourier features that is spatiotemporally equivariant. Temporal equivariance is achieved using 1D convolution along the time dimension $t$, as proposed in Diffuser (Janner et al., 2022). We incorporate an additional spherical Fourier dimension in the latent features to achieve spatial (SO(3)) equivariance. This equivariance is enforced by a channel-mixing temporal convolution applied separately to each degree $l$ of the spherical Fourier coefficients:

$$h_{l,m,t}^{o} = \sum_{i \in \text{in}} \sum_{j} h_{l,m,j}^{i}\, w_{l,\,t-j}^{i,o}, \tag{3}$$

where $i$ and $o$ index the input and output channels, $l$ and $m$ are the degree and order of the spherical Fourier coefficient $h$, "in" denotes the set of input channels, and the subscript $j$ indexes time in the predicted trajectory. Note that the weights $w$ depend on the degree $l$ but not on the order $m$.

**Proposition 4.1.** The channel-mixing temporal convolution in Equation 3 is SO(3) equivariant:

$$\sum_{m} D_{mn}^{l}(g)\, h_{l,m,t}^{o} = \sum_{i \in \text{in}} \sum_{j} \left( \sum_{m} D_{mn}^{l}(g)\, h_{l,m,j}^{i} \right) w_{l,\,t-j}^{i,o}. \tag{4}$$

For the proof, see Appendix A.1. The proof essentially follows from Schur's lemma (Schur, 1905): any linear combination of degree-$l$ Fourier features whose weights are independent of the order $m$ is equivariant. This convolution is followed by a spherical activation (Cohen et al., 2018; Geiger & Smidt, 2022) for expressiveness. Strided and transposed convolutions are used for down- and up-sampling in the U-net, as in Diffuser (Janner et al., 2022). Spherical FiLM layers are adopted for equivariant conditioning and are described in the next section. Figure 3 summarizes the SDTU.

### 4.4. Spherical FiLM Conditioning Layer

We propose equivariant spherical FiLM (SFiLM) layers that extend the Feature-wise Linear Modulation (FiLM) layer (Perez et al., 2018) used by Diffuser (Janner et al., 2022) to the spherical Fourier domain. The condition on the sphere $C$ is projected into a scaling condition $\gamma$ and an offset condition $\beta$ by equivariant linear layers (Geiger & Smidt, 2022): $\gamma = \Gamma(C)$, $\beta = B(C)$. Then, SFiLM conditions each degree $l$ separately. Specifically, SFiLM treats $\gamma_l, \beta_l$ as $(2l+1)$-dimensional vectors and modulates the hidden feature $h_l$ by projecting $\gamma_l$ onto $h_l$ as a scaling condition and adding $\beta_l$ as an offset condition:

$$\text{SFiLM}(h_l \mid \gamma_l, \beta_l) = \gamma_l^{\top} h_l\, \frac{h_l}{\|h_l\|} + \beta_l. \tag{5}$$

SFiLM supports high-degree Fourier coefficients for expressiveness, unlike EquiBot (Yang et al., 2024a), which only supports degree 1 and thus drops rich information in the latent features.

**Proposition 4.2.** The SFiLM layer in Equation 5 is SO(3) equivariant:

$$D(g)\, \text{SFiLM}(h_l \mid \gamma_l, \beta_l) = \text{SFiLM}\left(D(g) h_l \mid D(g) \gamma_l, D(g) \beta_l\right), \quad \forall g \in SO(3). \tag{6}$$

The proposition is proved using the orthogonality of the Wigner D-matrices $D$ and Schur's lemma (Schur, 1905); see Appendix A.2 for details.
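The two equivariant building blocks above can be sketched compactly. The code below is an illustrative PyTorch rendering of Equation 3 (a per-degree, channel-mixing temporal convolution, with a bias only on degree 0, since a constant offset on higher degrees would not rotate with the input) and of Equation 5 (SFiLM). Tensor layouts and class names are assumptions, not the released implementation.

```python
import torch

class DegreewiseTemporalConv(torch.nn.Module):
    """Equation (3): a 1D convolution over time that mixes channels but is
    applied to each degree l separately, never mixing the 2l+1 orders m.
    Per Schur's lemma, such per-degree linear maps commute with Wigner-D."""

    def __init__(self, c_in, c_out, degrees, kernel=3):
        super().__init__()
        # Bias is only safe on degree 0: higher-degree features must rotate.
        self.convs = torch.nn.ModuleList(
            torch.nn.Conv1d(c_in, c_out, kernel, padding=kernel // 2,
                            bias=(l == 0))
            for l in degrees)
        self.degrees = degrees

    def forward(self, h):
        """h: dict mapping degree l -> tensor of shape (B, c_in, 2l+1, T)."""
        out = {}
        for conv, l in zip(self.convs, self.degrees):
            B, C, M, T = h[l].shape
            # Fold the order axis m into the batch so all orders share weights.
            x = h[l].permute(0, 2, 1, 3).reshape(B * M, C, T)
            y = conv(x)                                   # (B*M, c_out, T)
            out[l] = y.reshape(B, M, -1, T).permute(0, 2, 1, 3)
        return out

def sfilm(h, gamma, beta):
    """Equation (5), for one degree l: scale h by the rotation-invariant
    inner product gamma^T h / ||h|| and add the equivariant offset beta.
    All inputs have shape (..., 2l+1)."""
    scale = (gamma * h).sum(dim=-1, keepdim=True)         # gamma^T h
    return scale * h / h.norm(dim=-1, keepdim=True) + beta
```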
## 5. Experiments

### 5.1. Simulation Experiments

**Experimental Settings** We conduct simulation experiments in the MimicGen (Mandlekar et al., 2023) environment, built on the MuJoCo simulator (Todorov et al., 2012), which features diverse tasks that are contact-rich, precise, and long-horizon (see Figure 4). Unlike scripted demonstrations, which are unimodal, or demonstrations generated by Reinforcement Learning (RL) agents, which are Markovian, MimicGen generates multi-modal, non-Markovian trajectories from a few human demonstrations (Mandlekar et al., 2021), making it well suited for benchmarking learning from human demonstrations. MimicGen provides observations in the form of RGBD images from both a front view and an in-hand view, along with a 7-DoF robot state. The RGB images have a resolution of $84 \times 84 \times 3$, while the RGBD data can be used to reconstruct either 3D colored voxels ($84^3$) or colored point clouds (PCD) with 1024 points. For PCD, we exclude table points, following DP3 (Ze et al., 2024). The action space in MimicGen consists of a 6-DoF gripper pose and a 1-DoF gripper aperture.

Three control modes are used: Absolute Control, which defines the gripper trajectory in the robot frame; Relative Control, which defines it in the current gripper frame; and Velocity Control, which determines the next gripper pose relative to the previous one (Chi et al., 2024).

To evaluate robustness, we modify four MimicGen tasks with SE(3) initialization by randomly tilting the table within a defined range and randomly placing objects on the tabletop while keeping the robot base upright. Benchmarking is conducted across three difficulty levels with progressively increasing tilt ranges: $[0°]$, $[-15°, 15°]$, and $[-30°, 30°]$. Additionally, we compare various baselines across all 12 original MimicGen tasks.

We compare several baselines in our experiments: 1) EquiDiff (Wang et al., 2024b), an SO(2)-equivariant diffusion policy using either voxel or RGB image observations; 2) DiffPo (Chi et al., 2023), the original diffusion policy, employing either a convolutional (-C) or transformer (-T) backbone in the diffusion network; 3) EquiBot (Yang et al., 2024a), an SO(3)-equivariant diffusion policy with representations up to degree $l = 1$; 4) DP3 (Ze et al., 2024), a diffusion policy based on point cloud representations; 5) ACT (Zhao et al., 2023b) (Action Chunking Transformer), a model capturing multi-modality via a Variational Autoencoder (VAE); 6) BC-RNN (Mandlekar et al., 2021), a behavioral cloning approach that captures multi-modality using a Gaussian Mixture Model (GMM) and accounts for non-Markovian dynamics via a Recurrent Neural Network (RNN). A relevant baseline, ET-SEED (Tie et al., 2024), is not included because its code was unavailable before the initial submission. Following (Chi et al., 2023; Wang et al., 2024b), we train all baselines using DDPM (Ho et al., 2020) with 100 denoising steps. For details on hyperparameters, see Appendix D. We report the maximum test success rate throughout training, averaging results over 50 rollouts for each of three seeds.

**Results on Tasks with SE(3) Initialization** Table 1 shows that SDP outperforms all baselines across all tilting ranges, except on the Coffee 0° task, demonstrating superior sample efficiency. Notably, as the tilting range increases, SDP achieves a more significant relative performance improvement over the baselines. This highlights SDP's strong SE(3) generalization, enabled by its continuous SE(3) equivariance. However, performance declines for all methods, including SDP, as the tilting range increases. We hypothesize that this drop is caused by point cloud occlusion and object instability due to gravity, both of which disrupt SE(3) equivariance.

**Results on Tasks with SE(2) Initialization** Table 2 shows that SDP outperforms all baselines on 10 of 12 tasks, the exceptions being Coffee and Coffee Preparation. Despite the variations being limited to SE(2), SDP still demonstrates a notable advantage, suggesting that its continuous SE(2) equivariance benefits learning more effectively than the discrete $C_8$ equivariance in EquiDiff. The lower performance of SDP on Coffee and Coffee Preparation may be attributed to the low-resolution point clouds, which struggle to capture fine details, such as the slack between the coffee pod and its receptacle, potentially hindering precise manipulation.

### 5.2. Physical Experiments

**Experimental Settings** We further evaluate the performance of SDP across the 5 physical tasks shown in Figure 5, using the robot station shown in Figure A1.
Turn Lever involves manipulating an articulated object, while Push Eraser requires pushing a small eraser. Grasp Box challenges the policy to maintain a closed kinematic chain. Flip Book involves rich contact between the end-effector, the tabletop, and the book. Pack Package is a long-horizon task. The observations are captured by two stationary RGBD cameras positioned above the workspace to minimize occlusion. Point clouds with 1024 points are reconstructed from the RGBD images (for Push Eraser we also tested 2048 points). The actions are 6-DoF gripper poses for single-arm tasks or 12-DoF gripper poses for bi-manual tasks.

Figure 4. MimicGen tasks with SE(3) initialization ((a)-(c), showing 1 of 4 tasks; (a) Coffee 0°) and SE(2) initialization ((d)-(f), showing 3 of 12 tasks; (d) Thr. Pc. D2, (e) Square D2, (f) Threading D2).

Table 1. Evaluation success rate on 4 MimicGen tasks with 3 levels of SE(3) initialization. We train all baselines on progressively tilted environments with 100 demonstrations. As the degree of SE(3) initialization increases, SDP maintains reasonable performance while the performance of the other baselines drops severely. Results averaged over three seeds; each task reports success at tilt 0° / 15° / 30°.

| Method | Equ. | Ctrl | Obs | Coffee | Three Pc. Assembly | Square | Threading | Average |
|---|---|---|---|---|---|---|---|---|
| SDP | SE(3) | Rel | PCD | 63 / 54 / 33 | 67 / 49 / 37 | 62 / 38 / 31 | 60 / 53 / 39 | 63 / 49 / 35 |
| EquiDiff (Wang et al., 2024b) | $C_8 \le SO(2)$ | Abs | Voxel | 65 / 43 / 29 | 37 / 15 / 8 | 39 / 3 / 3 | 39 / 20 / 10 | 45 / 20 / 13 |
| DiffPo (Chi et al., 2023) | N/A | Abs | RGB | 44 / 23 / 16 | 4 / 2 / 2 | 8 / 0 / 1 | 17 / 10 / 8 | 18 / 9 / 7 |
| EquiBot (Yang et al., 2024a) | SO(3) | Abs | PCD | 0 / 1 / 0 | 1 / 1 / 1 | 0 / 1 / 1 | 6 / 4 / 0 | 2 / 2 / 1 |

Table 2. Evaluation success rate on 12 MimicGen tasks with SE(2) initialization. We train all baselines with 100 demonstrations. SDP demonstrates the best performance on 10 tasks. Results averaged over three seeds.

| Method | Ctrl | Obs | Stack D1 | Stack Three D1 | Square D2 | Threading D2 | Coffee D2 | Three Pc. Assembly D2 | Hammer Cleanup D1 | Mug Cleanup D1 | Kitchen D1 | Nut Assembly D0 | Pick Place D0 | Coffee Preparation D1 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SDP | Rel | PCD | 100 | 98 | 62 | 60 | 63 | 67 | 82 | 54 | 89 | 92 | 73 | 73 | 76 |
| EquiDiff (Wang et al., 2024b) | Abs | Voxel | 99 | 75 | 39 | 39 | 65 | 37 | 70 | 53 | 85 | 67 | 58 | 80 | 64 |
| EquiDiff (Wang et al., 2024b) | Abs | RGB | 93 | 55 | 25 | 22 | 60 | 15 | 65 | 49 | 67 | 74 | 42 | 77 | 54 |
| DiffPo-C (Chi et al., 2023) | Abs | RGB | 76 | 38 | 8 | 17 | 44 | 4 | 52 | 43 | 67 | 55 | 35 | 65 | 42 |
| DiffPo-T (Chi et al., 2023) | Abs | RGB | 51 | 17 | 5 | 11 | 47 | 1 | 48 | 30 | 54 | 31 | 15 | 38 | 29 |
| DP3 (Ze et al., 2024) | Abs | PCD | 69 | 7 | 7 | 12 | 34 | 0 | 54 | 21 | 45 | 16 | 12 | 10 | 24 |
| ACT (Zhao et al., 2023b) | Abs | RGB | 35 | 6 | 6 | 10 | 19 | 0 | 38 | 23 | 37 | 42 | 7 | 32 | 21 |
| EquiDiff (Wang et al., 2024b) | Vel | Voxel | 95 | 59 | 25 | 33 | 55 | 5 | 64 | 39 | 69 | 53 | 40 | 48 | 49 |
| EquiDiff (Wang et al., 2024b) | Vel | RGB | 75 | 25 | 11 | 11 | 41 | 1 | 49 | 29 | 61 | 44 | 29 | 49 | 35 |
| DiffPo-C (Chi et al., 2023) | Vel | RGB | 81 | 26 | 6 | 13 | 43 | 2 | 43 | 25 | 42 | 42 | 35 | 42 | 33 |
| BC-RNN (Mandlekar et al., 2021) | Vel | RGB | 59 | 12 | 8 | 7 | 37 | 0 | 32 | 19 | 31 | 35 | 21 | 14 | 23 |

Figure 5. Five physical robotic manipulation tasks: (a) single-arm Turn Lever, (b) single-arm Push Eraser, (c) bi-manual Grasp Box, (d) bi-manual multi-step Flip Book, (e) bi-manual multi-step Pack Package.

**Training Dataset from Human Demonstrations** We use GELLO (Wu et al., 2023) to collect demonstrations with objects initialized in random SE(3) poses. For the single-arm tasks, we collect 30 successful human demonstrations. Additionally, we record an extra 10% of demonstrations as recovery demos at specific poses.
Similarly, for the bi-manual Grasp Box task, we collect 33 demos at random SE(3) poses. For the more challenging bi-manual tasks, Flip Book and Pack Package, we collect 66 demos. Figure A3 visualizes the SE(3) training pose distribution for all five tasks. Further details on the tasks are provided in Appendix B.2.

Table 3. Success rate (%) on 5 physical experiments over 20 evaluation episodes. The action space and number of training demonstrations are listed under each task. Overall, SDP is 61% better than EquiDiff and 71% better than DiffPo-C. Results from one seed. * using point clouds with 2048 points.

| Method | Turn Lever (6-DoF, 33 demos) | Push Eraser (6-DoF, 33 demos) | Grasp Box (12-DoF, 33 demos) | Flip Book (12-DoF, 66 demos) | Pack Package (12-DoF, 66 demos) | Avg. |
|---|---|---|---|---|---|---|
| SDP | 80 | 35 / 90* | 85 | 65 | 70 | 78 |
| EquiDiff | 20 | 30 | 35 | 0 | 0 | 17 |
| DiffPo-C | 10 | 10 | 15 | 0 | 0 | 7 |

**Results and Discussion** We benchmark SDP against the top two baselines, EquiDiff (Wang et al., 2024b) and DiffPo-C (Chi et al., 2023), from the simulation experiments (Table 1). We train all three models using DDIM (Song et al., 2021) and run inference with 8 denoising steps for all tasks. Each baseline is evaluated on 20 rollouts per physical task, with each rollout initialized using novel SE(3) object poses unseen in the training set. The success rates are summarized in Table 3, with a detailed breakdown provided in Table A1. For the Push Eraser task, increasing the PCD resolution to 2048 points enables accurate localization of the eraser, resulting in a performance improvement of over 50%.

SDP significantly outperforms all baseline methods across every task and embodiment, achieving a 61% higher success rate than EquiDiff and a 71% improvement over DiffPo-C. These significant gains in sample efficiency and spatial generalization are largely attributed to its inherent SE(3) equivariance. For instance, in the Turn Lever task, SDP successfully locates and rotates a lever that is randomly clamped in 3D space. In comparison, EquiDiff frequently misdirects the gripper to the workspace center, entirely missing the lever, while DiffPo approaches the lever but only hovers nearby without engaging it. Further discussion of common failures can be found in Appendix B.3.

### 5.3. Ablation Study

Table 4 presents six ablations: 1) Discrete SDP replaces the SDTU, which has continuous equivariance, with a discretely equivariant denoising U-net using an Octahedron (cubical) discretization; this is an SE(3) adaptation of (Wang et al., 2024b; Zhao et al., 2025). 2) SDP Absolute Action defines actions in the workspace frame instead of SDP's relative action formulation, which defines actions in the current gripper frame. 3) DP3-canonical is DP3 (Ze et al., 2024) with a canonicalized observation-action space, obtained by transforming the point cloud and trajectory into the gripper frame; this achieves SE(3) invariance. 4) EquiBot Rel. removes SDP's spherical Fourier features and replaces its model with EquiBot (Yang et al., 2024a), while keeping the relative action formulation. 5) DP3 Absolute Action is the original DP3 policy (Ze et al., 2024); this baseline ablates the relative action, spherical representation, and equivariance. 6) EquiBot Absolute Action removes both the spherical Fourier features and the relative action formulation, adopting EquiBot (Yang et al., 2024a) in the absolute action formulation.

The results are shown in Table 4. Discrete SDP, which trivially modifies (Wang et al., 2024b) to be SE(3) equivariant, suffers from discretization error and thus underperforms SDP. The relative action formulation, which achieves 3D translational equivariance, also plays a key role, as its removal in SDP Abs.
causes major performance drops, particularly in the Coffee and Square tasks. DP3-canonical, which leverages SE(3)-invariant features, outperforms plain DP3 but still underperforms SDP by a large margin, demonstrating the advantage of equivariant features; this matches the findings in (Miller et al., 2020). The EquiBot Rel. ablation results in significant performance drops across all four tasks, indicating that the spherical Fourier representation is the most critical factor for SDP's performance. Finally, DP3 Abs. and EquiBot Abs. further amplify the performance degradation, demonstrating that removing both the relative action formulation and the spherical Fourier representation is more detrimental than removing either one alone.

Table 4. Ablation study. The relative action space and the spherical representation are critical for SE(3) generalization, with the latter being more important. Results from one seed.

| Method | Rel. Act. | Spher. Rep. | Equivariance | Coffee 15° | Thr. Pc. As. 15° | Square 15° | Thread. 15° | Avg. SR |
|---|---|---|---|---|---|---|---|---|
| SDP | ✓ | ✓ | SE(3) equ. | 54 | 49 | 38 | 53 | 49 |
| Discrete SDP | ✓ | ✗ | Octahedron | 42 | 16 | 34 | 48 | 35 |
| SDP Abs. | ✗ | ✓ | SO(3) equ. | 18 | 42 | 0 | 44 | 26 |
| DP3-canonical | ✓ | ✗ | SE(3) inv. | 40 | 0 | 8 | 12 | 15 |
| EquiBot Rel. | ✓ | ✗ | SE(3) equ. | 18 | 2 | 4 | 6 | 8 |
| DP3 Abs. | ✗ | ✗ | None | 20 | 0 | 0 | 4 | 6 |
| EquiBot Abs. | ✗ | ✗ | SO(3) equ. | 1 | 1 | 1 | 4 | 2 |

### 5.4. Additional Studies

**Performance vs. Degree $l$ of the Spherical Fourier Features** As shown in Table 5, increasing the degree from 1 to 2 improves performance across all tasks. However, beyond degree 2, performance saturates while computational cost increases. Since SDP is designed for real-time robot control, we select $l = 2$.

Table 5. Success rate vs. degree $l$ of the spherical Fourier feature. Results from one seed.

| Degree $l$ | Coffee 15° | Thr. Pc. As. 15° | Square 15° | Threading 15° | Avg. SR |
|---|---|---|---|---|---|
| 3 | 52 | 52 | 66 | 52 | 56 |
| 2 (SDP) | 54 | 49 | 38 | 53 | 49 |
| 1 | 44 | 4 | 10 | 46 | 26 |

**Sample Efficiency** To assess the impact of training dataset size on performance, we evaluate SDP, EquiDiff, and DiffPo on four MimicGen tasks with tilt angles in the range $[0°, 15°]$, using 100, 316, or 1000 demonstrations. Figure 6 summarizes the results, with each point representing the average success rate across all four tasks. SDP achieves a success rate of 48.5% using only 100 demonstrations, surpassing EquiDiff by 4% while utilizing just one-tenth of the data, indicating a 10× improvement in data efficiency. Similarly, EquiDiff attains a 20% success rate with 100 demonstrations, matching the performance of DiffPo with approximately 300 demonstrations, supporting a 3× gain in data efficiency.

Figure 6. Impact of training dataset size on task success rates: increasing the number of demonstrations from 100 to 316 (a 3× increase) yields an average success rate improvement of 9-12% across four tasks, with initial success rates of 48.5% (SDP), 20.3% (EquiDiff), and 8.8% (DiffPo). At 1,000 demonstrations (10×), the average success rate saturates for SDP at 60%, while EquiDiff and DiffPo continue to improve, reaching 44.5% and 32.5%, respectively.

The observed performance saturation of SDP beyond 300 demonstrations may be attributed to scene occlusions, as relying solely on agent-view and in-hand cameras can obscure critical areas of the workspace. Additionally, all demonstrations are generated from 10 raw human demonstrations (Mandlekar et al., 2023), so increasing the number of generated demonstrations may not increase the diversity of the data.
Furthermore, the kinematic constraints of the robot may limit its ability to access certain regions, thereby impacting overall task success.

**Inference Speed** Table 6 presents the inference time of SDP compared with four other baselines. DiffPo is the fastest (0.09 s) while ET-SEED is the slowest (29.4 s). The inference time of SDP is on the same order of magnitude as that of the best baseline, DiffPo (approximately 5× slower), while SDP achieves continuous SE(3) equivariance, significantly better performance than DiffPo and EquiDiff in simulation and physical experiments, and requires no preprocessing. SDP leverages spherical features through EquiformerV2 (Liao et al., 2024) and the SDTU, which is far more lightweight (66× faster inference, 32× larger batch size) than the SE(3)-Transformer that ET-SEED is based on. Moreover, SDP supports high degrees of spherical harmonics, which is more expressive than Vector Neurons. Lastly, SDP achieves continuous SE(3) equivariance, whereas EquiDiff enforces discrete $C_8 \le SO(2)$ equivariance.

Table 6. Comparison of inference time. At the cost of being about 5× slower than DiffPo, SDP achieves continuous SE(3) equivariance and needs no preprocessing. The inference time and training batch size are measured on a commercial GPU with 24 GB RAM.

| Method | Inference Time (s) | Preprocessing | Batch Size | Equivariance | Neural Network |
|---|---|---|---|---|---|
| SDP | 0.44 | No | 32 | SE(3) | EquiformerV2, SDTU |
| DiffPo | 0.09 | No | 64 | None | Convolution |
| EquiDiff | 0.14 | No | 64 | $C_8 \le SO(2)$ | ESCNN |
| EquiBot | 0.18 | Segmentation | 64 | SE(3) | Vector Neuron |
| ET-SEED | 29.4 | Segmentation | 1 | SE(3) | SE(3)-Transformer |

## 6. Conclusion and Limitations

This paper introduces the Spherical Diffusion Policy (SDP), an SE(3)-equivariant policy that generalizes to 3D scene arrangements using only a few demonstrations. SDP achieves this through three key components: 1) spherical Fourier features, providing compact and precise representations for continuous SO(3) equivariance; 2) spherical FiLM, enforcing equivariant conditioning; and 3) a spherical denoising temporal U-net, ensuring spatiotemporally equivariant denoising. SDP significantly outperforms strong baselines across simulation and real-world experiments, demonstrating effectiveness in both single-arm and bi-manual embodiments.

One limitation of the proposed method is that it operates in position control, ignoring contact forces, which leads to protective stops in the Flip Book task. An important future direction is to address this by learning a compliant, force-aware policy (Kohler et al., 2024; Hou et al., 2024) in an equivariant manner. Another limitation is the low-resolution point cloud processing in the observation encoder, which struggles to capture fine details, such as those in the Push Eraser task. Using a more efficient graph neural network (Zhao et al., 2021; Luo et al., 2024) could help mitigate this issue.

## Impact Statement

On the bright side, the proposed method enhances spatial generalization for manipulation policies, making it potentially deployable in real-world scenarios to significantly reduce human workload. On the dark side, the method lacks awareness of common scenes, which could result in risky actions such as harming individuals, causing fires, or damaging objects.

## Acknowledgments

The authors would like to thank Michael Schultz and Nathan Gere for their help with the physical experiments, Haojie Huang for help with the simulation environments, and Pranay Thangeda and Erica Aduh for help with the robot platform setups.
## References

Bonev, B., Kurth, T., Hundt, C., Pathak, J., Baust, M., Kashinath, K., and Anandkumar, A. Spherical Fourier neural operators: Learning stable dynamics on the sphere. In International Conference on Machine Learning, pp. 2806-2823. PMLR, 2023.

Brehmer, J., Bose, J., De Haan, P., and Cohen, T. S. EDGI: Equivariant diffusion for planning with embodied agents. Advances in Neural Information Processing Systems, 36, 2024.

Chi, C., Feng, S., Du, Y., Xu, Z., Cousineau, E., Burchfiel, B., and Song, S. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.

Chi, C., Xu, Z., Pan, C., Cousineau, E., Burchfiel, B., Feng, S., Tedrake, R., and Song, S. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In Proceedings of Robotics: Science and Systems (RSS), 2024.

Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical CNNs. In International Conference on Learning Representations, 2018.

Crooks, W., Vukasin, G., O'Sullivan, M., Messner, W., and Rogers, C. Fin ray effect inspired soft robotic gripper: From the RoboSoft grand challenge toward optimization. Frontiers in Robotics and AI, 3, 2016. ISSN 2296-9144. doi: 10.3389/frobt.2016.00070. URL https://www.frontiersin.org/journals/robotics-and-ai/articles/10.3389/frobt.2016.00070.

Deng, C., Litany, O., Duan, Y., Poulenard, A., Tagliasacchi, A., and Guibas, L. J. Vector neurons: A general framework for SO(3)-equivariant networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12200-12209, 2021.

Florence, P., Lynch, C., Zeng, A., Ramirez, O. A., Wahid, A., Downs, L., Wong, A., Lee, J., Mordatch, I., and Tompson, J. Implicit behavioral cloning. In Conference on Robot Learning, pp. 158-168. PMLR, 2022.

Gao, C., Xue, Z., Deng, S., Liang, T., Yang, S., Shao, L., and Xu, H. RiEMann: Near real-time SE(3)-equivariant robot manipulation without point cloud segmentation. arXiv preprint arXiv:2403.19460, 2024.

Geiger, M. and Smidt, T. e3nn: Euclidean neural networks. arXiv preprint arXiv:2207.09453, 2022.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.

Hoogeboom, E., Satorras, V. G., Vignac, C., and Welling, M. Equivariant diffusion for molecule generation in 3D. In International Conference on Machine Learning, pp. 8867-8887. PMLR, 2022.

Hou, Y., Liu, Z., Chi, C., Cousineau, E., Kuppuswamy, N., Feng, S., Burchfiel, B., and Song, S. Adaptive compliance policy: Learning approximate compliance for diffusion guided control. arXiv preprint arXiv:2410.09309, 2024.

Hu, B., Zhu, X., Wang, D., Dong, Z., Huang, H., Wang, C., Walters, R., and Platt, R. OrbitGrasp: SE(3)-equivariant grasp learning. CoRR, 2024.

Hu, B., Wang, D., Klee, D., Tian, H., Zhu, X., Huang, H., Platt, R., and Walters, R. 3D equivariant visuomotor policy learning via spherical projection. arXiv preprint arXiv:2505.16969, 2025.

Huang, H., Wang, D., Zhu, X., Walters, R., and Platt, R. Edge grasp network: A graph-based SE(3)-invariant approach to grasp detection. arXiv preprint arXiv:2211.00191, 2022.

Huang, H., Howell, O. L., Wang, D., Zhu, X., Platt, R., and Walters, R. Fourier transporter: Bi-equivariant robotic manipulation in 3D.
In The Twelfth International Conference on Learning Representations, 2024a.

Huang, H., Schmeckpeper, K., Wang, D., Biza, O., Qian, Y., Liu, H., Jia, M., Platt, R., and Walters, R. IMAGINATION POLICY: Using generative point cloud models for learning manipulation policies. In 8th Annual Conference on Robot Learning, 2024b. URL https://openreview.net/forum?id=56IzghzjfZ.

Huang, H., Wang, D., Tangri, A., Walters, R., and Platt, R. Leveraging symmetries in pick and place. The International Journal of Robotics Research, 43(4):550-571, 2024c.

Janner, M., Du, Y., Tenenbaum, J., and Levine, S. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, pp. 9902-9915. PMLR, 2022.

Jia, M., Wang, D., Su, G., Klee, D., Zhu, X., Walters, R., and Platt, R. SEIL: Simulation-augmented equivariant imitation learning. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 1845-1851. IEEE, 2023.

Jiang, H., Salzmann, M., Dang, Z., Xie, J., and Yang, J. SE(3) diffusion model-based point cloud registration for robust 6D object pose estimation. Advances in Neural Information Processing Systems, 36:21285-21297, 2023.

Ke, L., Wang, J., Bhattacharjee, T., Boots, B., and Srinivasa, S. Grasping with chopsticks: Combating covariate shift in model-free imitation learning for fine manipulation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 6185-6191. IEEE, 2021.

Ke, T.-W., Gkanatsios, N., and Fragkiadaki, K. 3D diffuser actor: Policy diffusion with 3D scene representations. arXiv preprint arXiv:2402.10885, 2024.

Klee, D., Biza, O., Platt, R., and Walters, R. Image to sphere: Learning equivariant features for efficient pose prediction. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=_2bDpAtr7PI.

Kohler, C., Srikanth, A. S., Arora, E., and Platt, R. Symmetric models for visual force policy learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 3101-3107, 2024. doi: 10.1109/ICRA57147.2024.10610728.

Köhler, J., Klein, L., and Noé, F. Equivariant flows: Exact likelihood generative learning for symmetric densities. In International Conference on Machine Learning, pp. 5361-5370. PMLR, 2020.

Lai, L., Huang, A. Z., and Gershman, S. J. Action chunking as policy compression. 2022.

Liao, Y.-L. and Smidt, T. Equiformer: Equivariant graph attention transformer for 3D atomistic graphs. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=KwmPfARgOTD.

Liao, Y.-L., Wood, B. M., Das, A., and Smidt, T. EquiformerV2: Improved equivariant transformer for scaling to higher-degree representations. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mCOBKZmrzD.

Liu, S., Xu, M., Huang, P., Zhang, X., Liu, Y., Oguchi, K., and Zhao, D. Continual vision-based reinforcement learning with group symmetries. In Tan, J., Toussaint, M., and Darvish, K. (eds.), Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, pp. 222-240. PMLR, 06-09 Nov 2023. URL https://proceedings.mlr.press/v229/liu23a.html.

Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., and Zhu, J. RDT-1B: A diffusion foundation model for bimanual manipulation, 2024. URL https://arxiv.org/abs/2410.07864.

Luo, S., Chen, T., and Krishnapriyan, A. S.
Enabling efficient equivariant operations in the Fourier basis via Gaunt tensor products. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mhyQXJ6JsK.

Mandlekar, A., Xu, D., Wong, J., Nasiriany, S., Wang, C., Kulkarni, R., Fei-Fei, L., Savarese, S., Zhu, Y., and Martín-Martín, R. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021.

Mandlekar, A., Nasiriany, S., Wen, B., Akinola, I., Narang, Y., Fan, L., Zhu, Y., and Fox, D. MimicGen: A data generation system for scalable robot learning using human demonstrations. In Conference on Robot Learning, pp. 1820-1864. PMLR, 2023.

Miller, B. K., Geiger, M., Smidt, T. E., and Noé, F. Relevance of rotationally equivariant convolutions for predicting molecular properties. arXiv preprint arXiv:2008.08461, 2020.

Mousavian, A., Eppner, C., and Fox, D. 6-DOF GraspNet: Variational grasp generation for object manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

Passaro, S. and Zitnick, C. L. Reducing SO(3) convolutions to SO(2) for efficient equivariant GNNs. In International Conference on Machine Learning, pp. 27420-27438. PMLR, 2023.

Pearce, T., Rashid, T., Kanervisto, A., Bignell, D., Sun, M., Georgescu, R., Macua, S. V., Tan, S. Z., Momennejad, I., Hofmann, K., et al. Imitating human behaviour with diffusion models. In The Eleventh International Conference on Learning Representations, 2023.

Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Ryu, H., Kim, J., An, H., Chang, J., Seo, J., Kim, T., Kim, Y., Hwang, C., Choi, J., and Horowitz, R. Diffusion-EDFs: Bi-equivariant denoising generative modeling on SE(3) for visual robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18007-18018, 2024.

Schur, I. Neue Begründung der Theorie der Gruppencharaktere. In Sitzungsberichte der Königlich Preußischen Akademie der Wissenschaften zu Berlin: Jahrgang 1905; Erster Halbband Januar bis Juni, pp. 406-432. Verlag der Königlichen Akademie der Wissenschaften, 1905.

Serre, J.-P. et al. Linear Representations of Finite Groups, volume 42. Springer, 1977.

Simeonov, A., Du, Y., Tagliasacchi, A., Tenenbaum, J. B., Rodriguez, A., Agrawal, P., and Sitzmann, V. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. In 2022 International Conference on Robotics and Automation (ICRA), pp. 6394-6400. IEEE, 2022.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256-2265. PMLR, 2015.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.

Tie, C., Chen, Y., Wu, R., Dong, B., Li, Z., Gao, C., and Dong, H. ET-SEED: Efficient trajectory-level SE(3) equivariant diffusion policy. arXiv preprint arXiv:2411.03990, 2024.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026-5033. IEEE, 2012.

Urain, J., Funk, N., Peters, J., and Chalvatzaki, G.
SE(3)-DiffusionFields: Learning smooth cost functions for joint grasp and motion optimization through diffusion. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 5923-5930. IEEE, 2023.

Van der Pol, E., Worrall, D., van Hoof, H., Oliehoek, F., and Welling, M. MDP homomorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems, 33:4199-4210, 2020.

Wang, C., Fang, H., Fang, H.-S., and Lu, C. RISE: 3D perception makes real-world robot imitation simple and effective. arXiv preprint arXiv:2404.12281, 2024a.

Wang, D., Walters, R., Zhu, X., and Platt, R. Equivariant Q-learning in spatial action spaces. In 5th Annual Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=IScz42A3iCI.

Wang, D., Jia, M., Zhu, X., Walters, R., and Platt, R. On-robot learning with equivariant models. In 6th Annual Conference on Robot Learning, 2022a.

Wang, D., Walters, R., and Platt, R. SO(2)-equivariant reinforcement learning. In International Conference on Learning Representations, 2022b.

Wang, D., Hart, S., Surovik, D., Kelestemur, T., Huang, H., Zhao, H., Yeatman, M., Wang, J., Walters, R., and Platt, R. Equivariant diffusion policy. In 8th Annual Conference on Robot Learning, 2024b.

Wang, D., Zhu, X., Park, J. Y., Jia, M., Su, G., Platt, R., and Walters, R. A general theory of correct, incorrect, and extrinsic equivariance. Advances in Neural Information Processing Systems, 36, 2024c.

Wu, P., Shentu, F., Lin, X., and Abbeel, P. GELLO: A general, low-cost, and intuitive teleoperation framework for robot manipulators. In Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL 2023, 2023. URL https://openreview.net/forum?id=sseGcw79Zh.

Xu, M., Yu, L., Song, Y., Shi, C., Ermon, S., and Tang, J. GeoDiff: A geometric diffusion model for molecular conformation generation. In International Conference on Learning Representations, 2022.

Yang, J., Cao, Z., Deng, C., Antonova, R., Song, S., and Bohg, J. EquiBot: SIM(3)-equivariant diffusion policy for generalizable and data efficient learning. In 8th Annual Conference on Robot Learning, 2024a.

Yang, J., Deng, C., Wu, J., Antonova, R., Guibas, L., and Bohg, J. EquivAct: SIM(3)-equivariant visuomotor policies beyond rigid object manipulation. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 9249-9255. IEEE, 2024b.

Yim, J., Trippe, B. L., De Bortoli, V., Mathieu, E., Doucet, A., Barzilay, R., and Jaakkola, T. SE(3) diffusion model with application to protein backbone generation. In Proceedings of the 40th International Conference on Machine Learning, pp. 40001-40039, 2023.

Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., and Xu, H. 3D diffusion policy. arXiv preprint arXiv:2403.03954, 2024.

Zeng, A., Song, S., Yu, K.-T., Donlon, E., Hogan, F. R., Bauza, M., Ma, D., Taylor, O., Liu, M., Romo, E., et al. Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. The International Journal of Robotics Research, 41(7):690-705, 2022.

Zhao, H., Jiang, L., Jia, J., Torr, P. H., and Koltun, V. Point transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16259-16268, 2021.

Zhao, H., Wang, D., Zhu, Y., Zhu, X., Howell, O., Zhao, L., Qian, Y., Walters, R., and Platt, R. Hierarchical equivariant policy via frame transfer. arXiv preprint arXiv:2502.05728, 2025.
Zhao, L., Zhu, X., Kong, L., Walters, R., and Wong, L. L. Integrating symmetry into differentiable planning with steerable convolutions. In The Eleventh International Conference on Learning Representations, 2023a.

Zhao, T. Z., Kumar, V., Levine, S., and Finn, C. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023b.

Zhu, X., Wang, D., Biza, O., Su, G., Walters, R., and Platt, R. Sample efficient grasp learning using equivariant models. In Proceedings of Robotics: Science and Systems (RSS), 2022a.

Zhu, X., Wang, D., Su, G., Biza, O., Walters, R., and Platt, R. On robot grasp learning using equivariant models. Autonomous Robots, 2023.

Zhu, X., Klee, D., Wang, D., Hu, B., Huang, H., Tangri, A., Walters, R., and Platt, R. Coarse-to-fine 3D keyframe transporter. arXiv preprint arXiv:2502.01773, 2025a.

Zhu, X., Qi, Y., Zhu, Y., Walters, R., and Platt, R. EquAct: An SE(3)-equivariant multi-task transformer for open-loop robotic manipulation, 2025b. URL https://arxiv.org/abs/2505.21351.

Zhu, Y., Joshi, A., Stone, P., and Zhu, Y. VIOLA: Object-centric imitation learning for vision-based robot manipulation. In 6th Annual Conference on Robot Learning, 2022b.

## A.1. Proof of Proposition 4.1

Proof. Focusing on the right-hand side of Equation 4:

$$\sum_{i \in \text{in}} \sum_{j} \sum_{m} D_{mn}^{l}(g)\, h_{l,m,j}^{i}\, w_{l,\,t-j}^{i,o} = \sum_{m} D_{mn}^{l}(g) \sum_{i \in \text{in}} \sum_{j} h_{l,m,j}^{i}\, w_{l,\,t-j}^{i,o} = \sum_{m} D_{mn}^{l}(g)\, h_{l,m,t}^{o},$$

where the first equality holds because the weights $w$ are independent of the order $m$, and the second applies Equation 3. This is an instance of Schur's lemma (Schur, 1905): a linear map that acts on each irreducible subspace as a whole commutes with the group action. ∎

## A.2. Proof of Proposition 4.2

Proof. Focusing on the right-hand side of Equation 6:

$$\begin{aligned}
\text{SFiLM}\left(D(g)h_l \mid D(g)\gamma_l, D(g)\beta_l\right)
&= (D(g)\gamma_l)^{\top} D(g)h_l\, \frac{D(g)h_l}{\|D(g)h_l\|} + D(g)\beta_l \\
&= \gamma_l^{\top} D(g)^{\top} D(g)\, h_l\, \frac{D(g)h_l}{\|D(g)h_l\|} + D(g)\beta_l.
\end{aligned}$$

Because the Wigner D-matrices are orthogonal, $D(g)^{\top} D(g) = I$ and $\|D(g)h_l\| = \|h_l\|$, so

$$\text{SFiLM}\left(D(g)h_l \mid D(g)\gamma_l, D(g)\beta_l\right) = \gamma_l^{\top} h_l\, \frac{D(g)h_l}{\|h_l\|} + D(g)\beta_l = D(g)\left(\gamma_l^{\top} h_l\, \frac{h_l}{\|h_l\|} + \beta_l\right) = D(g)\, \text{SFiLM}(h_l \mid \gamma_l, \beta_l),$$

where the factorization in the second equality follows from linearity (Schur's lemma (Schur, 1905)). ∎

## B. Physical Experiments and Results

### B.1. Physical Experiment Workstation

Our physical robotic workstation, as shown in Figure A1, is composed of two collaborative UR5e manipulators, each equipped with a compliant ray-fin finger (Crooks et al., 2016). Two scene cameras (RealSense D415) are positioned on either side of the workspace. GELLO controllers (Wu et al., 2023) are used to collect human demonstrations for the physical robotic manipulation tasks.

Figure A1. Overview of the manipulation experimental setup: two UR5e manipulators, each with a compliant ray-fin finger, two stationary overhead cameras, and GELLO teleop controllers.

### B.2. Physical Manipulation Tasks and Training Datasets

We experiment with SDP on five physical tasks, as shown in Figure A2: (a) Turn Lever involves manipulating an articulated object and (b) Push Eraser requires pushing a small object, both with a single manipulator; (c) bi-manual Grasp Box challenges the policy to maintain a closed kinematic chain; (d) Flip Book involves rich contact between the end-effector and the book while transforming the book's pose through dexterous, coordinated manipulation; (e) Pack Package is a long-horizon task.

**Turn Lever**: An expert demonstrator moves one ray-fin finger to make flush contact with the edge of the lever and then turns the lever counter-clockwise by at least 60° around the fulcrum; otherwise, the task is considered failed. The lever, initialized with
B. Physical Experiments and Results

B.1. Physical Experiment Workstation

Our physical robotic workstation, shown in Figure A1, is composed of two collaborative UR5e manipulators, each equipped with a compliant ray-fin finger (Crooks et al., 2016). Two scene cameras (RealSense D415) are positioned on either side of the workspace. GELLO controllers (Wu et al., 2023) are used to collect human demonstrations for the physical robotic manipulation tasks.

Figure A1. Overview of the manipulation experimental setup: two UR5e manipulators, each with a compliant ray-fin finger, two stationary overhead cameras, and GELLO teleop controllers.

B.2. Physical Manipulation Tasks and Training Datasets

We experiment with SDP on five physical tasks, as shown in Figure A2: (a) Turn Lever involves manipulating an articulated object and (b) Push Eraser requires pushing a small object, each with a single manipulator; (c) Bi-manual Grasp Box challenges the policy to maintain a closed kinematic chain; (d) Flip Book involves rich contact between the end-effectors and the book, whose pose must be transformed through dexterous, coordinated manipulation; (e) Pack Package is a long-horizon task.

Turn Lever: An expert demonstrator moves one ray-fin finger to make flush contact with the edge of the lever and then turns the lever counter-clockwise by at least 60 degrees around the fulcrum; otherwise the task fails. The lever, initialized with an SE(3) pose, is flexibly positioned with a clamp, using a combination of pitch and yaw angles within the 3D workspace of the manipulator. A total of 33 human demonstrations were collected, composed of 30 successful demonstrations and 3 (10%) recovery demonstrations in which failed states were corrected to reach the successful goal state.

Push Eraser: An expert demonstrator moves one ray-fin finger to make contact with the eraser, which is initialized with an SE(3) pose within the marked boundary on the whiteboard. Once contact is secure, the demonstrator pushes the eraser with the ray-fin finger, in a straight line, towards the closest edge, until the eraser is outside the marked rectangle, reaching the successful goal state; otherwise the task fails. The whiteboard is positioned, flexibly with a clamp, at approximate pitch angles of -15, 0, 30, 45, 60, and 90 degrees and approximate yaw angles of -15, 0, and 15 degrees within the 3D workspace of the manipulator. In total, 33 human demonstrations were collected, consisting of 30 successful demonstrations and 3 (10%) recovery demonstrations.

For the bi-manual manipulation tasks, we design a set of 432 distinct SE(3) poses, consisting of 9 regions on the x-y plane, 3 discrete pitch angles (0, 8, or 16 degrees) provided by an 8-degree wedge, and 16 discrete yaw angles with 15- or 30-degree increments (a counting sketch of this pose set is given at the end of this subsection). We randomly sample an SE(3) pose from this set to position the box, book, or container when collecting human demonstrations.

Grasp Box: An expert demonstrator moves two ray-fin fingers to pinch-grasp the box at a sampled SE(3) pose, then lifts the pinched box at least 40 cm above the flat surface; otherwise the task fails. Similarly, we collect 33 human demonstrations, including 3 recovery demonstrations.

Flip Book: An expert demonstrator moves two ray-fin fingers to pinch-grasp the book along its medium dimension at a sampled SE(3) pose, then rotates the pinched book in-hand so that it is pinched along its smallest dimension, and finally lifts the book at least 40 cm above the flat surface; otherwise the task fails. Due to the precision required for coordinated finger movements, we collect 66 human demonstrations, including 6 recovery demonstrations.

Pack Package: An expert demonstrator moves two ray-fin fingers to pinch-grasp the box, transports it to the pre-pack pose above the container, places the box inside, moves to the pre-lid-close pose, and finally closes the lid. The goal state of the task is the box inside the container with the lid closed. If any step fails, the task is considered a failure. Due to the precision required for coordinated finger movements, we collect 66 human demonstrations, including 6 recovery demonstrations.

Figure A2. Demonstrated robot actions for each task: (a) Turn Lever, (b) Push Eraser, and (c) Grasp Box, each with two trajectory segments and their respective goal states; (d) Flip Book, with three trajectory segments and goal states, where both manipulators must perform coordinated movement after the initial pinch; (e) Pack Package, with five trajectory segments and goal states.
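As a sanity check on the combinatorics above (9 x 3 x 16 = 432), the following sketch enumerates such a pose set. The x-y region centers and the exact yaw spacing are hypothetical placeholders, since the paper specifies only the counts and the increment sizes:

```python
from itertools import product

# Hypothetical enumeration of the 432-pose bi-manual set:
# 9 x-y regions x 3 pitch angles x 16 yaw angles.
xy_regions = [(x, y) for x in (-0.2, 0.0, 0.2)
                     for y in (-0.2, 0.0, 0.2)]   # 9 regions; centers are placeholders
pitches_deg = (0, 8, 16)                          # provided by an 8-degree wedge
yaws_deg = (0, 15, 30, 60, 90, 105, 120, 150,     # one plausible mix of
            180, 195, 210, 240, 270, 285, 300, 330)  # 15/30-degree increments

pose_set = list(product(xy_regions, pitches_deg, yaws_deg))
assert len(pose_set) == 9 * 3 * 16 == 432
```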
B.3. Detailed Evaluation Results

We evaluate performance over 20 rollouts, with object poses randomly sampled from the SE(3) pose set used in training; all poses are annotated, and evaluation is conducted on novel, unseen poses. For all five physical tasks, we report the success rate of each intermediate goal state in Table A1. We compare SDP with EquiDiff (Wang et al., 2024b) and DiffPo (Chi et al., 2023). SDP achieves strong performance, with a success rate of at least 90% on the first step of every task, where both EquiDiff and DiffPo perform poorly. For subsequent steps, the success rate drops by up to 20% in the Flip Book task, where precise coordination of both fingers is required for the in-hand pose rotation.

Table A1. Breakdown of success rates (%) at each step for the five physical experiments over 20 evaluation episodes. The action space and number of training demonstrations are the same as in Table 3.

          Turn Lever       Push Eraser     Grasp Box      Flip Book           Pack Box
Method    Contact  Rotate  Contact  Push   Grasp  Lift    Grasp  Flip  Lift   Pick  Transport  Pack  Locate  Close
SDP       90       80      90       90     90     85      100    80    65     95    85         75    70      70
EquiDiff  35       20      40       30     45     35      0      0     0      45    35         10    0       0
DiffPo    30       10      15       10     30     15      35     0     0      70    35         5     5       0

Figure A4 provides examples of task successes and failures. The most common failure cases for SDP occur during the book-flip step: a lack of precise coordination between the two fingers may cause the book either to drop or to be pinched too tightly, triggering a robot fault. Other failures include invalid pinching during the pick step and object drops due to a loose grip. For the Pack Package task, collisions with the container may occur while transferring the pinched box, and misalignment or incorrect placement can lead to collisions during the packing step. Interestingly, when positioning the lid to close, the robot may mistakenly identify another object as the lid. For the Turn Lever task, the finger may drift away from the lever while attempting to complete the required rotation. For the Push Eraser task, the robot may push in the wrong direction, failing to move the eraser across the boundary. For the Grasp Box task, SDP struggles to pinch the box when its long dimension is parallel to the robot's front (i.e., when the box has a yaw angle of 0 or 180 degrees).

Figure A4. Examples of successes and failures for each task, with green indicating successful behaviors and red indicating failures.

Figure A3. Visualization of the SE(3) pose distribution for the five physical tasks. The initial states of 4 of the 20 episodes are visualized.

C. Additional Background

C.1. Equivariant Diffusion

The theory of equivariant diffusion has been extensively investigated in prior work (Köhler et al., 2020; Brehmer et al., 2024; Ryu et al., 2024; Wang et al., 2024b). Based on these works, we summarize the equivariance properties of the policy and of its denoising function for completeness. There are two scenarios: diffuse equivariance and denoise equivariance.

Proposition C.1 (Diffuse equivariance). If the policy is equivariant to a group G, i.e., $\pi(g \cdot S) = g \cdot \pi(S)$; the distribution D from which the noise is sampled is G-invariant, i.e., $D = g \cdot D$; and the denoising function satisfies

$$\epsilon_\theta(S, \pi(S) + \epsilon, k) = \epsilon, \qquad \epsilon \sim D, \tag{18}$$

then the denoising function is equivariant to G, i.e., $\epsilon_\theta(g \cdot S, g \cdot A^k, k) = g \cdot \epsilon_\theta(S, A^k, k)$.

Proof. We assume the denoising function satisfies Equation 18. Since the equation holds for all $\epsilon \sim D$ and D is G-invariant, we can evaluate it at $g \cdot S$ and $g \cdot \epsilon$:

$$\epsilon_\theta\big(g \cdot S, \pi(g \cdot S) + g \cdot \epsilon, k\big) = g \cdot \epsilon.$$

Using the equivariance of $\pi$ and substituting Equation 18 for $\epsilon$ gives

$$\epsilon_\theta\big(g \cdot S, g \cdot \pi(S) + g \cdot \epsilon, k\big) = g \cdot \epsilon_\theta\big(S, \pi(S) + \epsilon, k\big),$$

as desired.
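To make the mechanism behind Propositions C.1 and C.2 concrete, the following toy check (a sketch with illustrative constants, not the paper's code) verifies that one step of the denoising update in Equation 1 commutes with a rotation when the denoiser is equivariant and the noise sample is transformed along with the inputs:

```python
import torch

torch.manual_seed(0)
alpha, gamma = 0.9, 0.5  # illustrative noise-schedule constants

def eps_theta(S, A, k):
    # A toy equivariant denoiser on 3D vectors: by Schur's lemma, linear
    # equivariant maps on a single vector irrep are scalar multiples of the
    # identity, so scalar coefficients suffice for this illustration.
    return 0.7 * S + 0.3 * A

def denoise_step(S, A, z, k=1):
    # One update of the denoising process (Equation 1, t dropped).
    return alpha * (A - gamma * eps_theta(S, A, k) + z)

g, _ = torch.linalg.qr(torch.randn(3, 3))        # random orthogonal "rotation"
S, A, z = torch.randn(3), torch.randn(3), torch.randn(3)
lhs = denoise_step(S @ g.T, A @ g.T, z @ g.T)    # rotate state, action, and noise
rhs = denoise_step(S, A, z) @ g.T                # rotate the output instead
assert torch.allclose(lhs, rhs, atol=1e-5)
```

Because the isotropic Gaussian is rotation invariant, z and g·z are equal in distribution, so this per-sample identity implies the distributional equivariance used in Proposition C.2 below.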
Proposition C.2 (Denoise equivariance). The action prediction is equivariant to a group G, i.e., $\pi(g \cdot S) = g \cdot \pi(S)$, when the denoising function is equivariant to G, i.e., $\epsilon_\theta(g \cdot S, g \cdot A^k, k) = g \cdot \epsilon_\theta(S, A^k, k)$, and the distribution D from which the noise is sampled is G-invariant, i.e., $D = g \cdot D$.

Proof. Simplifying Equation 1 by dropping t, we have:

$$A^{k-1} = \alpha\big(A^k - \gamma\,\epsilon_\theta(S, A^k, k) + z\big), \qquad z \sim D. \tag{19}$$

When k = K, the predicted noise is itself a sample from D: $\epsilon_\theta(S, A^K, K) = d_1$, $d_1 \sim D$, and the denoised action $A^{K-1}$ is:

$$A^{K-1} = \alpha(0 - \gamma d_1 + d_2), \qquad d_1, d_2 \sim D \tag{20}$$
$$= \alpha(d_2 - \gamma d_1) \tag{21}$$
$$= \alpha(g d'_2 - \gamma\, g d'_1), \qquad d'_1, d'_2 \sim D \tag{22}$$
$$= g\,\alpha(d'_2 - \gamma d'_1) \tag{23}$$
$$= g\, A^{K-1}, \tag{24}$$

where line 22 uses the G-invariance of D.

When k = K - 1, denoising is applied to the noisy action to generate a cleaner action. Transforming the input of the denoising function in Equation 19 by g:

$$\alpha\big(g A^{K-1} - \gamma\,\epsilon_\theta(g \cdot S, g \cdot A^{K-1}, K-1) + z\big) \tag{25}$$
$$= \alpha\big(g A^{K-1} - g\,\gamma\,\epsilon_\theta(S, A^{K-1}, K-1) + g z\big) \tag{26}$$
$$= g\,\alpha\big(A^{K-1} - \gamma\,\epsilon_\theta(S, A^{K-1}, K-1) + z\big) \tag{27}$$
$$= g\, A^{K-2}, \tag{28}$$

where line 26 uses the equivariance of $\epsilon_\theta$ and the G-invariance of D. Equation 28 holds for k = K - 1, K - 2, ..., 1; applying it iteratively yields $g \cdot A^0$.

Propositions C.1 and C.2 specify the prerequisites of an equivariant diffusion policy. Empirically, we find that the G-invariant distribution D can be relaxed to a Gaussian distribution N, which is simple and achieves good performance.

C.2. Translation Invariance by Canonicalization

Translation invariance is achieved using a relative action formulation (Chi et al., 2024) and state-action canonicalization (Zeng et al., 2022; Wang et al., 2021; Zhu et al., 2022a; Jia et al., 2023). We summarize and prove this property.

Proposition C.3. The relative state-action formulation is T(3) (translation) invariant.

Proof. Let $S_{can} = (O - e_T,\, e_T - e_T,\, e_R,\, e_{grip})$ and $A_{can} = (A_T - e_T,\, A_R,\, A_{grip})$ denote the canonicalized state and action. Translating both the state S and the action A by $g \in T(3)$, we have:

$$g \cdot S_{can} = \big((O + g) - (e_T + g),\ (e_T + g) - (e_T + g),\ e_R,\ e_{grip}\big) = (O - e_T,\ e_T - e_T,\ e_R,\ e_{grip}) = S_{can},$$
$$g \cdot A_{can} = \big((A_T + g) - (e_T + g),\ A_R,\ A_{grip}\big) = (A_T - e_T,\ A_R,\ A_{grip}) = A_{can}.$$

Therefore $\pi(S_{can}) = \pi(g \cdot S_{can})$ and $g \cdot A_{can} = A_{can}$.

D. Hyperparameters

Hyperparameters for the diffusion-based baseline methods are listed in Table A2. SDP generally adopts Diffusion Policy's hyperparameters, except for the batch size, because SDP is heavier than the other baselines.

                             SDP     EquiDiff  EquiBot  DiffPo  DP3     DP3 paper
Batch Size                   32      64        64       64      128     128
Prediction Horizon           16      16        16       16      16      16
Action Horizon               8       8         8        8       8       8
Learning Rate                1e-4    1e-4      1e-4     1e-4    1e-4    1e-4
Epochs                       500     500       500      500     500     3000
Learning Rate Scheduler      cosine  cosine    cosine   cosine  cosine  cosine
Noise Scheduler              DDPM    DDPM      DDPM     DDPM    DDIM    DDIM
Diffusion Train/Test Steps   100     100       100      100     100/10  100/10
Encoded Scene Dimension      128     128       128      128     64      64

Table A2. Hyperparameters for baselines.

E. Architecture of the Point Cloud Encoder

The point cloud encoder is a 5-layer ResNet (He et al., 2016), consisting of EquiformerV2 (Liao et al., 2024) graph convolution layers in the hidden layers and an EquiformerV2 origin convolution layer in the last layer that aggregates all point features into a single point.

Figure A5. Overview of the point cloud encoder. Top: the point cloud encoder. Bottom: the details of each block in the encoder. "pts" stands for the number of points and "c" stands for the number of channels.
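For readers who want the layout of Figure A5 in code form, the following is a schematic stand-in only: it uses plain linear layers in place of the real EquiformerV2 graph convolutions (whose API we do not reproduce here), and a mean pool in place of the origin convolution, purely to illustrate the 5-stage residual structure and the per-point-to-single-point aggregation:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for one residual equivariant graph-convolution stage."""
    def __init__(self, c: int):
        super().__init__()
        self.conv = nn.Linear(c, c)  # placeholder for an EquiformerV2 graph conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (pts, c)
        return x + torch.relu(self.conv(x))  # ResNet-style skip connection

class PointCloudEncoder(nn.Module):
    """Five residual stages, then aggregation of all points into one feature."""
    def __init__(self, c: int = 128, n_blocks: int = 5):
        super().__init__()
        self.blocks = nn.Sequential(*(Block(c) for _ in range(n_blocks)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (pts, c) -> (c,)
        return self.blocks(x).mean(dim=0)  # placeholder for the origin convolution

# Example: 1024 points with 128 channels -> one 128-dim scene embedding.
z = PointCloudEncoder()(torch.randn(1024, 128))
assert z.shape == (128,)
```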