# Harmonizing Geometry and Uncertainty: Diffusion with Hyperspheres

Muskan Dosi, Chiranjeev Chiranjeev, Kartik Thakral, Mayank Vatsa, Richa Singh
Department of Computer Science and Engineering, Indian Institute of Technology Jodhpur, India
Correspondence to: Muskan Dosi. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract. Do contemporary diffusion models preserve the class geometry of hyperspherical data? Standard diffusion models rely on isotropic Gaussian noise in the forward process, inherently favoring Euclidean spaces. However, many real-world problems involve non-Euclidean distributions, such as hyperspherical manifolds, where class-specific patterns are governed by angular geometry within hypercones. When modeled in Euclidean space, these angular subtleties are lost, leading to suboptimal generative performance. To address this limitation, we introduce HyperSphereDiff to align hyperspherical structures with directional noise, preserving class geometry and effectively capturing angular uncertainty. We demonstrate both theoretically and empirically that this approach aligns the generative process with the intrinsic geometry of hyperspherical data, resulting in more accurate and geometry-aware generative models. We evaluate our framework on four object datasets and two face datasets, showing that incorporating angular uncertainty better preserves the underlying hyperspherical manifold. Resources are available at: Link.

1. Introduction

Diffusion models have revolutionized generative modeling, achieving remarkable success in diverse modalities, including the generation of images (Ho et al., 2020; Dhariwal & Nichol, 2021), audio (Kong et al., 2021), and 3D scenes and structures (Bautista et al., 2022; Shue et al., 2023). These models operate by progressively adding Gaussian noise to the data in a forward process, followed by a reverse process that learns to recover the original data from the corrupted versions (Sohl-Dickstein et al., 2015). The use of Gaussian
noise, coupled with its isotropic nature, simplifies both theoretical formulations and practical implementations, making it a default choice for most diffusion frameworks (Kingma et al., 2021). However, this assumption inherently biases the data toward Euclidean spaces (Song et al., 2021; Dhariwal & Nichol, 2021), limiting the model's ability to account for intrinsic geometric structures in non-Euclidean domains (Lui, 2012; Scott et al., 2021; De Bortoli et al., 2022), such as hyperspherical or manifold-constrained data (Bronstein et al., 2017; Rezende et al., 2020). In such settings, Gaussian-based diffusion may fail to fully capture the directional relationships and nuanced variations inherent to the data, as shown in Figure 1(a) (top row), where the forward process of adding Gaussian noise distorts the angular geometry of the classes during sampling. This motivates rethinking the noise distribution in diffusion processes. Beyond these geometric limitations, Gaussian noise also lacks the flexibility required to model varying uncertainty levels across different samples. Ambiguous or noisy data points appear across different timesteps in Figure 1(a) (top row); such points require a representation in which uncertainty is explicitly modeled. However, the isotropic assumption of Gaussian noise treats all directions equally, failing to capture this uncertainty effectively. Researchers have explored alternative noise distributions better suited to specific data geometries and uncertainty modeling (Xu et al., 2023). One promising direction is the von Mises-Fisher (vMF) distribution (Mardia & Jupp, 2000), which is naturally defined on hyperspheres and parameterized by a concentration parameter that controls directional uncertainty (Hasnat et al., 2017).
For effective handling of uncertainty using vMF noise in diffusion models, we aim to align the generative process with the underlying geometry of the data, thus enabling more accurate modeling of directional relationships. This approach builds on recent efforts to integrate geometric priors into generative models (Falorsi et al., 2018), expanding the scope of diffusion techniques to hyperspherical data.

1.1. Research Contributions

We propose HyperSphereDiff (Diffusion with HyperSpheres), leveraging the von Mises-Fisher (vMF) distribution to preserve hyperspherical class geometry by introducing directional uncertainty into the diffusion process. This is visually illustrated in Figure 1(a) (bottom row). Here, geometry precisely captures the angular relationships defining class boundaries, whereas uncertainty reflects the stochastic variation governed by the vMF concentration parameter. Figure 1(b) provides an empirical visualization demonstrating that vMF-driven diffusion preserves hyper-conical class geometry, while Gaussian-based diffusion distorts it. By embedding vMF-based noise into both the forward and reverse processes, the manifold-aware framework maintains global structure and local directional fidelity.

Figure 1. (a) Illustrating class geometry preservation in non-Euclidean hyperspherical spaces. The top row shows Gaussian diffusion, which fails to capture structural relations, whereas the bottom row demonstrates von Mises-Fisher (vMF)-based diffusion, effectively modeling directional uncertainty. Red and green arrows indicate the forward and backward diffusion processes, respectively. (b) Comparison of diffusion on the 3D sphere: Gaussian (top) vs. vMF (bottom). Gaussian distorts class boundaries, while vMF maintains the original geometry, preserving angular regions.
It ensures that directional uncertainty respects hyperspherical structures throughout the diffusion process. The key contributions of this research are:

1. Class-Specific Hypercone Formalism. We introduce a hypercone representation that preserves intra-class angular concentration on the hypersphere while enforcing inter-class separation. This formalism captures class structure by focusing on directional relationships rather than purely Euclidean distances.
2. vMF-Based Forward and Reverse Processes. Gaussian noise is replaced with vMF noise to maintain angular consistency. This ensures generated samples remain on the hypersphere, effectively guiding them toward class-specific hypercones during the reverse process.
3. Diverse and Hard Sample Generation. In contrast to Gaussian-based methods, the vMF-driven approach mitigates simplicity bias, producing a broader sample distribution. We introduce two metrics, Hypercone Coverage Ratio (HCR) and Hypercone Difficulty Skew (HDS), for assessing geometry preservation.
4. Theoretical Foundations and Empirical Validation. Along with detailed theoretical foundations, extensive experiments on four object datasets and two face datasets demonstrate improved alignment with hyperspherical geometry. We observe enhanced robustness to varying uncertainty levels and superior performance in class-conditional generation tasks.

1.2. Related Work

Denoising diffusion models can generate diverse unseen image samples by training on existing datasets (Ho et al., 2020; Song et al., 2021; Dhariwal & Nichol, 2021). First introduced by Sohl-Dickstein et al. (2015), they add Gaussian noise in a forward process and learn to reverse it for sample generation. Advances in score-based modeling (Song & Ermon, 2019; Pidstrigach, 2022), efficient sampling (Watson et al., 2022), and discrete diffusion (Austin et al., 2021) have further improved their capabilities.
However, their reliance on Gaussian noise limits them to Euclidean spaces, restricting their ability to model directional relationships in non-Euclidean domains (Bronstein et al., 2017). While manifold-aware and sphere-based extensions (Rezende et al., 2020) attempt to address this, they remain limited in handling anisotropic and angular uncertainties. The von Mises-Fisher (vMF) distribution, commonly used for hyperspherical data, has demonstrated effectiveness in face recognition (Hasnat et al., 2017; Deng et al., 2019), outlier detection and generation (Du et al., 2022; Ming et al., 2023; Du et al., 2023), and representation learning tasks (Davidson et al., 2018). It has also been employed in generative modeling through hyperspherical GANs (Davidson et al., 2018) and spherical VAEs (Falorsi et al., 2018). However, the integration of vMF noise into diffusion models remains relatively unexplored. Previous works exploring non-Gaussian noise in diffusion processes (Wang et al., 2022; Xu et al., 2023) lack a systematic approach to leveraging angular distributions like the vMF. In addition to Gaussian and vMF noise, alternative noise types such as Laplacian noise (Dwork et al., 2006), Student's t-noise (Matsubara & Imai, 2021), and blue noise (Huang et al., 2024) have been explored for robustness and outlier handling. Flow-based transformations have also been used in generative tasks (Kingma & Dhariwal, 2018), while domain-specific noises in molecular generation (Luo et al., 2021) focus on aligning noise with data properties. Our work adds to this growing body of literature by introducing a novel angular noise mechanism based on vMF distributions, enabling the diffusion process to model class-specific and directional uncertainty in hyperspherical domains. Progressive Distillation (Salimans & Ho, 2022) uses an angular DDIM update in Euclidean space but ignores hyperspherical constraints.
EDM (Karras et al., 2022) ensures variance preservation via dataset-dependent scaling but lacks directional or geometric alignment. In contrast, HyperSphereDiff operates directly on the hypersphere, preserving both variance and manifold consistency without extrinsic normalization.

2. Preliminary

In Gaussian-based diffusion models, the forward process is defined as a Markov chain that incrementally corrupts the data by introducing isotropic Gaussian noise. Let x_0 ~ p_data(x_0) represent the original data. The sequence of noisy samples x_t is generated according to:

$$x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1 - \alpha_t}\,\epsilon,$$

where α_t ∈ (0, 1] is a variance schedule controlling the relative weight of the signal and noise at time t, and ε ~ N(0, I) is isotropic Gaussian noise. The geometry of the corrupted data distribution at time step t is defined by the weighted combination of the original data x_0 and the noise component ε. The resulting distribution can be expressed as:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\alpha_t}\, x_0,\ (1 - \alpha_t) I\big),$$

which defines a trajectory in Euclidean space along which the corrupted data progressively transitions from the original distribution p_data(x_0) to a standard Gaussian distribution N(0, I) as t → T. The uncertainty in the forward process is introduced through the isotropic Gaussian noise ε ~ N(0, I). At each time step, the variance of the noise component, 1 − α_t, increases as α_t decreases, representing a controlled diffusion of information. This uncertainty is modeled symmetrically across all dimensions of the data, resulting in a spherically symmetric probability density function. Formally, the marginal distribution at time t can be expressed as:

$$q(x_t) = \int q(x_t \mid x_0)\, p_{\mathrm{data}}(x_0)\, dx_0,$$

which represents the corrupted data distribution as a Gaussian mixture in which each component is centered at \sqrt{\alpha_t}\, x_0 and has variance 1 − α_t. The reverse process is designed to approximate p(x_0 | x_t) and recover the original data by progressively denoising the sample x_t.
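The closed-form marginal above can be sampled in one shot. Below is a minimal NumPy sketch; the linear schedule for α_t and the function names are our assumptions for illustration (the paper does not fix a schedule here):

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_bar(t, T):
    """Assumed linear signal schedule: alpha ~ 1 at t=0, alpha ~ 0 at t=T."""
    return 1.0 - t / T

def forward_gaussian(x0, t, T, rng):
    """One-shot sample from q(x_t | x_0) = N(sqrt(alpha) x0, (1 - alpha) I)."""
    a = alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

x0 = rng.standard_normal(8)
x_late = forward_gaussian(x0, t=1000, T=1000, rng=rng)  # pure noise at t = T
x_early = forward_gaussian(x0, t=1, T=1000, rng=rng)    # nearly the clean signal
```

At t = 0 the sample equals x_0 exactly, and at t = T the signal term vanishes, matching the transition from p_data to N(0, I) described above.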
3. Revisiting Hyperspherical Data Geometry

The use of Gaussian noise simplifies theoretical analysis and computational implementation by assuming the data resides in the flat, unbounded Euclidean space R^d. However, this approach is fundamentally sub-optimal when dealing with non-Euclidean data geometries, such as hyperspheres S^{d−1} or other manifolds.

Theorem 3.1 (Gaussian Noise and Spherical Structure). Let z ~ N(0, I) be an isotropic Gaussian in R^d. Then:
(a) the radial density follows P(‖z‖ = r) ∝ r^{d−1} e^{−r²/2};
(b) as d → ∞, ‖z‖/√d → 1 in probability;
(c) for data x ∈ S^{d−1}, the noised vector x + σz does not preserve angular relationships as σ → ∞.

Proof. (a) follows from the Jacobian of spherical coordinates; (b) by the law of large numbers, ‖z‖²/d → 1 in probability; (c) as σ → ∞, the contribution of x becomes negligible.

Gaussian noise works well in flat, unbounded spaces due to its isotropic nature. However, many real-world datasets reside on curved spaces, such as hyperspheres or other Riemannian manifolds. In these cases, Gaussian noise disrupts the intrinsic geometry by introducing perturbations that extend beyond the manifold (Bronstein et al., 2017; Huang et al., 2022) (detailed proof in Appendix A). This misalignment motivates the exploration of alternative noise processes designed for data with non-Euclidean geometries. Facial data embeddings, often derived through deep neural networks, are typically normalized to lie on a hypersphere S^{d−1} ⊂ R^d, reflecting their natural angular variability due to factors like pose, expression, and illumination changes (Majumdar et al., 2017; Wang et al., 2018; Deng et al., 2019). This normalization ensures that the similarity between two embeddings is determined by their angular relationship rather than their magnitude, aligning with the cosine similarity metric frequently used in recognition tasks.
For two embeddings e_1, e_2 ∈ S^{d−1}, their similarity can be expressed as cos(θ) = e_1^⊤ e_2, where θ = arccos(e_1^⊤ e_2) is the geodesic distance on the hypersphere. This structure inherently aligns facial embeddings with hyperspherical geometry, where angular deviations are the primary measure of variability. In high-dimensional hyperspherical spaces, facial classes can be modeled as distinct regions, often visualized as hypercones emanating from the origin. Each class C_k ⊂ S^{d−1} is centered around a mean direction vector μ_k ∈ S^{d−1}, with intra-class variability characterized by angular deviations. The vMF distribution provides a natural framework for modeling such hypercones through its concentration parameter κ_k, which controls the spread of the distribution around μ_k. The probability density function is given as:

$$f(x; \mu, \kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)} \exp(\kappa\, \mu^\top x),$$

where I_{d/2−1}(κ) is the modified Bessel function of the first kind. To formalize this relationship, we introduce the following lemma.

Lemma 3.2 (vMF Hypercone Representation). Let C_k ⊂ S^{d−1} be a class hypercone centered at μ_k with angular radius θ_k, and let the data follow a vMF distribution with concentration κ_k. Then:
(a) the probability mass within the class hypercone satisfies

$$P(x \in C_k) = P\big(\angle(x, \mu_k) \le \theta_k\big) \ge 1 - \exp\big(-\kappa_k (1 - \cos\theta_k)\big);$$

(b) for any desired coverage probability 1 − ε with ε ∈ (0, 1), choosing

$$\kappa_k \ge \frac{1}{1 - \cos\theta_k} \log\frac{1}{\epsilon}$$

ensures that P(x ∈ C_k) ≥ 1 − ε.

Note: as θ_k → 0, the required concentration κ_k → ∞, reflecting the limiting case of a single-direction hypercone. On the hypersphere S^{d−1}, classes naturally align within hypercones defined by angular constraints, and the vMF distribution inherently models these structures. Such distributions also relate to class separation on hyperspheres (refer to Appendix B). Unlike Gaussian noise, vMF noise respects the manifold's geometry by modulating its focus through the concentration parameter κ.
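The bound in Lemma 3.2 is easy to evaluate numerically. The small sketch below (helper names are ours) computes the κ lower bound from part (b) and checks it against the coverage bound of part (a):

```python
import numpy as np

def coverage_lower_bound(kappa, theta):
    """Lemma 3.2(a): P(angle <= theta) >= 1 - exp(-kappa (1 - cos theta))."""
    return 1.0 - np.exp(-kappa * (1.0 - np.cos(theta)))

def kappa_for_coverage(theta, eps):
    """Lemma 3.2(b): smallest kappa guaranteeing coverage 1 - eps."""
    return np.log(1.0 / eps) / (1.0 - np.cos(theta))

theta, eps = np.deg2rad(20.0), 0.05
k = kappa_for_coverage(theta, eps)  # plugging k back in yields exactly 1 - eps
```

Narrower hypercones demand larger κ, consistent with the note above that κ_k → ∞ as θ_k → 0.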
When κ = 0, the vMF distribution is isotropic, uniformly spanning the hypersphere, while larger κ values focus the distribution within a hypercone around the mean direction μ. This adaptability enables vMF-based diffusion to effectively learn and represent class-wise structures, ensuring a geometrically consistent generative modeling framework.

4. Modeling Angular Uncertainty in Diffusion

To effectively model the uncertainty inherent in hyperspherical data, we employ a forward diffusion process driven by noise sampled from the vMF distribution. The forward process introduces angular uncertainty such that the data progressively transitions from structured, class-specific noise to a uniform distribution over the hypersphere S^{d−1}. Initially, the noise is highly concentrated (κ is large), preserving the class structure within narrow hypercones. As the process advances, κ decreases to zero, injecting isotropic noise, and the data diffuses uniformly over the hypersphere.

4.1. Angular Noise Injection

In the proposed HyperSphereDiff, the forward process adds noise to the data representation using angular interpolation. At each time step t, the data representation obtained in a latent space through an encoder is denoted z_t and is updated as

$$z_t = \cos(\theta_t)\, z_{t-1} + \sin(\theta_t)\, v,$$

where z_{t−1} ∈ S^{d−1} is the data representation at the previous time step, v is a unit vector sampled uniformly from S^{d−1}, and θ_t is a time-dependent angle that increases monotonically from 0 to π/2. The angular interpolation ensures that the injected noise aligns with the hyperspherical geometry. The parameter θ_t acts as a scheduler, gradually increasing to control the extent of deviation from the previous representation. Correspondingly, this forward step mimics the vMF distribution with κ_t as a scheduler: the decrease in κ_t defines the level of angular uncertainty introduced at each step.
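A minimal sketch of the angular forward step above, assuming a linear θ_t schedule on [0, π/2] and uniform noise directions (the function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def project(z):
    """Pi: project back onto the unit hypersphere."""
    return z / np.linalg.norm(z)

def sample_uniform_sphere(d, rng):
    """Uniform direction on S^{d-1} via a normalized Gaussian draw."""
    return project(rng.standard_normal(d))

def forward_angular_step(z_prev, theta_t, rng):
    """z_t = cos(theta_t) z_{t-1} + sin(theta_t) v, re-projected to the sphere."""
    v = sample_uniform_sphere(z_prev.shape[0], rng)
    return project(np.cos(theta_t) * z_prev + np.sin(theta_t) * v)

d, T = 16, 50
z = sample_uniform_sphere(d, rng)
thetas = np.linspace(0.0, np.pi / 2, T)  # assumed monotone schedule
for theta_t in thetas:
    z = forward_angular_step(z, theta_t, rng)
```

Every intermediate state stays on S^{d−1} by construction, unlike the Gaussian forward process of Section 2.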
By employing κ as a scheduler, our forward diffusion process progressively incorporates angular uncertainty, transitioning from class-structured perturbations to isotropic noise (Figure 1). This ensures that the initial steps of the corruption process retain class information, as larger κ values concentrate the data around class-specific hypercones. Over time, as κ → 0, the noise diffuses the data toward a uniform distribution on the hypersphere, reflecting maximum uncertainty. This process inherently respects the geometric properties of hyperspherical data, with angular deviations tied to the data distribution rather than being purely random.

4.2. Backward Step: Hypersphere to Hypercone

In generative modeling with hyperspherical data, the reverse process of HyperSphereDiff transforms noisy samples into structured data aligned with class distributions. This process progressively refines noisy points toward class-specific hypercones using score-based methods. The reverse formulation leverages vMF sampling, ensuring that angular relationships between classes are retained. The reverse process is defined as:

$$z_{t-1} \sim \mathrm{vMF}\big(\Pi(z_t + \eta_t\, \nabla_{z_t} \log f(z_t; \mu_t, \kappa_t)),\ \kappa_t\big).$$

The complete algorithm (Algorithm 1) and the experiments are based on this reverse step and the MSE loss as defined in standard Gaussian diffusion. Alternatively, the reverse step with angular updates using vMF-based stochastic denoising is defined as:

$$z_{t-1} = \Pi\left(\cos(\theta_t)\, z_t + \sin(\theta_t)\, \frac{\nabla_{z_t} \log f(z_t; \mu_c)}{\|\nabla_{z_t} \log f(z_t; \mu_c)\|}\right).$$

To further align optimization with hyperspherical geometry, the reverse denoising step can be refined using an angular loss function in place of the MSE loss.

Cosine Loss: encourages angular alignment between the score function and the noise direction.

$$\mathcal{L}_c = 1 - \mathbb{E}\left[\frac{\langle \nabla_{z_t} \log f(z_t; \mu_c),\ \epsilon_t \rangle}{\|\nabla_{z_t} \log f(z_t; \mu_c)\|\, \|\epsilon_t\|}\right]$$

Geodesic Loss: penalizes angular deviations.

$$\mathcal{L}_g = \mathbb{E}\left[\arccos^2\left(\frac{\langle \nabla_{z_t} \log f(z_t; \mu_c),\ \epsilon_t \rangle}{\|\nabla_{z_t} \log f(z_t; \mu_c)\|\, \|\epsilon_t\|}\right)\right]$$

Here, z_t is the current noisy sample, f(·) is the vMF distribution modeling class hypercones, and ∇_{z_t} log f(·) is the score function providing gradient information. Interpolation using cos(θ_t) and sin(θ_t) maintains angular relationships, while the normalized score function preserves directional consistency (refer to Appendix C). A comparative analysis of the various loss functions and angular denoising is provided in Appendix H.2. The process evolves from isotropic noise to class-specific structures through the interplay of two key components. The score function ∇_{z_t} log f(z_t; μ_t, κ_t) points toward the class means μ_c, with the step size η_t controlling the update strength. Simultaneously, vMF sampling adds controlled stochasticity, ensuring updates remain consistent with hyperspherical geometry while progressively removing noise. The concentration parameter κ_t plays a crucial role in shaping this evolution. It starts small for isotropic diffusion and then grows during the reverse process, increasingly favoring class-specific directions. As κ_t increases, the vMF sampling becomes more concentrated around the class means, reflecting growing certainty in class membership while maintaining geometric consistency. The combination of score-based updates and vMF sampling ensures a smooth transition from noise to structured data. The process gradually refines points until they converge near their respective class means μ_c for a class c, effectively recovering true class representations while preserving the underlying hyperspherical geometry. This mechanism provides a mathematically principled approach to generating class-consistent samples on the hypersphere, maintaining both local structure and global class relationships throughout the diffusion process.
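The cosine and geodesic losses follow directly from their definitions. In the sketch below, `score` stands in for the network output ∇_{z_t} log f and `noise` for ε_t (the function names are ours):

```python
import numpy as np

def _cosine(a, b, eps=1e-8):
    """Batched cosine similarity along the last axis."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def cosine_loss(score, noise):
    """L_c = 1 - E[cos(score, noise)]: rewards angular alignment."""
    return float(1.0 - np.mean(_cosine(score, noise)))

def geodesic_loss(score, noise):
    """L_g = E[arccos^2(cos(score, noise))]: penalizes angular deviation."""
    c = np.clip(_cosine(score, noise), -1.0, 1.0)
    return float(np.mean(np.arccos(c) ** 2))
```

Both losses are scale-invariant in their arguments, so they depend only on direction, which is exactly the property the angular denoising step needs.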
Algorithm 1 HyperSphereDiff Training: vMF Diffusion with Hypercone Preservation
Require: data samples {x_i}, class labels {y_i}, diffusion steps T
Require: angular schedule {θ_t}_{t=1}^T, learning rate η
1: Initialize score network parameters θ
2: while not converged do
3:   Sample a batch (x, y) from the dataset
4:   Sample a time step t ~ Uniform(1, T)
5:   κ_t ← cot(θ_t)  ▷ set concentration
6:   Sample v ~ Uniform(S^{d−1})
7:   z_t ← Π(cos(θ_t) x + sin(θ_t) v)  ▷ forward process; equivalently z_t ~ vMF(x, κ_t)
8:   ∇_{z_t} log f ← ScoreNet_θ(z_t, t, y)
9:   L ← ‖∇_{z_t} log f − ∇_{z_t} log p(z_t | x)‖²
10:  Update θ using the gradient of L
11: end while
12: return trained score network parameters θ

4.3. Adaptive Class-Dependent Concentration

We consider data z_0 on the unit hypersphere and a forward diffusion process as before with vMF transitions. In the reverse process p_θ(z_{0:T}) = p_θ(z_0 | z_1, y) ∏_{t=1}^T p_θ(z_{t−1} | z_t, y), we guide samples into learned truncated regions on the hypersphere. We employ two networks: a direction predictor D_ϕ that estimates class-specific directions m_t = D_ϕ(z_t, t, y), and an angle predictor C_ψ that estimates class angular radii θ_y = C_ψ(z_t, t, y). These networks learn to form dynamic hypercones C_{t,y} = {z : ∠(z, m_t) ≤ θ_y} that capture class-specific regions at each step.
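A toy sketch of one training iteration of Algorithm 1. The stand-in regression target (a tangential pull toward the clean sample x) and the closure used as a score network are our assumptions for illustration only; the paper's actual target ∇_{z_t} log p(z_t | x) and architecture are given in Appendix H:

```python
import numpy as np

rng = np.random.default_rng(0)

def project(z):
    """Row-wise projection onto the unit hypersphere."""
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def training_step(x, theta_t, score_net, rng):
    """One Algorithm 1 iteration: corrupt x angularly, regress the score."""
    v = project(rng.standard_normal(x.shape))
    z_t = project(np.cos(theta_t) * x + np.sin(theta_t) * v)
    # Assumed surrogate target: the component of x tangential to z_t.
    target = x - np.sum(x * z_t, axis=-1, keepdims=True) * z_t
    pred = score_net(z_t)
    return np.mean((pred - target) ** 2)

x = project(rng.standard_normal((4, 8)))
zero_net = lambda z: np.zeros_like(z)  # untrained stand-in network
loss = training_step(x, theta_t=0.3, score_net=zero_net, rng=rng)
```

At θ_t = 0 the corrupted sample equals x and the surrogate target vanishes, so the loss for the zero network is zero, which is a quick sanity check on the step.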
The reverse process uses a truncated vMF:

$$p_\theta(z_0 \mid z_1, y) = \mathrm{TvMF}\big(z_0;\ m_\theta(z_1, y),\ \kappa_\theta(z_1, y),\ C_{t,y}\big),$$

$$\mathrm{TvMF}(z; \mu, \kappa, C) = \frac{1}{Z(\kappa, C)} \exp(\kappa\, \mu^\top z)\, \mathbb{1}\{z \in C\}.$$

We maintain adaptive concentration through

$$\kappa_\theta(z_t, y) = \kappa_{\max}\, \sigma\big(\beta\, [\theta_y - \angle(z_t, m_t)]\big),$$

where β is a scaling factor, κ_max is the maximum allowed concentration, and σ(·) is the sigmoid function, ensuring stronger concentration within predicted regions. The normalization constant Z(κ, C) for the truncated vMF can be computed as

$$Z(\kappa, C) = \int_C \exp(\kappa\, \mu^\top z)\, dS^{d-1}(z),$$

where dS^{d−1} is the hyperspherical measure. This integral has a closed form in terms of the incomplete gamma function when C is a hypercone. The networks D_ϕ and C_ψ are trained jointly with the score network to minimize the objective

$$\mathcal{L}(\phi, \psi) = \mathbb{E}_{t,y}\left[\|m_t - \hat{m}_y\|^2 + \lambda\, (\theta_y - \hat{\theta}_y)^2\right],$$

where m̂_y and θ̂_y are empirical class statistics computed from the training data, and λ balances the direction and angle losses. This ensures that the predicted geometry aligns with the true class structure while maintaining the flexibility of learned truncation regions. The combination of learned directional prediction, adaptive concentration, and explicit truncation provides a powerful mechanism for class-conditional generation.

Algorithm 2 HyperSphereDiff Testing: Sampling from vMF Diffusion with Class Guidance
Require: class label y, diffusion steps T, trained score network θ
Require: angular schedule {θ_t}_{t=1}^T, step sizes {η_t}_{t=1}^T
1: Sample z_T ~ Uniform(S^{d−1})
2: for t = T to 1 do
3:   κ_t ← cot(θ_t)  ▷ get concentration
4:   if t > 1 then
5:     ∇_{z_t} log f ← ScoreNet_θ(z_t, t, y)
6:     m_t ← z_t + η_t ∇_{z_t} log f
7:     m_t ← m_t / ‖m_t‖  ▷ project to the hypersphere
8:     z_{t−1} ~ vMF(m_t, κ_t)
9:   end if
10: end for
11: return final sample z_1
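The adaptive concentration rule κ_θ(z_t, y) = κ_max σ(β[θ_y − ∠(z_t, m_t)]) can be sketched as follows (the function names and default constants are ours):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def adaptive_kappa(z_t, m_t, theta_y, kappa_max=100.0, beta=10.0):
    """kappa_theta = kappa_max * sigmoid(beta * (theta_y - angle(z_t, m_t)))."""
    ang = np.arccos(np.clip(np.dot(z_t, m_t), -1.0, 1.0))
    return kappa_max * sigmoid(beta * (theta_y - ang))
```

Samples inside the predicted hypercone (angle below θ_y) receive κ near κ_max and are pulled strongly toward m_t, while samples far outside receive κ near zero, which is exactly the "stronger concentration within predicted regions" behavior described above.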
At each step t, the process maintains a dynamic balance between score-based updates and geometric constraints:

$$\mathbb{E}[z_{t-1} \mid z_t, y] = m_t + \eta_t\, \nabla_{z_t} \log f(z_t; \mu_t, \kappa_t)\, \mathbb{1}\{z_t \in C_{t,y}\}.$$

This ensures samples remain within class-appropriate regions while benefiting from the score function's gradient information. Detailed proofs of the forward and reverse modeling are given in Appendix D.

Table 1. Performance comparison of Gaussian and HyperSphereDiff (vMF) across six datasets using FID (lower is better), HCR (lower is better), and HDS (lower values indicate harder samples). The vMF model demonstrates superior capability in generating challenging samples, with better FID and HDS scores.

| Dataset | FID (Gaussian) | FID (vMF) | HCR (Gaussian) | HCR (vMF) | HDS (Gaussian) | HDS (vMF) |
|---|---|---|---|---|---|---|
| MNIST | 1.95 | 1.86 | 0.17 | 0.14 | 0.76 | 0.52 |
| CIFAR-10 | 3.45 | 3.52 | 0.23 | 0.20 | 0.72 | 0.48 |
| CUB-200 | 8.47 | 9.11 | 0.19 | 0.13 | 0.81 | 0.41 |
| Cars-196 | 9.09 | 7.87 | 0.19 | 0.17 | 0.85 | 0.60 |
| CelebA | 9.31 | 9.29 | 0.42 | 0.22 | 0.77 | 0.59 |
| D-LORD | 11.38 | 9.27 | 0.46 | 0.21 | 0.91 | 0.62 |

4.4. Beyond Geometry and Uncertainty: Retaining Image Information

In our face generation process, we introduce two forms of uncertainty to capture both the magnitude and directional components of the data representation, motivated by (Chiranjeev et al., 2024). Specifically, we interpret an image embedding x as having a magnitude ‖x‖ (which relates to overall intensity or scale) and a direction x/‖x‖ ∈ S^{d−1} (which captures structural and class-related geometry). We employ a Gaussian distribution to model the magnitude uncertainty and a vMF distribution to model the directional uncertainty. This hybrid approach preserves crucial class geometry on the hypersphere while still accommodating image-specific variability (e.g., changes in lighting or intensity) through Gaussian noise. The detailed forward and reverse processes are provided in Appendix F.
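The hybrid magnitude/direction corruption of Section 4.4 can be sketched as follows. Decomposing x into ‖x‖ and x/‖x‖, adding Gaussian noise to the former and an angular perturbation to the latter, is our reading of the description above; the exact forward process is in Appendix F:

```python
import numpy as np

rng = np.random.default_rng(0)

def hybrid_forward(x, theta_t, sigma_t, rng):
    """Split x into magnitude and direction, perturb each, then recombine:
    Gaussian noise on the magnitude, angular (vMF-like) noise on the direction."""
    r = np.linalg.norm(x)
    u = x / r
    v = rng.standard_normal(x.shape)
    v /= np.linalg.norm(v)
    u_t = np.cos(theta_t) * u + np.sin(theta_t) * v
    u_t /= np.linalg.norm(u_t)                       # stay on S^{d-1}
    r_t = r + sigma_t * rng.standard_normal()        # magnitude uncertainty
    return r_t * u_t

x = rng.standard_normal(8) * 3.0
x_t = hybrid_forward(x, theta_t=0.2, sigma_t=0.1, rng=rng)
```

With θ_t = 0 and σ_t = 0 the step is the identity, so class geometry (direction) and intensity (magnitude) degrade independently as their schedules open up.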
5. Experiments

We evaluate HyperSphereDiff on four object datasets (CIFAR-10 (Krizhevsky et al., 2009), MNIST (Deng, 2012), CUB-200 (Wah et al., 2011), and Cars-196 (Krause et al., 2013)) and two face datasets (CelebA (Liu et al., 2015) and D-LORD (Manchanda et al., 2023)). The architecture and training details are provided in Appendix H.

Retaining Geometry: We define two metrics to evaluate whether the generated samples preserve the class structure and maintain distributional properties of the original data.

Hypercone Coverage Ratio (HCR): The HCR quantifies the fraction of generated samples falling outside the class distribution's hypercone. For each class k, we compute

$$\mathrm{HCR}_k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\cos^{-1}(z_i^\top \mu_k) > \theta_k^{\max}\big].$$

A lower HCR indicates better preservation of the class structure, while a higher HCR suggests out-of-class or unrealistic samples.

Hypercone Difficulty Skew (HDS): The HDS measures whether the model generates only easy samples by analyzing how they are distributed across sub-cones of increasing angular deviation. A high HDS indicates the model generates easy samples, while a low HDS suggests a balanced distribution across difficulty levels (refer to Appendix H.1). For each class k, we compute the fraction of samples in each sub-cone m,

$$p_m = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\theta_{m-1} < \cos^{-1}(z_i^\top \mu_k) \le \theta_m\big],$$

and aggregate these fractions into the HDS. The HCR and HDS metrics offer a thorough assessment of generative models in angular space.

Figure 2. Generated samples of birds (left), faces (middle), and cars (right) using the proposed vMF-based diffusion model trained on the CUB-200, CelebA, and Cars-196 datasets, showcasing the preservation of key attributes and diversity.

Figure 3. Real-world surveillance samples generated with class-dependent adaptive concentration κ by the proposed diffusion model after training on the D-LORD dataset; as κ decreases, samples spread away from the class mean μ.
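Both metrics reduce to angle histograms against the class mean. The sketch below uses our function names; HDS is reported here as per-sub-cone fractions rather than the paper's scalar summary:

```python
import numpy as np

def hcr(z, mu, theta_max):
    """Hypercone Coverage Ratio: fraction of samples outside the class hypercone."""
    ang = np.arccos(np.clip(z @ mu, -1.0, 1.0))
    return float(np.mean(ang > theta_max))

def hds_fractions(z, mu, edges):
    """Per-sub-cone occupancy fractions p_m over angular bins given by `edges`
    (the paper's HDS aggregates these into one scalar)."""
    ang = np.arccos(np.clip(z @ mu, -1.0, 1.0))
    counts, _ = np.histogram(ang, bins=edges)
    return counts / max(len(ang), 1)
```

A model that only produces easy samples piles its mass into the innermost bins of `hds_fractions`, while a well-spread model covers the whole range up to θ_k^max.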
HCR measures class consistency and assesses how well intra-class variations are preserved, while HDS identifies whether the model favors generating easier samples. An ideal generative model should maintain a low HCR while achieving an HDS that aligns with the natural difficulty distribution of the original data. Table 1 compares the Gaussian and vMF models in terms of HCR and HDS on six datasets. The Hypercone Coverage Ratio (HCR) consistently decreases for the vMF model compared to the Gaussian model, indicating that the vMF model generates samples that follow the original class structure, thus preserving angular relations. Similarly, the Hypercone Difficulty Skew (HDS) of the vMF model is around 0.5 on all six datasets, reflecting its ability to generate generalized samples across the hypercone rather than only simple samples. Gaussian-based generated samples have higher HDS values, showing that a larger proportion of samples are generated from the innermost hypercone, demonstrating simplicity bias. For datasets like D-LORD and CUB-200, the reduction in HCR and HDS is particularly pronounced, showcasing the efficiency of the vMF model in modeling angular variation and producing diverse samples.

Generation Quality: We evaluate the quality of generated samples across six diverse datasets, spanning natural images (CelebA, CIFAR-10), fine-grained object categories (CUB-200, Cars-196), and structured digits (MNIST). Figure 2 presents examples of generated birds, faces, and cars from models trained on CUB-200, CelebA, and Cars-196. The images exhibit realistic structural coherence and diversity, resembling their respective distributions. We also compute the Fréchet Inception Distance (FID) (Bynagari, 2019), achieving competitive scores across all datasets (Table 1).
In particular, CelebA, D-LORD, and Cars-196 show high performance, indicating that the model captures both global structure and fine-grained details. This demonstrates that HyperSphereDiff generates high-quality samples with significant variation while preserving class structure, aligning with our theoretical insights (see Appendix H).

Hypercone Generation: We evaluate the hypercone constraint by analyzing generated samples from both inner and outer hypercones, shown in Appendix H. Samples in the inner hypercone retain class-consistent features, while those in the outer hypercone exhibit noisy generations, validating the effectiveness of our metric in capturing generation quality. Furthermore, for the real-world D-LORD (Manchanda et al., 2023) dataset, we sample images at varying κ values, shown in Figure 3, concentrating the samples around the mean vector to reflect distance-wise variations in the surveillance imagery. These experiments highlight how diffusion with hypercone constraints ensures both intra-class consistency and controlled diversity.

Figure 4. (a) Comparison of samples generated using Gaussian and vMF diffusion models, highlighting improved variation in pose, illumination, expression, and quality with vMF. (b) Scatter plot with fitted regression showing cosine similarity to the class mean; HyperSphereDiff generates more challenging samples with a broader similarity distribution than Gaussian diffusion.

Figure 5. Gaussian-based diffusion generates samples tightly clustered near the class mean (high similarity, low variance), favoring easier cases. In contrast, HyperSphereDiff produces a more diverse spread (lower similarity, higher variance), ensuring better coverage across difficulty levels.
Face Recognition Results: We evaluate the performance of our proposed method on the real-world D-LORD (Manchanda et al., 2023) surveillance dataset, which consists of sequential frame images exhibiting angular relationships in the feature space. We generated samples of 1000 synthetic subjects using both Gaussian-based diffusion and our vMF-based hyperspherical diffusion. Training the ArcFace model on the generated data leads to a 5.0% improvement in identification accuracy on the real test set with HyperSphereDiff compared to the Gaussian model. This result highlights the ability of our approach to generate data that better aligns with the angular structure of real-world surveillance images, effectively capturing the underlying geometry.

Generating Hard Samples: We observed that the vMF-based approach consistently produced samples with varied cosine similarity to the class mean, indicating a higher degree of generalized sample generation compared to the Gaussian baseline, as shown in Figure 4. This is also supported by qualitative results, where vMF samples exhibit greater variation in pose, illumination, and expression, closely resembling real-world challenges. Quantitatively, this observation is validated by the HDS metric (Table 1), which measures the proportion of samples concentrated in the smaller hypercones. The vMF model achieved lower HDS values, confirming its effectiveness in generating harder, more diverse samples that better represent challenging real-world scenarios. Further, Figure 5 shows that HyperSphereDiff yields broader coverage (mean cosine similarity = 0.722, std = 0.137), reflecting balanced difficulty representation compared to Gaussian diffusion (mean = 0.900, std = 0.048).

Table 2. FID scores for different noise configurations on the CelebA and D-LORD datasets. The hybrid Gaussian + Spherical method achieves the best performance.

| Dataset | Gaussian | Gaussian + Spherical | Spherical |
|---|---|---|---|
| Celeb-A | 9.31 | 9.29 | 9.31 |
| D-LORD | 11.38 | 9.27 | 10.94 |
Ablation Study: Table 2 presents a comparison of FID scores on the Celeb-A and D-LORD datasets under three noise strategies: Gaussian, Spherical (HyperSphereDiff), and a hybrid of both. The hybrid model yields the lowest FID scores across both datasets (9.29 for Celeb-A and 9.27 for D-LORD), suggesting that combining magnitude-based Gaussian noise with direction-aware spherical noise improves generation quality. Purely Gaussian or purely spherical models alone are less effective, particularly for D-LORD, where Gaussian diffusion performs poorly (FID = 11.38) and the spherical-only model underperforms the hybrid. This demonstrates the advantage of modeling dual uncertainty, capturing both intensity variation and directional structure to enhance realism and class consistency.

Figure 6. Feature representations of the 10-class MNIST dataset generated using Gaussian-based diffusion (left) and vMF-based diffusion (right). vMF-based sampling aligns generated sample features within class-specific 3D hypercones, while Gaussian-based sampling results in scattered features outside the class hypercones.

Figure 7. Comparison between interpolation using Gaussian-based diffusion and vMF-based diffusion for generating images between two variants of the same subject: (a) expression, (b) pose.

Feature Representation: Figure 6 illustrates the feature representations of conditional samples generated from the 10 classes of the MNIST dataset. The figure highlights that vMF-based reverse sampling effectively converges samples within class-specific hypercones, capturing the angular geometry of the data. In contrast, Gaussian-based reverse sampling produces samples that converge within Euclidean space, failing to adhere to the hyperspherical structure.
Interpolation Results: Figure 7 shows that interpolations using Gaussian-based diffusion (top row) often produce unnatural or inconsistent transitions, especially under large attribute shifts. In contrast, the angular interpolation of our HyperSphereDiff on the hypersphere (bottom row) yields smoother, identity-preserving transitions across poses and expressions.

6. Conclusion and Future Work

In this work, we have challenged the ubiquitous Gaussian assumption in diffusion models by introducing HyperSphereDiff, a novel hyperspherical framework that leverages vMF distributions for generative modeling on angular manifolds. By incorporating hyperspherical geometry and class-dependent uncertainty, our approach preserves angular structure while producing diverse, semantically rich samples. Extensive experiments on facial and complex object datasets demonstrate its effectiveness in fine-grained tasks where angular relationships are critical. This work opens new avenues for manifold-constrained generative modeling, advancing geometry-aware diffusion techniques. Future research will focus on developing adaptive κ-based schedulers, adopting hierarchical hypercone partitioning for finer class variations, and extending the framework to conditional generation tasks, such as pose-invariant face synthesis.

Impact Statement

This work introduces a novel hyperspherical diffusion framework leveraging von Mises-Fisher (vMF) distributions to enhance the modeling of high-dimensional angular data in machine learning. By improving the fidelity and interpretability of generative models, the proposed method has applications in computer vision, fine-grained classification, and surveillance. However, like other generative techniques, it carries potential ethical risks, including misuse in deepfakes, privacy concerns, and bias amplification.
To mitigate these risks, we emphasize responsible use and transparency, particularly in sensitive domains (Mittal et al., 2024). Overall, this research advances generative modeling for hyperspherical data while promoting a deeper understanding of geometric structures in machine learning.

Acknowledgements

The authors acknowledge the support of IndiaAI and Meta through the Srijan: Centre of Excellence for Generative AI.

References

Austin, J., Johnson, D., Ho, J., Tarlow, D., and van den Berg, R. Structured denoising diffusion models in discrete state-spaces. In Advances in Neural Information Processing Systems, volume 34, pp. 17981-17993, 2021.

Bautista, M. A., Guo, P., Abnar, S., Talbott, W., Toshev, A., Chen, Z., Dinh, L., Zhai, S., Goh, H., Ulbricht, D., et al. GAUDI: A neural architect for immersive 3D scene generation. In Advances in Neural Information Processing Systems, volume 35, pp. 25102-25116, 2022.

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18-42, 2017.

Bynagari, N. B. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Asian Journal of Applied Science and Engineering, 8:25-34, 2019.

Chiranjeev, C., Dosi, M., Thakral, K., Vatsa, M., and Singh, R. HyperSpaceX: Radial and angular exploration of hyperspherical dimensions. In European Conference on Computer Vision, pp. 1-17. Springer, 2024.

Davidson, T., Falorsi, L., and De Cao, N. Hyperspherical variational auto-encoders. In Advances in Neural Information Processing Systems, 2018.

De Bortoli, V., Mathieu, E., Hutchinson, M., Thornton, J., Teh, Y. W., and Doucet, A. Riemannian score-based generative modelling. In Advances in Neural Information Processing Systems, volume 35, pp. 2406-2422, 2022.

Deng, J., Guo, J., Xue, N., and Zafeiriou, S. ArcFace: Additive angular margin loss for deep face recognition.
In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 4690-4699, 2019.

Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141-142, 2012.

Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, volume 34, pp. 8780-8794, 2021.

Du, X., Gozum, G., Ming, Y., and Li, Y. SIREN: Shaping representations for detecting out-of-distribution objects. In Advances in Neural Information Processing Systems, volume 35, pp. 20434-20449, 2022.

Du, X., Sun, Y., Zhu, J., and Li, Y. Dream the impossible: Outlier imagination with diffusion models. In Advances in Neural Information Processing Systems, volume 36, pp. 60878-60901, 2023.

Duta, I. C., Liu, L., Zhu, F., and Shao, L. Improved residual networks for image and video recognition. In International Conference on Pattern Recognition, pp. 9415-9422, 2021.

Dwork, C. et al. Calibrating noise to sensitivity in private data analysis. Journal of Privacy and Confidentiality, 2006.

Falorsi, L., De Cao, N., Liò, P., et al. Explorations in geometry-aware neural networks for learning hyperspherical embeddings. In International Conference on Learning Representations, 2018.

Hasnat, M. A., Bohné, J., Milgram, J., Gentric, S., and Chen, L. von Mises-Fisher mixture model-based deep learning: Application to face verification. arXiv preprint arXiv:1706.04264, 2017.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pp. 6840-6851, 2020.

Huang, W., Ren, Y., and Xu, Y. Riemannian diffusion models. arXiv preprint arXiv:2202.02763, 2022.

Huang, X., Salaun, C., Vasconcelos, C., Theobalt, C., Oztireli, C., and Singh, G. Blue noise for diffusion models. In ACM SIGGRAPH Conference, pp. 1-11, 2024.

Karras, T., Aittala, M., Aila, T., and Laine, S.
Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, volume 35, pp. 26565-26577, 2022.

Kingma, D. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, volume 31, 2018.

Kingma, D., Salimans, T., Ho, J., and Chen, X. Variational diffusion models. In Advances in Neural Information Processing Systems, volume 34, pp. 21696-21707, 2021.

Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. DiffWave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021.

Krause, J., Stark, M., Deng, J., and Fei-Fei, L. 3D object representations for fine-grained categorization. In Proceedings of the International Conference on Computer Vision Workshops, pp. 554-561, 2013.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Langley, P. Crafting papers on machine learning. In International Conference on Machine Learning, pp. 1207-1216, 2000.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision, 2015.

Lui, Y. M. Advances in matrix manifolds for computer vision. Image and Vision Computing, 30(6-7):380-388, 2012.

Luo, X., Wang, J., and Tang, H. Antigen design using generative diffusion models. Proceedings of the National Academy of Sciences, 2021.

Majumdar, A., Singh, R., and Vatsa, M. Face verification via class sparsity based supervised encoding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1273-1280, 2017. doi: 10.1109/TPAMI.2016.2569436.

Manchanda, S., Bhagwatkar, K., Balutia, K., Agarwal, S., Chaudhary, J., Dosi, M., Chiranjeev, C., Vatsa, M., and Singh, R. D-LORD: DySL-AI database for low-resolution disguised face recognition.
IEEE Transactions on Biometrics, Behavior, and Identity Science, 2023.

Mardia, K. V. and Jupp, P. E. Directional Statistics, volume 2. Wiley Online Library, 2000.

Matsubara, H. and Imai, M. Student-t distribution-based robust loss functions. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 2021.

Ming, Y., Sun, Y., Dia, O., and Li, Y. How to exploit hyperspherical embeddings for out-of-distribution detection? In International Conference on Learning Representations, 2023.

Mittal, S., Thakral, K., Singh, R., Vatsa, M., Glaser, T., Canton Ferrer, C., and Hassner, T. On responsible machine learning datasets emphasizing fairness, privacy and regulatory norms with examples in biometrics and healthcare. Nature Machine Intelligence, 6:936-949, 2024.

Pidstrigach, J. Score-based generative models detect manifolds. In Advances in Neural Information Processing Systems, volume 35, pp. 35852-35865, 2022.

Rezende, D. J., Papamakarios, G., Racanière, S., Albergo, M., Kanwar, G., Shanahan, P., and Cranmer, K. Normalizing flows on tori and spheres. In Proceedings of the International Conference on Machine Learning, pp. 8083-8092. PMLR, 2020.

Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.

Scott, T. R., Gallagher, A. C., and Mozer, M. C. von Mises-Fisher loss: An exploration of embedding geometries for supervised learning. In Proceedings of the International Conference on Computer Vision, pp. 10612-10622, 2021.

Shue, J. R., Chan, E. R., Po, R., Ankner, Z., Wu, J., and Wetzstein, G. 3D neural field generation using triplane diffusion. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 20875-20886, 2023.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, 2015.

Song, Y. and Ermon, S.
Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, 2019.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, 2011.

Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. Caltech-UCSD Birds-200-2011 (CUB-200-2011). Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z., and Liu, W. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 5265-5274, 2018.

Wang, Y., Xu, X., and Zhang, L. Diffusion models on spheres: Towards geometric generative modeling. In Advances in Neural Information Processing Systems, 2022.

Watson, J., Nichol, A., and Dhariwal, P. Learning fast samplers for diffusion models by differentiating through sample quality. In International Conference on Learning Representations, 2022.

Xu, X., He, X., and Wang, R. Rethinking noise in diffusion models: Alternatives to Gaussian. In Proceedings of the International Conference on Machine Learning, 2023.

A. Gaussian Noise Distorts Angular Relationship

A key challenge in hyperspherical data modeling is preserving the angular relationships that define class structure, especially when noise is introduced. In many generative and transformation-based approaches, Gaussian noise is commonly used to perturb data points. However, in non-Euclidean spaces like the hypersphere, such noise can significantly distort the underlying geometric structure.
Unlike structured perturbations that respect the manifold's constraints, isotropic Gaussian noise introduces deviations that shift points off the sphere and disrupt their relative angles. The following lemma formally establishes how Gaussian noise fails to maintain angular class structure, particularly in high-dimensional spaces where its effects become more pronounced.

Lemma A.1 (Gaussian Noise Distorts Angular Class Structure). Let $\{x_i\}_{i=1}^N$ be $N$ points on $S^{d-1}$ (grouped into classes) such that the angular distances (equivalently, dot products) between points reflect inter-class and intra-class relationships (e.g., points in the same class are close in angle, points in different classes subtend larger angles). Define the perturbed points
$$\tilde{x}_i = x_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2 I), \qquad i = 1, \ldots, N,$$
where the $\epsilon_i$ are i.i.d. isotropic Gaussians in $\mathbb{R}^d$. Then, in general, the inner products $\tilde{x}_i^\top \tilde{x}_j$ and $x_i^\top x_j$ are not preserved; i.e., with high probability, the angles among the perturbed points $(\tilde{x}_1, \ldots, \tilde{x}_N)$ differ significantly from those among the original points $(x_1, \ldots, x_N)$, especially as $d$ grows and/or $\sigma > 0$ is large.

Proof (Sketch):
i) No longer on the hypersphere. Since $\epsilon_i \neq 0$ almost surely, the perturbed points $\tilde{x}_i$ do not lie on $S^{d-1}$, so angles measured in $\mathbb{R}^d$ w.r.t. the origin are already changed. More precisely,
$$\|\tilde{x}_i\|^2 = \|x_i + \epsilon_i\|^2 = 1 + \|\epsilon_i\|^2 + 2\,x_i^\top \epsilon_i,$$
which (with probability 1) is not equal to 1. Thus, any spherical relationships are broken immediately.
ii) Inner products become random. Consider the inner product
$$\tilde{x}_i^\top \tilde{x}_j = (x_i + \epsilon_i)^\top (x_j + \epsilon_j) = x_i^\top x_j + x_i^\top \epsilon_j + x_j^\top \epsilon_i + \epsilon_i^\top \epsilon_j.$$
Since $\epsilon_i$ and $\epsilon_j$ are Gaussians with mean 0, each cross-term is a random variable whose distribution depends on $\sigma^2$ and $d$.
iii) High dimension amplifies distortion. In high dimension $d$ with $\sigma^2 > 0$, we typically have $\|\epsilon_i\| \approx \sqrt{d}\,\sigma$, so the energy in the noise vectors can overshadow the original norm $\|x_i\| = 1$. Hence $\|\tilde{x}_i\| \approx \sqrt{d}\,\sigma$, dominating any small angular differences that originally existed among the $\{x_i\}$.
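Steps (i)-(iii) are easy to verify numerically. The small Monte Carlo sketch below (our illustration, not the paper's code) measures how the mean inner-product distortion between random unit-vector pairs grows with the noise scale σ:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_angle_distortion(d, sigma, n_pairs=2000):
    """Mean |change in cosine| between random unit-vector pairs after Gaussian noise."""
    x = rng.standard_normal((n_pairs, d))
    y = rng.standard_normal((n_pairs, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y /= np.linalg.norm(y, axis=1, keepdims=True)
    xn = x + sigma * rng.standard_normal((n_pairs, d))   # perturbed points x + eps
    yn = y + sigma * rng.standard_normal((n_pairs, d))
    cos_before = np.sum(x * y, axis=1)
    cos_after = np.sum(xn * yn, axis=1) / (
        np.linalg.norm(xn, axis=1) * np.linalg.norm(yn, axis=1))
    return float(np.mean(np.abs(cos_after - cos_before)))

low = mean_angle_distortion(d=16, sigma=0.1)   # mild noise: small distortion
high = mean_angle_distortion(d=16, sigma=1.0)  # noise energy ~ sqrt(d)*sigma >> 1
```

With the larger σ the noise norm dominates the unit signal, and the perturbed angles bear little resemblance to the original ones, as the lemma asserts.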
Even for moderate $d$, if $\sigma$ is large enough, $\tilde{x}_i$ and $\tilde{x}_j$ become nearly orthogonal or randomly oriented (depending on the sign and correlation of the noise). Thus, the relative angles among the perturbed points often bear little resemblance to the original class structure.
iv) Conclusion. Because isotropic Gaussian noise in $\mathbb{R}^d$ shifts points off the hypersphere and injects random directions at scale $\sigma$, it fails to preserve the original spherical relationships (both inter-class angles and intra-class distributions). In fact, for large $d$ or sufficiently large $\sigma$, the new angles are effectively random, destroying the class separation that was originally encoded in angles on $S^{d-1}$.

B. Class Separation using vMF

A fundamental challenge in generative modeling on hyperspheres is ensuring class separability while preserving the underlying geometric structure. The von Mises-Fisher (vMF) distribution provides a natural way to model directional data while maintaining angular coherence. Unlike Gaussian noise, which distorts class boundaries by introducing random perturbations in Euclidean space, vMF-based modeling enforces directional consistency by concentrating samples around a mean direction with a tunable spread controlled by the concentration parameter $\kappa$. The following lemma establishes a probabilistic bound on class separability when data points from different classes are modeled using vMF distributions. It quantifies how the probability of misclassification depends on $\kappa$ and the angular separation $\theta$ between class centers. This result provides a theoretical foundation for setting $\kappa$ to achieve a desired classification accuracy and demonstrates that stronger concentration (higher $\kappa$) exponentially improves separation, reinforcing the effectiveness of vMF-based diffusion in preserving hyperspherical structure.

Lemma B.1 (Class Separation with von Mises-Fisher Distributions).
Consider two classes $C_1$ and $C_2$ on the unit hypersphere $S^{d-1}$, with mean directions $\mu_1, \mu_2 \in S^{d-1}$ separated by angle $\theta = \arccos(\mu_1^\top \mu_2)$. Suppose data points in each class follow von Mises-Fisher distributions with the same concentration parameter $\kappa > 0$:
$$p(x \mid \mu_i, \kappa) = C_d(\kappa) \exp(\kappa \mu_i^\top x), \qquad i = 1, 2, \quad (1)$$
where $C_d(\kappa)$ is the normalizing constant. Then:
(a) The probability of misclassification $P_e$ (classifying a point from class 1 as belonging to class 2 or vice versa) is bounded above by
$$P_e \le \exp\big(-\kappa (1 - \cos\theta)\big).$$
(b) For any desired error rate $\epsilon > 0$, setting the concentration parameter as
$$\kappa \ge \frac{1}{1 - \cos\theta} \log\frac{1}{\epsilon}$$
guarantees that $P_e \le \epsilon$.

Proof. 1. For a point $x$ drawn from class 1, i.e., $x \sim \mathrm{vMF}(\mu_1, \kappa)$, the probability density at angle $\phi$ from $\mu_1$ is $p(\phi) = C_d(\kappa) \exp(\kappa \cos\phi)$.
2. Misclassification occurs when $x$ is closer to $\mu_2$ than to $\mu_1$. By hyperspherical geometry, this happens when the angle $\phi$ between $x$ and $\mu_1$ exceeds $\theta/2$.
3. Therefore, the misclassification probability is
$$P_e = \int_{\theta/2}^{\pi} p(\phi)\, \sin^{d-2}(\phi)\, d\phi,$$
where $\sin^{d-2}(\phi)\, d\phi$ is the surface element on $S^{d-1}$.
4. For $\phi \ge \theta/2$, we have $\cos\phi \le \cos(\theta/2)$, thus
$$\int_{\theta/2}^{\pi} C_d(\kappa) \exp(\kappa \cos\phi)\, \sin^{d-2}(\phi)\, d\phi \le C_d(\kappa)\, \exp\big(\kappa \cos(\theta/2)\big) \int_{\theta/2}^{\pi} \sin^{d-2}(\phi)\, d\phi \le \exp\big(-\kappa(1 - \cos\theta)\big).$$
5. For part (b), solving $\exp(-\kappa(1 - \cos\theta)) \le \epsilon$ yields the required bound on $\kappa$.

Implications: (1) The bound tightens exponentially with $\kappa$. (2) Larger angular separation $\theta$ allows for smaller $\kappa$. (3) For fixed $\epsilon$, the required $\kappa$ scales inversely with class separation.

C. Variational Bound

In typical hyperspherical diffusion, the reverse process attempts to invert the forward noising so that data points return precisely to their original directions. By contrast, certain tasks (e.g., class-conditional generation) may demand a less rigid constraint: as long as the final sample lies within a small hypercone around a class-specific prototype, the objective is satisfied.
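The vMF noising used throughout the forward and reverse processes can be simulated with a standard rejection sampler due to Wood (1994). The sketch below is an illustrative stand-in for the paper's implementation, checking the two limiting regimes: large κ concentrates samples around the mean direction, while κ → 0 approaches the uniform distribution on the sphere:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_vmf(mu, kappa, n):
    """Draw n samples from vMF(mu, kappa) on S^{d-1} via Wood's rejection method."""
    d = mu.shape[0]
    if kappa < 1e-8:                       # kappa -> 0 recovers the uniform sphere
        x = rng.standard_normal((n, d))
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    b = (-2.0 * kappa + np.sqrt(4.0 * kappa**2 + (d - 1) ** 2)) / (d - 1)
    x0 = (1.0 - b) / (1.0 + b)
    c = kappa * x0 + (d - 1) * np.log(1.0 - x0**2)
    out = np.empty((n, d))
    for i in range(n):
        while True:                        # rejection loop for the cosine w = mu^T x
            zb = rng.beta((d - 1) / 2.0, (d - 1) / 2.0)
            w = (1.0 - (1.0 + b) * zb) / (1.0 - (1.0 - b) * zb)
            if kappa * w + (d - 1) * np.log(1.0 - x0 * w) - c >= np.log(rng.uniform()):
                break
        v = rng.standard_normal(d)         # uniform direction tangent to mu
        v -= (v @ mu) * mu
        v /= np.linalg.norm(v)
        out[i] = w * mu + np.sqrt(max(1.0 - w**2, 0.0)) * v
    return out

mu = np.array([1.0, 0.0, 0.0])
sharp = sample_vmf(mu, kappa=50.0, n=500)  # strongly concentrated around mu
flat = sample_vmf(mu, kappa=0.0, n=500)    # kappa -> 0: near-uniform on S^2
```

The same routine serves as one forward noising step, $z_t \sim \mathrm{vMF}(z_{t-1}, \kappa_t)$, when chained over a κ schedule.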
We formalize this idea by modifying the standard diffusion variational bound to allow for hypercone-constrained convergence in the reverse process.

C.1. Forward Process on the Hypersphere

We assume data $z_0 \in S^{d-1}$ are sampled from some distribution $q(z_0)$. The forward noising process gradually transforms $z_0$ into an approximately uniform distribution on the hypersphere by injecting von Mises-Fisher noise:
$$q(z_t \mid z_{t-1}) = \mathrm{vMF}(z_{t-1}, \kappa_t), \qquad t = 1, \ldots, T, \quad (2)$$
where $\mathrm{vMF}(\mu, \kappa)$ denotes a vMF distribution on $S^{d-1}$ with mean direction $\mu$ and concentration parameter $\kappa \ge 0$. For large $t$, $\kappa_t \to 0$, causing the distribution $q(z_T \mid z_0)$ to approach the uniform distribution on $S^{d-1}$. Hence, the complete forward chain for $(z_{0:T})$ is
$$q(z_{0:T}) = q(z_0) \prod_{t=1}^{T} q(z_t \mid z_{t-1}). \quad (3)$$
Our goal is then to reverse this process, recovering $z_0$ from a noisy $z_T$.

C.2. Class-Specific Hypercones and the Reverse Process

Hypercone Definition. We consider $C$ distinct classes, each identified by a direction $\mu_c \in S^{d-1}$ and an angular radius $\theta_c \ge 0$. The class hypercone $\mathcal{C}_c(\theta_c)$ is then defined as
$$\mathcal{C}_c(\theta_c) = \{\, z \in S^{d-1} : \angle(z, \mu_c) \le \theta_c \,\}. \quad (4)$$
Thus, each class $c$ corresponds to all directions on the hypersphere within angular distance $\theta_c$ of $\mu_c$. We introduce a class-conditional reverse process that moves from $z_T$ to $z_0$ in $T$ steps:
$$p_\theta(z_{0:T} \mid y) = p_\theta(z_0 \mid z_1, y) \prod_{t=1}^{T-1} p_\theta(z_t \mid z_{t+1}, y), \quad (5)$$
where $y \in \{1, \ldots, C\}$ is the class label. For each intermediate $t \ge 1$, we let
$$p_\theta(z_{t-1} \mid z_t, y) = \mathrm{vMF}\big(m_\theta(z_t, y), \kappa_t\big), \quad (6)$$
where $m_\theta(z_t, y)$ is a (learned) unit vector on the hypersphere and $\kappa_t$ is a (possibly deterministic or learned) concentration.

Class Hypercone at $t = 0$. Instead of insisting that $z_0$ exactly match the original data direction, we only require that $z_0$ lie in the appropriate class hypercone $\mathcal{C}_y(\theta_y)$. Hence, we can define
$$p_\theta(z_0 \mid z_1, y) = \big(\text{truncated vMF with support in } \mathcal{C}_y(\theta_y)\big), \quad (7)$$
so that $z_0 \notin \mathcal{C}_y(\theta_y)$ has zero probability. Equivalently, one may define a suitable parametric distribution that is sharply peaked around $\mu_y$ but has finite support within angle $\theta_y$.

C.3.
Variational Bound with Hypercone Constraint

Let $z_0$ belong to class $y$. We seek to maximize $p_\theta(z_0 \mid y)$ (the likelihood of reconstructing $z_0$ within the correct hypercone). A standard approach introduces the forward chain as a variational distribution and applies Jensen's inequality.

C.4. Evidence Lower Bound (ELBO) Derivation

We start from
$$\log p_\theta(z_0 \mid y) = \log \int p_\theta(z_{0:T} \mid y)\, dz_{1:T} = \log \int q(z_{1:T} \mid z_0, y)\, \frac{p_\theta(z_{0:T} \mid y)}{q(z_{1:T} \mid z_0, y)}\, dz_{1:T}, \quad (8)$$
where $q(z_{1:T} \mid z_0, y) = \prod_{t=1}^{T} q(z_t \mid z_{t-1}, y)$; note that in practice $q(z_t \mid z_{t-1}, y)$ usually coincides with $q(z_t \mid z_{t-1})$ if the forward noising is class-agnostic. Applying Jensen's inequality to the expression inside:
$$\log p_\theta(z_0 \mid y) = \log \mathbb{E}_{q(z_{1:T} \mid z_0, y)}\!\left[\frac{p_\theta(z_{0:T} \mid y)}{q(z_{1:T} \mid z_0, y)}\right] \ge \mathbb{E}_{q(z_{1:T} \mid z_0, y)}\!\left[\log \frac{p_\theta(z_{0:T} \mid y)}{q(z_{1:T} \mid z_0, y)}\right]. \quad (9)$$
Hence we define the negative ELBO
$$\mathcal{L}(\theta) = \mathbb{E}_{q(z_{1:T} \mid z_0, y)}\!\left[\log \frac{q(z_{1:T} \mid z_0, y)}{p_\theta(z_{0:T} \mid y)}\right], \quad \text{so that} \quad \log p_\theta(z_0 \mid y) \ge -\mathcal{L}(\theta). \quad (10)$$

C.5. Decomposition

By writing out $q(z_{1:T} \mid z_0, y)$ and $p_\theta(z_{0:T} \mid y)$ explicitly,
$$q(z_{1:T} \mid z_0, y) = \prod_{t=1}^{T} q(z_t \mid z_{t-1}, y), \qquad p_\theta(z_{0:T} \mid y) = p_\theta(z_0 \mid z_1, y) \prod_{t=1}^{T-1} p_\theta(z_t \mid z_{t+1}, y),$$
we get
$$\log \frac{q(z_{1:T} \mid z_0, y)}{p_\theta(z_{0:T} \mid y)} = -\log p_\theta(z_0 \mid z_1, y) + \sum_{t} \Big[\log q(z_t \mid z_{t-1}, y) - \log p_\theta(z_t \mid z_{t+1}, y)\Big]. \quad (11)$$
Taking the expectation under $q(z_{0:T} \mid z_0, y)$ yields
$$\mathcal{L}(\theta) = \underbrace{-\,\mathbb{E}_{q(z_{0:1} \mid z_0, y)}\big[\log p_\theta(z_0 \mid z_1, y)\big]}_{\text{(reconstruction into hypercone)}} + \underbrace{\sum_{t=1}^{T} \mathbb{E}_{q(z_{0:T} \mid z_0, y)}\Big[\log q(z_t \mid z_{t-1}, y) - \log p_\theta(z_t \mid z_{t+1}, y)\Big]}_{\text{(KL terms between forward vMF and reverse vMF)}}. \quad (12)$$

Hypercone Constraint. The key difference from standard hyperspherical diffusion is that $p_\theta(z_0 \mid z_1, y)$ is constrained to $\mathcal{C}_y(\theta_y)$, i.e., we ensure that $z_0$ stays within angular distance $\theta_y$ of the class mean $\mu_y$. Hence the reconstruction term $\log p_\theta(z_0 \mid z_1, y)$ does not drive the model to a single point $\mu_y$, but rather to the full hypercone $\mathcal{C}_y(\theta_y)$.
Mathematically, this can be implemented by a truncated vMF distribution or any parametric distribution that assigns zero probability whenever $\angle(z, \mu_y) > \theta_y$.

C.6. Resulting Objective

Combining the above, we arrive at
$$-\log p_\theta(z_0 \mid y) \le \mathcal{L}(\theta) = \mathbb{E}_{q(z_{1:T} \mid z_0, y)}\Big[-\log p_\theta(z_0 \mid z_1, y) + \sum_t \big(\log q(z_t \mid z_{t-1}, y) - \log p_\theta(z_t \mid z_{t+1}, y)\big)\Big]. \quad (13)$$
If each $p_\theta(z_t \mid z_{t+1}, y)$ is a $\mathrm{vMF}(m_\theta, \kappa_t)$ and each $q(z_t \mid z_{t-1}, y)$ is $\mathrm{vMF}(z_{t-1}, \kappa_t)$, then each KL term has a known closed form. The final $t = 0$ term effectively enforces that $z_0$ remains in the class hypercone. Thus, the chain converges to a distribution localized around $\mu_y$ with an angular radius $\theta_y$, rather than collapsing to a single direction.

D. Uncertainty Modelling

Lemma D.1 (Equivalence of Angular Interpolation and vMF Diffusion). Let $z_t \in S^{d-1}$ be generated by either:
1. Angular interpolation: $z_t = \cos(\theta_t) z_{t-1} + \sin(\theta_t) v$, where $v \sim \mathrm{Uniform}(S^{d-1})$;
2. vMF sampling: $z_t \sim \mathrm{vMF}(z_{t-1}, \kappa_t)$.
For $\kappa_t = \cot(\theta_t)$, these processes generate equivalent distributions over the hypersphere.

Proof. Under angular interpolation, $\theta_t = 0$ gives $z_t = z_{t-1}$ (perfect preservation), while $\theta_t = \pi/2$ gives $z_t = v$ (uniform noise). For vMF with $\kappa_t = \cot(\theta_t)$, these correspond to $\kappa_t \to \infty$ (perfect concentration) and $\kappa_t \to 0$ (uniform distribution), respectively. The equivalence follows from the conditional density
$$p(z_t \mid z_{t-1}) \propto \exp\big(\cot(\theta_t)\, z_{t-1}^\top z_t\big),$$
which matches the vMF density $f(z_t; z_{t-1}, \kappa_t) \propto \exp(\kappa_t z_{t-1}^\top z_t)$ when $\kappa_t = \cot(\theta_t)$.

Implications: This equivalence offers geometric (angular interpolation) and probabilistic (vMF) views of the forward process, with $\kappa_t = \cot(\theta_t)$ ensuring compatibility: the smooth progression $\theta_t : 0 \to \pi/2$ matches $\kappa_t : \infty \to 0$.

Lemma D.2 (Concentration of vMF Reverse Process into a Class Hypercone). Suppose we have a discrete-time reverse Markov chain $\{z_t\}_{0 \le t \le T} \subset S^{d-1}$ defined by
$$z_{t-1} \sim \mathrm{vMF}\Big(\Pi\big(z_t + \eta_t \nabla_{z_t} \log f(z_t; \mu_c)\big),\, \kappa_t\Big), \quad (14)$$
where:
1. $\Pi(x) := x / \|x\|$ is the projection onto the unit hypersphere $S^{d-1}$.
2.
$\nabla_{z_t} \log f(z_t; \mu_c)$ is the gradient (score function) of a density $f(z; \mu_c)$ that is sharply peaked around the class mean $\mu_c \in S^{d-1}$. In particular, this gradient points largely toward $\mu_c$ (along the tangential component of $\mu_c$ at $z_t$) whenever $z_t$ is not too close to $\mu_c$.
3. $\kappa_t$ (the vMF concentration) increases over time, and $\eta_t \to 0$ at a suitable rate (e.g., $\eta_t \kappa_t \to \infty$ but $\eta_t \to 0$ as $t \to 0$).

Define the class hypercone of half-angle $\alpha > 0$ around $\mu_c$ by
$$\mathcal{C}_\alpha(\mu_c) := \big\{\, u \in S^{d-1} : u^\top \mu_c \ge \cos(\alpha) \,\big\}.$$
Then, under mild smoothness assumptions on $f$ and the above monotonicity/decay rates for $\kappa_t$ and $\eta_t$, the chain converges (in distribution) into the hypercone $\mathcal{C}_\alpha(\mu_c)$ as $t \to 0$. Specifically, for any $\alpha > 0$,
$$\lim_{t \to 0} \mathbb{P}\big(z_t \in \mathcal{C}_\alpha(\mu_c)\big) = 1.$$
Equivalently, the angle between $z_t$ and $\mu_c$ goes to zero with high probability.

Sketch of Proof. We outline the main arguments:
1. Directional Gradient Alignment. By assumption, $\nabla_{z_t} \log f(z_t; \mu_c)$ points generally toward $\mu_c$ for $z_t$ not close to $\mu_c$. Hence the update $z_t + \eta_t \nabla_{z_t} \log f$ rotates $z_t$ closer in angle to $\mu_c$. After the projection $\Pi(\cdot)$ to the hypersphere, this remains a unit vector closer to $\mu_c$ than $z_t$ was.
2. Sharp Concentration under vMF. As $\kappa_t \to \infty$, $\mathrm{vMF}(m, \kappa_t)$ puts most of its mass around $m$, with the variance of the angular distribution shrinking like $1/\kappa_t$. Thus, if the mean direction $\Pi(z_t + \eta_t \nabla_{z_t} \log f)$ is already within $\alpha$ of $\mu_c$, then with high probability the new sample $z_{t-1}$ remains near $\mu_c$.
3. Iteration and Convergence. Given that $\eta_t \to 0$, the movement per step in the hypersphere's tangent space shrinks, preventing large excursions away from $\mu_c$. Meanwhile, $\kappa_t$ (the concentration) grows so that each step's sample is drawn from an increasingly peaked distribution. Iterating backward from $t = T$ down to $t = 0$, the probability of $z_t$ lying outside any cone $\mathcal{C}_\alpha(\mu_c)$ diminishes at each step. By the last steps near $t = 0$, $z_t$ concentrates with high probability in the chosen hypercone around $\mu_c$.
Thus, in the limit $t \to 0$, we conclude that $z_t$ converges in distribution to directions arbitrarily close to $\mu_c$, i.e., within any desired half-angle $\alpha$. Equivalently, the angle $\angle(z_t, \mu_c)$ goes to zero with high probability, implying $\lim_{t \to 0} z_t^\top \mu_c = 1$ almost surely.

Algorithm 3 Hypercone-Constrained Sampling with Learned Truncation
Require: Class label $y$, diffusion steps $T$
Require: Angular schedule $\{\theta_t\}_{t=1}^T$, step sizes $\{\eta_t\}_{t=1}^T$
1: Sample $z_T \sim \mathrm{Uniform}(S^{d-1})$
2: for $t = T$ to $1$ do
3:   $m_t \leftarrow D_\phi(z_t, t, y)$  {Predict direction}
4:   $\theta_y \leftarrow C_\psi(z_t, t, y)$  {Predict angle}
5:   $\mathcal{C}_{t,y} \leftarrow \{z : \angle(z, m_t) \le \theta_y\}$
6:   $\phi_t \leftarrow \angle(z_t, m_t)$
7:   $\kappa_t \leftarrow \kappa_{\max}\, \sigma(\beta[\theta_y - \phi_t])$
8:   if $t > 1$ then
9:     $\nabla_{z_t} \log f \leftarrow \mathrm{ScoreNet}_\theta(z_t, t, y)$
10:    $u_t \leftarrow z_t + \eta_t \nabla_{z_t} \log f$
11:    $u_t \leftarrow u_t / \|u_t\|$
12:    $z_{t-1} \sim \mathrm{TvMF}(u_t, \kappa_t, \mathcal{C}_{t,y})$
13:  end if
14: end for
15: return final sample $z_1$

E. Brownian Motion on Hypersphere

Recent advances in continuous-time score-based generative models (Song et al., 2021; Karras et al., 2022) suggest viewing the forward noising process as an SDE whose time-reversal recovers the data distribution. When data reside on a hypersphere $S^{d-1}$, an analogous approach constructs a Brownian motion restricted to $S^{d-1}$ and then reverses it to produce samples from the original (or a conditional) distribution. Concretely, let $z(t) \in S^{d-1}$ evolve for $t \in [0, T]$ such that it follows an intrinsic Brownian motion on the hypersphere. In the simplest form,
$$dz(t) = \sqrt{2\,\sigma^2(t)}\; P_{z(t)}\, dw(t), \quad (15)$$
where $\sigma(t) \ge 0$ is a noise scale (or diffusion coefficient) and $P_{z(t)}$ projects $\mathbb{R}^d$ increments $dw(t)$ onto the tangent space at $z(t) \in S^{d-1}$.

Spherical Forward Process and Hyperspherical Score Estimation. By choosing $\sigma(t)$ such that the marginal distribution of $z(T)$ approaches the uniform measure on $S^{d-1}$, we obtain a continuous analog of spherical diffusion. In practice, one can incorporate a small drift term $b(z, t)$ to ensure that $q(z(T))$ is nearly uniform, similar to the variance-preserving or variance-exploding schedules in Euclidean SDEs (Song et al., 2021).
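Algorithm 3 above can be sketched in a few lines once the learned components are stubbed out. Here the direction predictor, the predicted cone angle, the score network, and the truncated-vMF draw are all replaced by hypothetical stand-ins, so this only illustrates the control flow (direction prediction, adaptive κ via the sigmoid gate, score-guided update, constrained draw), not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hypercone_sampling(mu_y, T=60, kappa_max=200.0, beta=8.0, eta=0.3):
    """Control flow of Algorithm 3 with stand-ins for the learned components."""
    d = mu_y.shape[0]
    z = normalize(rng.standard_normal(d))          # z_T ~ Uniform(S^{d-1})
    for t in range(T, 1, -1):                      # loop with the t > 1 guard
        m_t = mu_y                                 # stub: predicted direction
        theta_y = 0.2                              # stub: predicted cone angle
        phi_t = np.arccos(np.clip(z @ m_t, -1.0, 1.0))       # angle to m_t
        kappa_t = kappa_max * sigmoid(beta * (theta_y - phi_t))  # adaptive kappa
        score = m_t - (m_t @ z) * z                # stub score: tangential pull to mu_y
        u = normalize(z + eta * score)             # score-guided update, re-projected
        # crude stand-in for the truncated-vMF draw: jitter of size ~ 1/sqrt(kappa_t)
        a = 1.0 / np.sqrt(max(kappa_t, 25.0))
        w = rng.standard_normal(d)
        w = normalize(w - (w @ u) * u)             # uniform tangent direction at u
        z = normalize(np.cos(a) * u + np.sin(a) * w)
    return z

mu = np.eye(16)[0]
z_final = hypercone_sampling(mu)
```

With these stand-ins the chain drifts into the cone around `mu` and the sigmoid gate then keeps the effective concentration high, mirroring the freezing behavior the algorithm is designed to exhibit.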
Concurrently, we learn a score network $s_\theta(z, t)$ that approximates $\nabla_z \log q_t(z)$, using a spherical variant of score matching (Vincent, 2011).

Reverse-Time SDE and Generative Sampling. Once $s_\theta$ is trained, we generate new samples by solving the reverse-time SDE. Formally, time reversal of the forward process (15) yields
$$dz(t) = \sigma^2(t)\, s_\theta(z, t)\, dt + \sqrt{2\,\sigma^2(t)}\; P_{z(t)}\, dw(t),$$
where $t$ runs from $T$ down to $0$, $dw(t)$ is again Brownian noise on the tangent space, and we solve backward from $z(T) \sim \mathrm{Uniform}(S^{d-1})$ down to $z(0)$. Intuitively, the term $\sigma^2(t)\, s_\theta(z, t)$ acts as a drift that guides samples toward the data manifold on the hypersphere.

Comparison to vMF Diffusion. In discrete-time spherical diffusion using vMF noise, each forward step is $\mathrm{vMF}(z_{t-1}, \kappa_t)$, while the reverse step estimates $\mathrm{vMF}(\cdot, \kappa_t)$ with a learned center. Spherical Brownian motion (15) can be viewed as the continuous limit of many small vMF perturbations. Conversely, discretizing the reverse SDE via the Euler-Maruyama method yields a small-angle vMF reverse step that remains tangent to $S^{d-1}$.

Lemma E.1 (Discrete-Continuous Correspondence). Let $q_{\Delta t}(z_{t+\Delta t} \mid z_t)$ be a vMF transition with concentration $\kappa_{\Delta t}$. As $\Delta t \to 0$ with $\kappa_{\Delta t} = O(1/\Delta t)$, the process converges weakly to the solution of the spherical Brownian motion SDE (15).

Handling Class Hypercones and Adaptive Freezing. Finally, to respect class geometry or hypercone constraints, one may introduce a drift term that becomes very large (or a reflecting boundary) whenever $z(t)$ moves outside the class-constrained cone. Alternatively, if each class $y$ has a known center $\mu_y$ and angular tolerance $\theta_y$, the reverse SDE can incorporate a penalty encouraging $\angle(z, \mu_y) \le \theta_y$, effectively locking (constraining to a sub-manifold such as a hypercone on the sphere) trajectories in the appropriate region of $S^{d-1}$.
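A tangent-space Euler-Maruyama discretization of the reverse-time SDE can be sketched as follows; the `score` function is a hypothetical stand-in for the trained $s_\theta$, and the coefficients follow the $\sqrt{2\sigma^2(t)}$ diffusion term used above:

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_sde_step(z, t, dt, sigma2, score):
    """One tangent-space Euler-Maruyama step of the reverse-time spherical SDE."""
    P = np.eye(z.shape[0]) - np.outer(z, z)              # tangent projector at z
    drift = sigma2(t) * score(z, t) * dt                 # sigma^2(t) s_theta(z,t) dt
    noise = np.sqrt(2.0 * sigma2(t) * dt) * rng.standard_normal(z.shape)
    z_new = z + P @ (drift + noise)                      # move within the tangent space
    return z_new / np.linalg.norm(z_new)                 # retract back onto S^{d-1}

# Toy check: with a score whose tangential component points at mu, iterating the
# step pulls an arbitrary starting point toward mu while staying on the sphere.
mu = np.array([0.0, 1.0, 0.0])
score = lambda z, t: 50.0 * (mu - (mu @ z) * z)          # stand-in for s_theta
z = np.array([1.0, 0.0, 0.0])
for k in range(200):
    z = reverse_sde_step(z, t=1.0 - k / 200.0, dt=0.05,
                         sigma2=lambda t: 0.1, score=score)
```

The projection-then-retraction pattern is what keeps the discretized trajectory on the manifold, matching the small-angle vMF view of each reverse step.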
Such mechanisms ensure the final sample remains class-consistent while leveraging the flexibility and elegance of spherical Brownian motion as the underlying diffusion.

F. Beyond Geometry and Uncertainty: Retaining Image Information

For the experiment with two diffusion processes, the forward (noising) step at time $t$ can be viewed as
$$\alpha_t = \alpha_{t-1} + \sigma\, \epsilon_t, \qquad d_t \sim \mathrm{vMF}(d_{t-1}, \kappa_t),$$
where $\sigma \epsilon_t$ is a small Gaussian perturbation (e.g., $\epsilon_t \sim \mathcal{N}(0, 1)$) controlling how the magnitude spreads, and $d_t$ is sampled from a vMF distribution centered at $d_{t-1}$ with concentration $\kappa_t$. As $t$ increases, $\kappa_t$ may decrease (making directions more diffuse), while $\sigma$ can enlarge the variance of $\alpha_t$, allowing the embedding norm to fluctuate more widely.

Reverse (Denoising) Process. We define the reverse chain to invert both magnitude and direction back to the original class-consistent configuration. To invert this process, we learn parameters $\theta$ that predict a suitable Gaussian mean for the magnitude and a suitable mean direction for the vMF. Formally, we define
$$\hat{\alpha}_{t-1} \sim \mathcal{N}\big(\mu_\theta(\alpha_t, d_t),\, \tilde{\sigma}^2\big), \qquad \hat{d}_{t-1} \sim \mathrm{vMF}\big(m_\theta(\alpha_t, d_t),\, \tilde{\kappa}_t\big), \qquad x_{t-1} = \hat{\alpha}_{t-1}\, \hat{d}_{t-1}.$$
Here, $\mu_\theta(\cdot)$ is a neural network output that infers the ideal norm given the current $\alpha_t$ and $d_t$, while $m_\theta(\cdot) \in S^{d-1}$ is another network output that estimates the best directional center for the vMF. This design provides a powerful way to incorporate both global image information and class geometry: we learn to denoise both magnitude (via the Gaussian) and direction (via the vMF), thus recovering the class-relevant structure on the hypersphere encoded in the angular direction while preserving essential image information encoded in $\alpha$. This dual-uncertainty approach ensures that face embeddings can vary naturally in intensity or brightness, yet remain consistent with class geometry, capturing both how bright an image is and which identity it belongs to.

Theorem F.1 (Dual-Uncertainty Preservation). Let $x_0 \in \mathbb{R}^d$ with $\alpha_0 = \|x_0\|$ and $d_0 = x_0 / \|x_0\|$.
Under the forward process

α_t = α_{t−1} + σ ε_t, ε_t ∼ N(0, 1), d_t ∼ vMF(d_{t−1}, κ_t), x_t = α_t d_t,

the following holds for all t ≥ 0:

E[α_t − α_0] = 0, Var(α_t − α_0) = tσ², E[⟨d_t, d_0⟩] = ∏_{s=1}^{t} A_d(κ_s),

where A_d(κ) = I_{d/2}(κ)/I_{d/2−1}(κ) is the ratio of modified Bessel functions. Moreover, p(x_t | x_0) = p(α_t | α_0) p(d_t | d_0) with d_t ∈ S^{d−1} almost surely.

G. Adaptive Class-based Hypercone Learning

Class Hypercone Setup. Each class y ∈ {1, . . . , C} is associated with a direction µ_y ∈ S^{d−1} and an angular radius θ_y ≥ 0. Hence, the class hypercone is defined by

C_y(θ_y) = { z ∈ S^{d−1} : ∠(z, µ_y) ≤ θ_y }. (16)

Forward (Noising) Process. We consider a forward chain of length T, where each step injects vMF noise:

q(z_{1:T} | z_0) = ∏_{t=1}^{T} q(z_t | z_{t−1}), where q(z_t | z_{t−1}) = vMF(z_{t−1}, κ_t). (17)

Here, κ_t is a (potentially decreasing) schedule that pushes z_t toward the uniform distribution on S^{d−1} as t grows.

Adaptive Reverse with Learned Concentration. Instead of using a fixed reverse schedule, we let

κ_θ(z_t, y) : S^{d−1} × {1, . . . , C} → R_{≥0} (18)

be a learned function of the current state z_t ∈ S^{d−1} and the class y. We define the reverse model as

p_θ(z_{0:T} | y) = p(z_T) ∏_{t=1}^{T} p_θ(z_{t−1} | z_t, y), where p_θ(z_{t−1} | z_t, y) = vMF(m_θ(z_t, y), κ_θ(z_t, y)). (19)

Here, p(z_T) is uniform on the hypersphere (i.e., the limiting vMF with zero concentration), and m_θ(z_t, y) ∈ S^{d−1} is a learned mean direction. The key distinction is that κ_θ(z_t, y) can grow large once z_t is in the correct hypercone, effectively freezing further denoising.

Example of a Freezing Mechanism. Let α(z_t, y) = ∠(z_t, µ_y) and

κ_θ(z_t, y) = κ_max σ(β (θ_y − α(z_t, y))), (20)

where σ(·) is a monotonic squashing function (e.g., the sigmoid), κ_max is a large positive constant, and β > 0 controls the slope. If α(z_t, y) ≤ θ_y, then κ_θ(z_t, y) saturates near κ_max, locking z_{t−1} into the hypercone C_y(θ_y) by making the vMF distribution highly concentrated.
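The freezing mechanism of Eq. (20) can be sketched directly, reading α(z_t, y) as the angle between z_t and µ_y; the values of κ_max and β below are illustrative assumptions:

```python
import numpy as np

def kappa_theta(z_t, mu_y, theta_y, kappa_max=500.0, beta=20.0):
    """Adaptive concentration for the reverse vMF step (Eq. 20 sketch).

    Saturates near kappa_max once z_t lies inside the class hypercone
    (angle to mu_y below theta_y), effectively freezing further denoising.
    kappa_max and beta are assumed constants, not tuned values from the paper.
    """
    angle = np.arccos(np.clip(np.dot(z_t, mu_y), -1.0, 1.0))
    return kappa_max / (1.0 + np.exp(-beta * (theta_y - angle)))  # sigmoid squashing
```

With a large β the sigmoid approaches a step function, so κ_θ switches sharply between near-zero (outside the cone) and κ_max (inside).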
H. Experimental Details

For feature extraction, we employ different architectures based on the domain: facial representations are obtained using a pre-trained ArcFace (Deng et al., 2019) model with an iResNet50 (Duta et al., 2021) backbone, while object categories use a CNN-based feature extractor, both producing hyperspherical embeddings. The latent space is configured with dimensions 32×32×3, balancing detail and computational efficiency. Our diffusion process uses an angular parameter θ that progresses from 0 to π/2 across diffusion steps, with the concentration parameter κ derived as cot(θ), naturally transitioning from high concentration (κ ≫ 1 at θ = 0) to a uniform distribution. For class-conditional generation, we utilize a context dimension of 512 to encode class information. The class-specific hypercone constraints are implemented through adaptive κ values, ranging from 0 to a class-specific maximum determined by the class angular radius θ_c, which is predicted by a UNet architecture. The maximum κ follows κ_max = log(1/ε)/(1 − cos(θ_c)), ensuring generated samples respect class-specific geometric structure. For training, we use the Adam optimizer with a learning rate of 1e-4 and a batch size of 128, with the model trained for 100K iterations on a single NVIDIA A100 GPU.

Figure 8. Geometric interpretation of the proposed metrics HCR and HDS. The Hypercone Coverage Ratio (HCR) measures the amount of samples outside the original class hypercone (0 ≤ HCR ≤ 1; lower HCR indicates better preservation, higher HCR indicates angular distortion). The Hypercone Difficulty Skew (HDS) measures the spread of samples across multiple sub-hypercones of a class (0 ≤ HDS ≤ 1; lower HDS indicates better generalization, higher HDS indicates simplicity bias).
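Under these settings, the cot(θ) schedule and the class-specific cap can be computed as follows (a sketch; the offset that keeps cot finite at θ = 0 is our assumption):

```python
import numpy as np

def kappa_schedule(num_steps, ):
    """Concentration schedule: theta progresses from 0 to pi/2, kappa = cot(theta).

    The first grid point theta = 0 is skipped to keep cot finite at the start;
    this offset choice is an assumption of the sketch.
    """
    thetas = np.linspace(0.0, np.pi / 2, num_steps + 1)[1:]  # skip theta = 0
    return 1.0 / np.tan(thetas)  # cot(theta): large near theta = 0, ~0 at pi/2

def kappa_max_for_class(theta_c, eps=1e-6):
    """Class-specific cap: kappa_max = log(1/eps) / (1 - cos(theta_c))."""
    return np.log(1.0 / eps) / (1.0 - np.cos(theta_c))
```

The schedule is monotonically decreasing, matching the intended transition from high concentration toward the uniform distribution; a narrow class cone (small θ_c) yields a large κ_max, i.e., tightly concentrated samples.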
For comparison with Gaussian-based diffusion, we used the variance-preserving variant of diffusion.

H.1. Analysis of Metrics

The Hypercone Coverage Ratio (HCR) and Hypercone Difficulty Skew (HDS) together offer a comprehensive framework for evaluating generative models in terms of their alignment with the original class distribution and their ability to handle sample difficulty. HCR primarily assesses whether the generative model preserves the class structure by examining the percentage of generated samples that fall outside the class's hypercone. A lower HCR suggests that the model generates samples that remain within the expected angular region of the class distribution, indicating that it faithfully reproduces the structure of the original class. A higher HCR, on the other hand, implies that the model is producing out-of-class or unrealistic samples that deviate significantly from the class's expected region. This can signal overfitting or a failure to learn the true class distribution, leading to unrealistic or poorly sampled outputs. HDS, in contrast, focuses on the model's ability to balance difficulty across the generated samples. By dividing the class's hypercone into multiple sub-cones based on increasing angular thresholds, HDS captures how well the model distributes its generated samples across regions of varying difficulty. The model is expected to generate a mix of easy samples (close to the class mean) and hard samples (further from the mean), reflecting the full diversity of the class. A high HDS indicates that the model primarily generates easy samples clustered in the innermost sub-cones, suggesting a bias toward overfitting simpler patterns. This bias is undesirable because it fails to capture the full range of complexity in the class distribution.
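As an illustration, HCR and a proxy for HDS might be computed from unit-norm embeddings as follows; the exact sub-cone weighting behind HDS is not reproduced in this appendix, so the `hds` function below is an assumed proxy (fraction of in-cone samples in the innermost sub-cone), not the paper's formula:

```python
import numpy as np

def hcr(samples, mu, theta):
    """Hypercone Coverage Ratio: fraction of samples whose angle to the class
    mean direction mu exceeds the hypercone half-angle theta (lower is better)."""
    angles = np.arccos(np.clip(samples @ mu, -1.0, 1.0))
    return float(np.mean(angles > theta))

def hds(samples, mu, theta, num_bins=4):
    """Hypercone Difficulty Skew (assumed proxy): share of in-cone samples
    landing in the innermost of num_bins equal-angle sub-cones. High values
    indicate clustering near the class mean, i.e., simplicity bias."""
    angles = np.arccos(np.clip(samples @ mu, -1.0, 1.0))
    inside = angles[angles <= theta]
    if inside.size == 0:
        return 0.0
    edges = np.linspace(0.0, theta, num_bins + 1)
    counts, _ = np.histogram(inside, bins=edges)
    return float(counts[0] / inside.size)
```

Both functions expect rows of `samples` and `mu` to be unit vectors, consistent with the hyperspherical embedding setup.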
Conversely, a low HDS suggests that the model distributes its samples more evenly across different difficulty levels, which is indicative of a more robust generative process that captures both simple and complex variations in the class. Together, these two metrics paint a fuller picture of a generative model's performance. A well-balanced generative model should ideally have a low HCR, reflecting good preservation of class boundaries, and a moderate HDS, indicating that it generates a variety of sample difficulties and captures the full complexity of the class distribution. A high HDS combined with a low HCR might indicate that the model is overly focused on easy samples, while a high HCR with a low HDS could suggest that the model struggles to maintain class consistency. Thus, an optimal model should strike a balance, maintaining good coverage of the class's angular space without overfitting to simpler, easier samples.

H.2. Results

Reverse Denoising Comparison. The results presented in Tables 3 to 5 provide a comparative analysis of Euclidean versus angular-based reverse denoising strategies across three datasets: CIFAR-10, MNIST, and D-LORD. For all datasets, we observe that angular-based methods, particularly those using cosine and geodesic formulations, consistently offer improvements or comparable performance across key metrics. Specifically, angular approaches yield lower FID scores, indicating better sample quality. For example, FID improves from 3.52 to 3.28 on CIFAR-10 and from 1.86 to 1.77 on MNIST. In terms of Hypercone Coverage Ratio (HCR), angular losses preserve class structure similarly to or better than MSE-based approaches, while Hypercone Difficulty Skew (HDS) values suggest that angular methods produce a more balanced range of sample difficulty.
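For unit-norm predictions and targets, the cosine and geodesic objectives referenced in these comparisons are typically of the following form (a sketch of standard angular losses, not necessarily the paper's exact objectives):

```python
import numpy as np

def cosine_loss(pred, target):
    """Angular loss via cosine similarity of unit vectors: 1 - <pred, target>."""
    return 1.0 - np.sum(pred * target, axis=-1)

def geodesic_loss(pred, target):
    """Angular loss via squared geodesic (arc-length) distance on the sphere."""
    cos = np.clip(np.sum(pred * target, axis=-1), -1.0, 1.0)
    return np.arccos(cos) ** 2
```

Both vanish exactly when prediction and target coincide on the sphere, unlike MSE, which also penalizes magnitude differences irrelevant to the angular geometry.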
These improvements highlight that angular loss functions better align with the intrinsic hyperspherical nature of feature embeddings, enhancing both generation quality and class consistency. The consistency of these trends across datasets confirms the generalizability of angular reverse denoising in hyperspherical diffusion models.

Table 3. Comparative analysis of Euclidean and angular-based reverse denoising steps for the CIFAR-10 dataset. Various score functions are also used for evaluation.

         Euclidean (MSE)   Angular (Cosine)   Angular (Geodesic)
FID      3.52              3.28               3.35
HCR      0.20              0.20               0.19
HDS      0.48              0.51               0.47

Table 4. Comparative analysis of Euclidean and angular-based reverse denoising steps for the MNIST dataset.

         Euclidean (MSE)   Angular (Cosine)   Angular (Geodesic)
FID      1.86              1.79               1.77
HCR      0.14              0.15               0.14
HDS      0.52              0.47               0.48

Table 5. Comparative analysis of Euclidean and angular-based reverse denoising steps for the D-LORD dataset.

         Euclidean (MSE)   Angular (Cosine)   Angular (Geodesic)
FID      9.27              9.01               8.97
HCR      0.21              0.19               0.20
HDS      0.62              0.61               0.62

Effect of Scheduling κ. Scheduling κ controls the rate at which class structure degrades, ensuring a smooth transition to uniform noise. Without scheduling, any fixed κ eventually results in a uniform distribution on the hypersphere, particularly as T increases. Gradually decaying κ_t preserves intra-class structure longer, aiding recovery during the reverse process. Formally, for d_t ∼ vMF(d_{t−1}, κ_t), the marginal distribution approaches uniformity as κ_T → 0: p(d_T) → 1/|S^{d−1}|. Empirical results on the effect of κ scheduling are shown in Table 6.

Feature Representation. Figure 9 illustrates the feature representations of conditional samples generated from the 10 classes of the CIFAR-10 dataset. The figure highlights that the vMF-based reverse sampling effectively converges samples within class-specific hypercones, capturing the angular geometry of the data.
In contrast, Gaussian-based reverse sampling produces samples that converge within Euclidean space, failing to adhere to the hyperspherical structure.

Facial Data Generation. Figure 10 illustrates generated images exhibiting diversity and facial occlusion challenges, which is critical for robust face recognition under real-world conditions. The top section presents multiple views of the same subject generated without occlusion, showing typical intra-subject variation. The middle section focuses on cases with eye-region occlusion caused by sunglasses, while the bottom section includes examples of full occlusion from multiple accessories such as hats and scarves.

Hypercone-Specific Generation. Figures 12 and 13 show samples generated from the inner and outer hypercones based on the class-specific θ_k that determines the class boundary. As the figures demonstrate, images sampled from the inner hypercone are sharp and realistic, while samples generated around the boundary are noisy. Additional generated samples are shown in Figure 11.

Table 6. Effect of using the κ scheduler. Comparative analysis of using the κ scheduler on two datasets for various evaluation metrics.

Metric               CIFAR-10 (with / without scheduling)   MNIST (with / without scheduling)
Class-wise Accuracy  89.35 / 72.59                          96.01 / 86.48
FID                  3.52 / 6.28                            1.86 / 2.11
HCR                  0.20 / 0.37                            0.14 / 0.21
HDS                  0.48 / 0.63                            0.52 / 0.75

Figure 9. Feature representation of the CIFAR-10 dataset generated using Gaussian-based diffusion (left) and vMF-based diffusion (right). The vMF-based sampling aligns generated sample features within class-specific 3D hypercones, while Gaussian-based sampling results in scattered features outside the class hypercones.

I. Applications

The applications of the proposed vMF-based angular diffusion are:
Few-shot learning: Our approach improves performance by generating more diverse and class-consistent samples from limited data.

Fairness and bias mitigation: Manifold-aware generation allows controlled augmentation to rebalance datasets across demographics, reducing biases.

Face recognition robustness: Explicitly preserving directional structures helps models robustly handle variations (occlusion, illumination, pose).

Difficult sample generation: Controlled angular diffusion produces challenging samples near class boundaries, refining decision boundaries and improving model reliability.

Figure 10. Facial data synthesis demonstrates variations across occlusion, pose, and resolution.

Figure 11. Generated samples from the proposed vMF-based diffusion model trained on (a) MNIST and (b) CIFAR-10 datasets. The samples effectively preserve class-specific information while maintaining high visual quality.

Figure 12. Generated samples from the inner and outer hypercones using the proposed vMF-based diffusion model trained on the MNIST dataset. Samples from the outer hypercone exhibit noticeable noise and distortions.

Figure 13. Generated samples from the inner and outer hypercones using the proposed vMF-based diffusion model trained on the CIFAR-10 dataset. Inner hypercone samples exhibit superior quality and effectively preserve class-specific information.