Controllable Data Generation with Hierarchical Neural Representations

Sheyang Tang 1, Xiaoyu Xu 1, Jiayan Qiu 2, Zhou Wang 1

1 Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada. 2 School of Computing and Mathematical Sciences, University of Leicester, Leicester, United Kingdom. Correspondence to: Xiaoyu Xu.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Abstract

Implicit Neural Representations (INRs) represent data as continuous functions using the parameters of a neural network, so that data information is encoded in the parameter space. Modeling the distribution of these parameters is therefore crucial for building generalizable INRs. Existing approaches learn a joint distribution of the parameters via a latent vector to generate new data, but such a flat latent often fails to capture the inherent hierarchical structure of the parameter space, leading to entangled data semantics and limited control over the generation process. Here we propose a Controllable Hierarchical Implicit Neural Representation (CHINR) framework, which explicitly models conditional dependencies across layers in the parameter space. Our method consists of two stages. In Stage-1, we construct a Layers-of-Experts (LoE) network, where each layer modulates distinct semantics through a unique latent vector, enabling disentangled and expressive representations. In Stage-2, we introduce a Hierarchical Conditional Diffusion Model (HCDM) to capture conditional dependencies across layers, allowing for controllable and hierarchical data generation at various semantic granularities. Extensive experiments across different modalities demonstrate that CHINR improves generalizability and offers flexible hierarchical control over the generated content.

1. Introduction

Implicit Neural Representations (INRs) are powerful tools for representing complex data as continuous functions with neural networks (Tancik et al., 2020; Mildenhall et al., 2021; You et al., 2024), offering compact and universal representations across diverse data modalities such as audio (Su et al., 2022), images (Sitzmann et al., 2020; Dupont et al., 2021), videos (Chen et al., 2021a; 2023), and 3D volumes (Mildenhall et al., 2021; Zhao et al., 2022; Michalkiewicz et al., 2019). By modeling data as functions $f_\theta : \mathcal{X} \rightarrow \mathcal{F}$, with $\mathcal{X}$ and $\mathcal{F}$ being the input (e.g., pixel coordinates) and output (e.g., RGB values) spaces, INRs implicitly encode data as a hidden manifold within the parameter space of $\theta$, capturing the underlying structure of the data. By modeling the distribution of parameters $p(\theta)$, generative INRs hold promise for universal data generation (Dupont et al., 2022c;a; Bauer et al., 2023; You et al., 2024). Nevertheless, two fundamental questions have long been overlooked: how are these parameters related to data semantics, and how can the parameters be controlled to generate the expected semantics? Addressing this gap is critical for advancing the controllability of INR-based generative frameworks.

Notably, we observe that INRs (e.g., SIREN (Sitzmann et al., 2020)) naturally exhibit a hierarchical structure, where each layer progressively expands the representational capacity of the model (Yüce et al., 2022). This expansion is intrinsically linked to the frequency basis of INRs: earlier layers capture coarse-grained features, while later layers progressively refine fine-grained details (Section 2.2 provides a detailed analysis). As shown in Fig. 1, this progression aligns with the semantic abstraction hierarchy in data. For example, in facial images, hierarchical semantics are reflected in progressively detailed facial attributes, such as overall facial shape, expression, and the shape of the eyes.
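To make the "data as a continuous function" view concrete, here is a minimal numpy sketch of an INR-style fit. It is a toy stand-in, not the paper's architecture: a fixed sinusoidal positional embedding with a closed-form linear readout replaces a trained MLP, and the 1-D signal, frequency basis, and helper names (`gamma`, `f_theta`) are illustrative assumptions.

```python
import numpy as np

# Toy 1-D "image": a signal sampled on a coarse coordinate grid.
x_train = np.linspace(0.0, 1.0, 64)
signal = np.sin(2 * np.pi * 3 * x_train) + 0.5 * np.sin(2 * np.pi * 7 * x_train)

# Positional embedding gamma(x): a fixed sinusoidal frequency basis Omega.
freqs = np.arange(1, 17)  # 16 frequency components

def gamma(x):
    # Map coordinates to sinusoidal features, shape (len(x), 2 * len(freqs)).
    ang = 2 * np.pi * np.outer(x, freqs)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

# "theta": here just a linear readout over the features, solved in closed form.
W, *_ = np.linalg.lstsq(gamma(x_train), signal, rcond=None)

def f_theta(x):
    # The INR: a continuous function of coordinates, queryable anywhere.
    return gamma(x) @ W

# Because f_theta is continuous, we can resample at 4x the training resolution.
x_fine = np.linspace(0.0, 1.0, 256)
print("train MSE:", float(np.mean((f_theta(x_train) - signal) ** 2)))
print("fine-grid samples:", f_theta(x_fine).shape)
```

The key property this illustrates is that the data lives entirely in the parameters `W`: once fitted, the signal can be evaluated at arbitrary coordinates, which is what makes modeling the distribution of such parameters equivalent to modeling the data itself.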
This connection between INRs' hierarchical structure and data semantic abstraction offers a natural pathway to hierarchical control. However, existing generative INR approaches (Dupont et al., 2022a; Bauer et al., 2023; You et al., 2024) overlook this hierarchy, instead modeling the joint distribution of flattened INR parameters, either directly as $p(\theta)$ or through the distribution of condensed latents. This ignores the correspondence between the INR's expanded representational capacity and the hierarchical semantics of the data. The misalignment brings two challenges: (1) Control over the generated content is limited. While generative models such as latent diffusion can be paired with INRs to generate new data, they cannot link the sampled noise to the expected semantics in the output. (2) The generalizability of INRs to unseen data is impaired. Joint distribution learning encodes entangled semantics together in one flat latent, where the co-occurrence of certain semantics is inevitable; the diversity of the generated data is therefore limited.

Figure 1. Hierarchical generation of a universal data modality. Top: expanded representational capabilities of a Layers-of-Experts INR model. Bottom: aligning these capabilities with hierarchical data patterns enables precise control over the generation process. Each column presents generated samples resulting from a divergence in routing at a specific layer, with arrows indicating the shared routing in the preceding layers.

To address this gap, we propose a Controllable Hierarchical Implicit Neural Representation (CHINR) framework that exploits the hierarchical structure of INRs, enabling layer-wise control of the generation process. Our method starts by training a collection of INRs on a dataset.
Each layer of the INR is parameterized as a Mixture-of-Experts (MoE) layer to increase expressivity, where a set of expert weights and latents are learned. The experts at each layer are shared across the dataset, while the latents are data-specific, routing the data flow and modulating the contribution of the experts. Layers of MoE are cascaded to form an INR, which we call a Layers-of-Experts (LoE) network. Consequently, a LoE with L layers has L latents adapted to the fitted data, effectively capturing and relating its complex patterns through layers of latents. By modeling the conditional dependencies of these layer-wise latents with a Hierarchical Conditional Diffusion Model (HCDM), we preserve the hierarchical structure of INRs. This unlocks a controllable generation process, aligning the layers of INRs with the hierarchies of data semantics for the first time. As illustrated in Fig. 1, the data flow of a generation process resembles a tree-like structure: the routing in the next layer is constrained by the paths taken in previous layers, giving full control over where a different routing strategy should be explored. Early deviations in routing lead to significant semantic differences in the generated content, while later deviations result in minor differences in details.

The contributions of our paper are summarized as follows:

- We are the first to achieve hierarchical control in generative INRs by modeling hierarchical and conditional dependencies in INR parameters. This enables layer-wise, precise control over data semantics during generation.
- We model the INR as a LoE framework to enhance expressivity. This design aligns the inherent hierarchy of data semantics with the INR's expanded representational capacity, enabling the generation of diverse data.
- The proposed CHINR shows broad versatility across modalities, highlighting the broad applicability of hierarchical control and conditional dependency modeling to data semantics with inherent hierarchies.
We conduct extensive experiments across various modalities, demonstrating CHINR's superior performance in reconstruction and generation metrics.

2. Background

In this section, we introduce INRs and generative INRs, highlighting their connections to our proposed approach. We also analyze the inherent hierarchy in INR parameters.

2.1. Implicit Neural Representations & Generative INRs

Implicit Neural Representations (INRs) parameterize data such as audio, images, video, and 3D voxels as mappings from coordinates to signals, enabling a unified framework for various data modalities (Genova et al., 2019a;b; Xie et al., 2022). Remarkable progress has been made in enhancing representation quality, efficiency, and compactness for audio (Zuiderveld et al., 2021; Luo et al., 2022; Su et al., 2022; Lanzendörfer & Wattenhofer), images (Sitzmann et al., 2020; Fathony et al., 2020; Chen et al., 2021b; Xu et al., 2022a; Saragadam et al., 2023; Yu et al., 2024; Xu et al., 2024b; 2022b; Qiu et al., 2021), 3D content (Mildenhall et al., 2021; Barron et al., 2021; Tiwari et al., 2022; Ortiz et al., 2022; Zhao et al., 2022; Ruan et al., 2024; Qiu et al., 2020; 2019; Yang et al., 2020), and videos (Chen et al., 2021a; Li et al., 2022; Yan et al., 2024). Despite performing well on individual modalities, INRs struggle to generalize to multiple and unseen data instances, as each instance is typically overfitted with a separate MLP. To address this, two key strategies have emerged: (1) learning content-specific input features (Yu et al., 2021; Hu et al., 2023; Lazova et al., 2023) and (2) modulating or customizing network parameters with latents or hypernetworks (Mehta et al., 2021; Wang et al., 2022; Dupont et al., 2022b; Kim et al., 2023; Xu et al., 2024a). Generative models (Goodfellow et al., 2014; Ho et al., 2020) further extend INRs' capability to generate new data.
GRAF (Schwarz et al., 2020) and GIRAFFE (Niemeyer & Geiger, 2021) generate shape and appearance codes from noise, which are combined with coordinates to construct scenes. Erkoç et al. (2023) use a diffusion model to generate INR weights. Dupont et al. (2022c); Du et al. (2021); Koyuncu et al. (2023) train hypernetworks to generate INR parameters. Dupont et al. (2022a); Bauer et al. (2023); Park et al. (2024) employ a two-stage framework to learn the distribution of latents that map to or modulate INRs, and generate new content by sampling in the latent space. mNIF (You et al., 2024) further enhances the expressivity of INRs via model averaging. These methods essentially model the distribution of INR parameters $p(\theta)$ by learning the latent distribution $p(h)$, but fail to capture the layer-wise hierarchical structure of INR parameters (Section 2.2), limiting their ability to accurately model distributions and control generation. Building on the latent modulation approach, we introduce a hierarchical conditional diffusion model, capturing dependencies between layer-wise latents for improved generalization and control.

2.2. Hierarchy Analysis of INRs

In this section, we review the INR architecture and analyze its inherent hierarchical representation ability. Using SIREN (Sitzmann et al., 2020) as an example, a two-layer SIREN is generally formulated as

$$f_\theta(x) = W_2 \sin(W_1 \gamma(x)), \quad \theta = [W_1, W_2], \tag{1}$$

where $\gamma(x) = \sin(\Omega x)$, $\Omega \in \mathbb{R}^{c_1 \times c_{in}}$ denotes the positional embedding of coordinates $x$, and $W_2 \in \mathbb{R}^{c_{out} \times c_2}$, $W_1 \in \mathbb{R}^{c_2 \times c_1}$ denote the parameters of each layer. Biases are omitted for simplicity. From the perspective of the Fourier transform, the input frequency domain $\Omega$ is composed of $c_1$ frequency bases, $\Omega = [\Omega_0, \Omega_1, \dots, \Omega_{c_1-1}]$. According to Tancik et al. (2020) and Yüce et al. (2022), an MLP layer with periodic activation $\sin(\cdot)$ only expands the input frequency basis within a sparse and limited bandwidth.
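This sparse frequency expansion can be checked numerically. The sketch below (an illustrative experiment, not from the paper) passes a single tone through a sine activation and inspects its spectrum: by the Jacobi–Anger/Bessel expansion, $\sin(z\sin\theta)$ contains only odd harmonics of $\theta$, so the output energy stays on a sparse integer lattice over the input frequency basis.

```python
import numpy as np

# Sample one period densely and inspect the spectrum of a sine-activated layer.
N = 1024
t = np.arange(N) / N                 # one period of the base grid
base = np.sin(2 * np.pi * 8 * t)     # input tone at FFT bin 8 (Omega_0 = 8)
out = np.sin(3.0 * base)             # sine activation with scalar weight w = 3

spec = np.abs(np.fft.rfft(out)) / N  # amplitude of harmonic m is ~J_m-weighted
active = np.nonzero(spec > 1e-3)[0]  # bins carrying non-negligible energy
print(active)

# Only odd multiples of bin 8 appear: the expanded spectrum stays on the
# integer lattice {beta * Omega_0, beta odd}, i.e. it is sparse.
assert all(b % 8 == 0 and (b // 8) % 2 == 1 for b in active)
```

The surviving bin amplitudes are Bessel-function values $J_{2k+1}(3)$, matching the role of $J_{s_c}$ in the layer-spectrum expansion discussed above; higher harmonics decay rapidly, which is the "limited bandwidth" part of the claim.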
Eq. (1) can then be reformulated as

$$f_\theta(x) = \sum_{w \in \mathcal{H}(\Omega)} \alpha_w \sin(w^\top x), \quad \alpha_w \propto \prod_{c=0}^{c_1-1} J_{s_c}\big(W_1[\cdot, c]\big), \quad \mathcal{H}(\Omega) = \Big\{ \sum_{c=0}^{c_1-1} \beta_c \Omega_c \,\Big|\, \beta_c \in \mathbb{Z} \Big\}, \tag{2}$$

where $J_{s_c}$ denotes a Bessel function and $W_1[\cdot, c]$ denotes column $c$ of $W_1$. Eq. (2) reveals two properties of each $\sin(\cdot)$-activated INR layer. First, the output spectrum of layer 2, i.e., $\alpha_w$, depends on the spectrum of layer 1, determined by $W_1$. Second, the output frequency domain $\mathcal{H}(\Omega)$ is sparse since each $\beta_c$ is an integer, so $\mathcal{H}(\Omega)$ only covers the sparse frequency space spanned by the basis $\{\beta_c \Omega_c\}$. These observations suggest that the spectra and frequency bases of INR layers inherently exhibit a sparse and hierarchical structure, encoded by $\theta$, which extends to their representation ability. Latent-modulation approaches such as mNIF (You et al., 2024), which model the parameter distribution $p(\theta)$ through the surrogate task of modeling the latent distribution $p(h)$, overlook the hierarchy in $h$ transferred from $\theta$. Ignoring this hierarchy leads to reduced expressivity and generalizability, and limited control over the generation process.

3. Proposed Method

Our method uses a two-stage framework to align the hierarchy of data semantics with the INR's expanded representational ability, as shown in Fig. 2. In Stage-1, we train individual INRs to fit a target dataset. In Stage-2, we use generative models to learn the weight distribution in order to generate new data. Directly modeling the INR weight distribution raises three challenges: (1) independently trained INRs make it hard to extract shared information for distribution learning, (2) the high dimensionality of raw weights makes distribution modeling highly challenging, and (3) it ignores the hierarchical structure of INRs. To address these, we configure the INR as a Layers-of-Experts (LoE) network, where each layer contains a set of shared expert weights and an instance-specific latent. As shown in Fig. 2 (left), the inference process builds the INR layer by layer, combining experts with corresponding latents.
This structure captures shared information through the experts, simplifies distribution learning by focusing on the latents, and explicitly models conditional dependencies within the hierarchical structure of INRs.

Figure 2. CHINR consists of two stages. In Stage-1, a Layers-of-Experts (LoE) model represents data with instance-specific latents and shared experts. The latent at each layer (shaded differently) modulates the mixture of experts at that layer. Stage-2 introduces a Hierarchical Conditional Diffusion Model (HCDM) to learn layer-wise conditional distributions of latents. At inference, we sample latents along the conditional chain to achieve hierarchical control.

In the following sections, we first define the LoE structure and the learning task, followed by detailed explanations of the two stages.

3.1. Problem Statement

Suppose an INR $f_\theta$ has $L$ layers. For layer $l$, we learn a collection of $K$ cross-data shared expert weights $\theta^l = \{\theta^l_1, \theta^l_2, \dots, \theta^l_K\}$ (fully connected layers) and a unique latent $h^l \in \mathbb{R}^H$ for each data instance. At inference, the operation at layer $l$ is $y^{l+1} = \sin(\omega_0 \, \tilde{\theta}^l y^l)$, where $y^l$ represents each layer's output and $\omega_0$ is a constant factor. $\tilde{\theta}^l = \sum_{k=1}^{K} \theta^l_k \alpha^l_k$ denotes the instance-specific parameters at layer $l$, modulated by a gating vector $\alpha^l$, which is computed by a gating module $g_\phi(h^l) = \alpha^l = [\alpha^l_1, \alpha^l_2, \dots, \alpha^l_K]$. Compared with directly learning the gating vectors, $g_\phi(\cdot)$ allows for a more compact latent $h^l$ that benefits distribution learning. By modulating the contribution of experts via latents, each layer gains the flexibility to adapt to individual data samples on a shared basis. As $L$ layers are cascaded to form the final INR, its expressive capacity is significantly enhanced through the integrated contributions across layers.
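A single LoE layer can be sketched in a few lines of numpy. This is a minimal illustration of the mechanism above, with assumed details: the softmax gating, random weight initializations, and the dimensions `K`, `H`, `d` are all placeholders, and the paper's gating module $g_\phi$ need not be this simple linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, d = 4, 16, 32      # experts per layer, latent dim, feature width
omega0 = 30.0            # SIREN-style frequency factor

# Shared expert weights for one layer: K fully connected experts.
experts = rng.normal(size=(K, d, d)) / np.sqrt(d)

# Gating module g_phi: maps a latent h^l to expert mixing weights alpha^l.
G = rng.normal(size=(K, H)) / np.sqrt(H)

def gate(h):
    logits = G @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()               # softmax over K experts (an assumption)

def loe_layer(y, h):
    alpha = gate(h)                              # alpha^l = g_phi(h^l)
    theta = np.tensordot(alpha, experts, axes=1) # theta^l = sum_k alpha^l_k theta^l_k
    return np.sin(omega0 * (y @ theta.T))        # y^{l+1} = sin(omega0 theta^l y^l)

y = rng.normal(size=(5, d))          # batch of layer inputs
h = rng.normal(size=H)               # instance-specific latent for this layer
print(loe_layer(y, h).shape)         # (5, 32)
```

Stacking L such layers, each with its own latent `h`, yields the full LoE INR: the experts are shared across the dataset, while the per-layer latents carry all instance-specific information.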
In Stage-1, we optimize the shared network parameters $\theta = \{\theta^1, \theta^2, \dots, \theta^L, \phi\}$ and fully characterize each instance-specific INR by the layer-wise stacked latents $h = [h^1, h^2, \dots, h^L] \in \mathbb{R}^{H \times L}$. This layer-wise structure enables hierarchical modeling of INR parameters, which aligns with the data semantic hierarchy, allowing for layer-wise dependency modeling in Stage-2 and controllable data generation.

3.2. Stage-1: Learning a Dataset of LoE INRs

Similar to Functa (Dupont et al., 2022a) and mNIF (You et al., 2024), we use meta-learning and auto-decoding to train both the data-specific latents $h$ and the shared parameters $\theta$ of the LoE INR during Stage-1. For meta-learning, we adopt an interleaved training procedure inspired by CAVIA (Zintgraf et al., 2019), where the experts and latents are updated alternately in separate training loops. In the inner loop, we fix $\theta$ and adapt the latents $h$ to the data samples. Within each inner loop, $h$ is first randomly initialized around zero and then updated for a few steps. In the outer loop, $\theta$ is optimized based on the updated $h$. This ensures that each data-specific latent can be learned effectively within a few iterations, encouraging faster convergence and adaptation to new data, which is essential for distribution modeling and generalization in Stage-2. In the case of auto-decoding, we jointly optimize all parameters, maintaining a latent bank for the dataset and updating the sampled batch of latents in each iteration. Unlike meta-learning, auto-decoding does not require second-order derivatives, making it more computationally efficient. Due to this efficiency, we apply auto-decoding specifically for NeRF training. In both approaches, each data-specific latent $h$ consists of $L$ components that separately modulate the $L$ layers of the LoE INR.
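The interleaved inner/outer structure can be illustrated with a deliberately tiny stand-in problem. This is not the paper's LoE training: the "INR" here is a scalar model f(x) = theta * x + h with a shared slope theta and an instance-specific offset h, the learning rates and step counts are arbitrary, and the outer gradient is a first-order approximation (second-order meta-gradients are omitted).

```python
import numpy as np

# Toy stand-in for the Stage-1 interleaved procedure: a scalar "INR"
# f(x) = theta * x + h, where theta is shared and h is instance-specific.
theta = 0.0
tasks = [(2.0, b) for b in (-1.0, 0.5, 1.5)]   # shared slope 2, varied offsets
x = np.linspace(-1, 1, 32)
inner_lr, outer_lr, inner_steps = 0.5, 0.05, 3

for outer in range(200):
    grad_theta = 0.0
    for slope, offset in tasks:
        target = slope * x + offset
        h = 0.0                                  # re-initialized near zero
        for _ in range(inner_steps):             # inner loop: adapt latent only
            resid = theta * x + h - target
            h -= inner_lr * 2 * resid.mean()     # d(MSE)/dh
        resid = theta * x + h - target
        grad_theta += 2 * (resid * x).mean()     # first-order d(MSE)/dtheta
    theta -= outer_lr * grad_theta / len(tasks)  # outer loop: update shared part

print(round(theta, 2))   # converges to ~2.0, the shared structure
```

The point mirrors Stage-1: the latent absorbs instance-specific variation in a few inner steps, so the outer update is forced to push everything shared across instances into theta.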
This setup facilitates conditional distribution modeling in Stage-2, as opposed to the joint distribution learning of mNIF and Functa, enabling hierarchical and controllable generation, a crucial capacity lacking in prior works.

3.3. Stage-2: Conditional Distribution Learning

Given a collection of latents $\mathcal{H} = \{h_1, \dots, h_N \,|\, h_n \in \mathbb{R}^{H \times L}\}$ obtained from $N$ data instances, where $L$ denotes the number of layers and $H$ the dimension of each layer's latent, Stage-2 aims at learning the latent distribution $p(h)$. Instead of blindly modeling the joint distribution, we reformulate $p(h)$ as:

$$p(h) = p(h^1, h^2, \dots, h^L) = p(h^1) \prod_{l=2}^{L} p(h^l \,|\, h^1, \dots, h^{l-1}).$$
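Ancestral sampling along this factorization, and the hierarchical control it buys, can be sketched as follows. This is an illustrative toy, not the HCDM: linear-Gaussian conditionals stand in for the per-layer diffusion models, and the names `cond_weights`, `sample_hierarchy`, and `fix_prefix` are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
L, H = 4, 8   # number of layers and per-layer latent dimension

# Toy conditional samplers standing in for the HCDM: each layer's latent is
# drawn given the concatenation of all previous layers' latents.
cond_weights = [rng.normal(size=(H, l * H)) * 0.3 for l in range(L)]

def sample_hierarchy(fix_prefix=None):
    """Ancestral sampling along p(h^1) prod_{l=2}^L p(h^l | h^1..h^{l-1}).

    fix_prefix: optionally reuse the first few layers' latents, so divergence
    happens only in later layers (coarse semantics stay fixed).
    """
    latents = list(fix_prefix) if fix_prefix else []
    for l in range(len(latents), L):
        context = np.concatenate(latents) if latents else np.zeros(0)
        mean = cond_weights[l] @ context if l > 0 else np.zeros(H)
        latents.append(mean + rng.normal(size=H))   # sample h^l | h^{<l}
    return latents

a = sample_hierarchy()
b = sample_hierarchy(fix_prefix=a[:2])   # share coarse layers, vary fine ones
print([np.allclose(x, y) for x, y in zip(a, b)])   # [True, True, False, False]
```

Fixing a prefix of the conditional chain is exactly the tree-like routing control of Fig. 1: samples sharing the first two latents agree on coarse semantics and differ only in the detail layers.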