# Signal Processing for Implicit Neural Representations

Dejia Xu (dejia@utexas.edu), Peihao Wang (peihaowang@utexas.edu), Yifan Jiang (yifanjiang97@utexas.edu), Zhiwen Fan (zhiwenfan@utexas.edu), Zhangyang Wang (atlaswang@utexas.edu)
The University of Texas at Austin. Equal contribution noted in the original.
https://vita-group.github.io/INSP/
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Implicit Neural Representations (INRs), which encode continuous multimedia data via multi-layer perceptrons, have shown undeniable promise in various computer vision tasks. Despite many successful applications, editing and processing an INR remains intractable because signals are represented by the latent parameters of a neural network. Existing works manipulate such continuous representations by processing their discretized instances, which breaks the compactness and continuity of INRs. In this work, we present a pilot study on the question: how can we directly modify an INR without explicit decoding? We answer this question by proposing an implicit neural signal processing network, dubbed INSP-Net, built on differential operators over INRs. Our key insight is that spatial gradients of neural networks can be computed analytically and are invariant to translation, and we show mathematically that any continuous convolution filter can be uniformly approximated by a linear combination of high-order differential operators. With these two knobs, INSP-Net instantiates the signal processing operator as a weighted composition of computational graphs corresponding to the high-order derivatives of INRs, where the weighting parameters can be learned from data. Based on our proposed INSP-Net, we further build the first Convolutional Neural Network (CNN) that implicitly runs on INRs, named INSP-ConvNet. Our experiments validate the expressiveness of INSP-Net and INSP-ConvNet in fitting low-level image and geometry processing kernels (e.g., blurring, deblurring, denoising, inpainting, and smoothing) as well as in high-level tasks on implicit fields such as image classification.

1 Introduction

The idea that our visual world can be represented continuously has attracted increasing popularity in the field of implicit neural representations (INRs). Also known as coordinate-based neural representations, INRs learn to encode a coordinate-to-value mapping for continuous multimedia data. Instead of storing discrete signal values in a grid of pixels or voxels, INRs represent discrete data as samples of a continuous manifold. Built on multi-layer perceptrons, INRs bring practical benefits to various computer vision applications, such as image and video compression [1, 2, 3], 3D shape representation [4, 5, 6, 7, 8, 9, 10, 11], inverse problems [12, 2, 13, 14], and generative models [15, 16, 17, 18, 19, 20, 21, 22].

Despite their recent success, INRs are not yet amenable to the flexible editing and processing that standard images enjoy. The encoded coordinate-to-value mapping is too complex to comprehend, and the parameters stored in multi-layer perceptrons (MLPs) remain under-explored. One direction of existing approaches enables editing on INRs by training them with conditional input.

Figure 1: An illustration of implicit neural signal processing (pipeline: 1. fit an INR; 2. signal processing on the INR; 3. multi-tasking, e.g., image processing, geometric processing, blurring, inpainting, and classification).
Given an INR representing a digital signal, our INSP-Net performs signal processing directly, without explicitly decoding the INR. Our model first constructs derivative computation graphs of the original INR and then fuses them, via a learned combination, into a new INR, which can later be decoded into discretized forms such as image pixels. The framework is capable of fitting low-level image processing kernels as well as performing high-level processing such as image classification.

For example, [23, 24, 25, 20, 21, 26] utilize conditional codes to indicate different characteristics of the scene, including shape and color. Another main direction benefits from existing image editing techniques and operates on discretized instances of continuous INRs, such as pixels or voxels. However, such solutions break the continuous character of INRs, since the signal must be decoded and discretized before editing and processing.

In this paper, we conduct the first pilot study on the question: how can we generally modify an INR without explicit decoding? The major challenge is that one cannot directly interpret what the parameters of an INR stand for, let alone edit them correctly. Our key motivation is that spatial gradients serve as a favorable tool for this problem: they can be computed analytically and possess desirable invariance properties. Theoretically, we prove that any continuous convolution filter can be uniformly approximated by a linear combination of high-order differential operators. Based on these two rationales, we propose an Implicit Neural Signal Processing Network, dubbed INSP-Net, which processes an INR using high-order differential operators. INSP-Net is composed of an inception-style fusion block connecting the computational graphs corresponding to the derivatives of the INR. The weights in the branchy part are loaded from the INR being processed, while the weights in the fusion block are the parameters of the operator, which can be either hand-crafted or learned from data. Even though we cannot perform surgery on neural network parameters directly, we can process them implicitly by retrofitting their architecture and reorganizing the spatial gradients.

We further extend our framework to build the first Convolutional Neural Network (CNN) operating directly on INRs, dubbed INSP-ConvNet. Each layer of INSP-ConvNet is constructed by linearly combining the derivative computational graphs of the former layers. Nonlinear activation and normalization are naturally supported since they are element-wise functions. Data augmentation can also be implemented by augmenting the input coordinates of INRs. Under this pipeline (shown in Fig. 1), we demonstrate the expressiveness of our INSP-Net framework in fitting low-level image processing kernels including edge detection, blurring, deblurring, denoising, and image inpainting. We also successfully apply our INSP-ConvNet to high-level tasks on implicit fields such as classification.

Our main contributions can be summarized as follows:
- We propose a novel signal processing framework, dubbed INSP-Net, that operates on INRs analytically and continuously via closed-form high-order differential operators (by closed-form, we mean the computation follows an analytical mathematical expression).
- By repeatedly cascading the computational paradigm of INSP-Net, we also build a convolutional network, called INSP-ConvNet, which runs directly on implicit fields for high-level tasks.
- We illustrate the advantage of adopting differential operators by revealing their inherent group invariance.
- Furthermore, we rigorously prove that the convolution operator in the continuous regime can be uniformly approximated by a linear combination of gradients.
- Extensive experiments demonstrate the effectiveness of our approach in both low-level processing (e.g., edge detection, blurring, deblurring, denoising, image inpainting, and smoothing) and high-level processing such as image classification.

Figure 2: Left: an overview of our INSP-Net framework. Each layer combines the high-order derivative computational graphs of the original INR network. Right: the weight-sharing scheme used in calculating the derivative sub-networks.

2 Preliminaries: Implicit Neural Representation

Implicit Neural Representation (INR) parameterizes continuous multimedia signals or vector fields with neural networks. Formally, we consider an INR as a continuous function $\Phi : \mathbb{R}^m \to \mathbb{R}$ that maps low-dimensional spatial/temporal coordinates to the value space (without loss of generality, we simplify $\Phi$ to a scalar field, i.e., the range of $\Phi$ is one-dimensional). For example, to represent 2D image signals, the domain of $\Phi$ is the $(x, y)$ spatial coordinates, and the range of $\Phi$ is the pixel intensities. The typical use of an INR is to solve a feasibility problem where $\Phi$ is sought to satisfy a set of $N$ constraints $\{\mathcal{C}(\Phi, a_j|_{\Omega_j})\}_{j=1}^N$, where $\mathcal{C}$ is a functional that relates the function $\Phi$ to some observable quantities $a_j$ evaluated over a measurable domain $\Omega_j \subseteq \mathbb{R}^m$. This problem can be cast as an optimization that minimizes the deviation from each of the constraints:

$$\Phi^* = \arg\min_{\Phi} \sum_{j=1}^{N} \big\|\mathcal{C}(\Phi, a_j|_{\Omega_j})\big\|^2. \quad (1)$$

For instance, if we let $\mathcal{C} = \Phi(x_j) - a_j$ with $\Omega_j = \{x_j\}$, then our objective boils down to point-to-point supervision that memorizes a signal into $\Phi$ [27]. When the functional $\mathcal{C}$ is a combination of differential operators taking values on a point set, i.e., $\mathcal{C}(a(x), \Phi(x), \nabla\Phi(x), \dots)$ for all $x \in \Omega_j$, Eq. 1 amounts to solving a system of differential equations [28, 7, 29]. Note that in this paper, unless otherwise specified, all gradients are computed with respect to the input coordinate $x$. $\mathcal{C}$ can also form an integral equation system over some intervals $\Omega_j$ [12]. In computer vision practice, we reconstruct a signal by capturing sparse observations $D = \{(\Omega_j, a_j)\}_{j=1}^N$ from the unknown function $\Phi$, and dynamically sample mini-batches from $D$ to minimize Eq. 1 and obtain a feasible $\Phi$.

A handy parameterization of the function $\Phi$ is a fully-connected neural network, which enables solving Eq. 1 via gradient descent through a differentiable $\mathcal{C}$. Common INR networks consist of pure Multi-Layer Perceptrons (MLPs) with periodic activation functions. Fourier Feature Mapping (FFM) [27] places a sinusoidal transformation before the MLP, while the Sinusoidal Representation Network (SIREN) [28] replaces every piece-wise linear activation with a sinusoidal function. Below we give a unified formulation of INR networks:

$$\Phi(x) = W_n (\phi_{n-1} \circ \phi_{n-2} \circ \cdots \circ \phi_1)(x), \qquad \phi_i(x) = \sigma_i(W_i x + b_i), \quad (2)$$

where $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ and $b_i \in \mathbb{R}^{d_i}$ are the weight matrix and bias of the $i$-th layer, respectively, $n$ is the number of layers, and $\sigma_i(\cdot)$ is an element-wise nonlinear activation function. For the FFM architecture, $\sigma_1 = \sin(\cdot)$ denotes the positional encoding layer [12, 30], and otherwise $\sigma_i = \mathrm{ReLU}(\cdot)$. For SIREN, $\sigma_i = \sin(\cdot)$ for every layer $i = 1, \dots, n-1$.
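To make the preliminaries concrete, below is a minimal sketch (ours, not the authors' released code) of fitting a SIREN INR to an image by the point-to-point supervision $\mathcal{C} = \Phi(x_j) - a_j$ of Eq. 1. Layer widths, the frequency scale `w0`, and the random stand-in data are assumptions for illustration; SIREN's principled weight initialization is omitted for brevity.

```python
import torch
import torch.nn as nn

class Siren(nn.Module):
    """Eq. 2 with sigma_i = sin: a SIREN-style MLP Phi: R^2 -> R."""
    def __init__(self, in_dim=2, hidden=256, out_dim=1, n_layers=4, w0=30.0):
        super().__init__()
        dims = [in_dim] + [hidden] * n_layers
        self.layers = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(n_layers)])
        self.head = nn.Linear(hidden, out_dim)  # W_n, the linear output layer
        self.w0 = w0

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            # element-wise sinusoidal activation; w0 scales the first layer
            x = torch.sin((self.w0 if i == 0 else 1.0) * layer(x))
        return self.head(x)

# stand-in samples: coordinates in [-1, 1]^2 and their pixel intensities
coords = torch.rand(1024, 2) * 2 - 1
values = torch.rand(1024, 1)

phi = Siren()
opt = torch.optim.Adam(phi.parameters(), lr=1e-4)
for step in range(2000):
    loss = ((phi(coords) - values) ** 2).mean()  # Eq. 1, C = Phi(x_j) - a_j
    opt.zero_grad()
    loss.backward()
    opt.step()
```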
3 Implicit Representation Processing via Differential Operators

Digital Signal Processing (DSP) techniques have been widely applied in computer vision tasks, such as image restoration [31], signal enhancement [32], and geometric processing [33]; even modern deep learning models are built from the most basic signal processing operators. Suppose we have already acquired an Implicit Neural Representation (INR) $\Phi : \mathbb{R}^m \to \mathbb{R}$; we are now interested in whether we can run a signal processing program on the implicitly represented signal. One straightforward solution is to rasterize the implicit field onto a 2D/3D lattice and run a typical kernel on the pixel/voxel grids. However, this decoding strategy produces a finite resolution and discretizes the signal, which is memory-inefficient and unfriendly to modeling fine details. In this section, we introduce a computation paradigm that can process an INR analytically with spatial/temporal derivatives. We show that our proposed method serves as a universal operator that can represent any continuous convolutional kernel.

3.1 Computational Paradigm

It has not escaped our notice that spatial/temporal gradients of INRs, $\nabla^k\Phi$, can be computed analytically thanks to the differentiability of neural networks. Inspired by this, we propose an Implicit Neural Signal Processing (INSP) framework that composes a class of closed-form operators for INRs using functional combinations of high-order derivatives. We denote our proposed signal processing operator, built upon high-order derivatives, by $\mathcal{A}$. Given an acquired INR $\Phi$, we denote the resultant INR processed by the operator $\mathcal{A}$ as $\Psi = \mathcal{A}\Phi : \mathbb{R}^m \to \mathbb{R}$. To evaluate the processed INR at a point $x \in \mathbb{R}^m$, we propose the following computational paradigm:

$$\Psi(x) := \mathcal{A}\Phi(x) = \Gamma\big(\Phi(x), \nabla\Phi(x), \nabla^2\Phi(x), \dots, \nabla^k\Phi(x), \dots\big), \quad (3)$$

where $\Gamma : \mathbb{R}^M \to \mathbb{R}$ can be an arbitrary continuous function, either handcrafted or learned from data. To learn an operator $\mathcal{A}$ from data, we represent $\Gamma$ by a Multi-Layer Perceptron (MLP) with parameters $\theta$. Here we slightly abuse the notation $\nabla^k$ to denote a flattened vector of high-order derivatives without multiplicity, since differential operators defined over continuous functions form a commutative ring. The input dimension of $\Gamma$ depends on the highest order of derivatives used: if we compute derivatives up to the $K$-th order, then $M = \sum_{k=0}^{K} M_k$, where $M_k$ is the number of distinct $k$-th order differential operators (equal to the number of monic monomials over $\mathbb{R}^m$ of degree $k$, i.e., $M_k = \binom{m+k-1}{k}$).

Intuitively, directional derivatives encode (local) neighborhood information, and can thus have effects similar to a convolution. As we will show in Sec. 3.2, $\Gamma$ can construct both shift-invariant and rotation-invariant operators, which introduces a favorable inductive bias for image and 3D geometry processing. More importantly, we rigorously prove that Eq. 3 is also a universal approximator of arbitrary convolutional operators.

We note that $\Psi(x)$ as a whole can also be regarded as a neural network. Recall the architecture of $\Phi(x)$ in Eq. 2: its $k$-th order derivative is another computational graph, parameterized by $W_i$ and $b_i$, that maps $x$ to $\nabla^k\Phi(x)$. For example, by the chain rule the first-order gradient has the form

$$\nabla\Phi(x)^\top = W_n\,\hat\phi_{n-1}(z_{n-2})\,\hat\phi_{n-2}(z_{n-3}) \cdots \hat\phi_1(x), \qquad \hat\phi_i(y) := \mathrm{diag}\big(\sigma_i'(W_i y + b_i)\big)\,W_i, \quad (4)$$

where $z_i = (\phi_i \circ \cdots \circ \phi_1)(x)$ and $\sigma_i'(\cdot)$ is the first-order derivative of $\sigma_i(\cdot)$. Since $\hat\phi_i$ shares its weights with $\phi_i$, $\nabla\Phi$ is represented by a closed-form computational network that re-uses the weights of $\Phi$, which we refer to as the first-order derivative network.
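The derivative networks can be materialized with PyTorch auto-differentiation. The sketch below is our illustration with assumed names (`phi` is a fitted INR such as the SIREN sketched in Sec. 2); it assembles the feature vector of Eq. 3 truncated at second order. For simplicity it keeps the full $m \times m$ Hessian, i.e., mixed partials appear twice, whereas the paper drops this multiplicity since partial derivatives commute.

```python
import torch
import torch.nn as nn

def inr_derivatives(phi, x):
    """Features of Eq. 3 up to 2nd order: [Phi(x), grad Phi(x), vec Hessian]."""
    if not x.requires_grad:                  # allow nesting: reuse x if it
        x = x.clone().requires_grad_(True)   # already sits in an outer graph
    y = phi(x)                                                    # (N, 1)
    # create_graph=True keeps the result differentiable, so the fusion MLP
    # can be trained end-to-end and higher derivatives remain available
    grad = torch.autograd.grad(y.sum(), x, create_graph=True)[0]  # (N, m)
    rows = [torch.autograd.grad(grad[:, j].sum(), x, create_graph=True)[0]
            for j in range(x.shape[1])]            # j-th row of the Hessian
    hess = torch.stack(rows, dim=1)                               # (N, m, m)
    return torch.cat([y, grad, hess.flatten(1)], dim=1)   # (N, 1 + m + m*m)

# the fusion function Gamma: a small MLP mapping derivative features to Psi(x)
m = 2
gamma = nn.Sequential(nn.Linear(1 + m + m * m, 64), nn.ReLU(), nn.Linear(64, 1))
x = torch.rand(8, m) * 2 - 1
psi = gamma(inr_derivatives(phi, x))   # the processed INR evaluated at x
```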
Higher-order derivatives induce derivative networks of a similar form. The processed INR therefore has an Inception-like architecture: a multi-branch structure connecting the original INR network and weight-sharing derivative sub-networks, followed by a fusion layer $\Gamma$. We call the entire model ($\Psi = \mathcal{A}\Phi$, Eq. 3) an Implicit Neural Signal Processing Network, or INSP-Net. Note that the only free parameters of an INSP-Net are located in the last fusion layer and can be trained in an end-to-end manner. We illustrate an INSP-Net in Fig. 2, where color indicates the weight-sharing scheme; a similar scheme is also adopted in AutoInt [34]. In practice, we employ auto-differentiation in PyTorch [35] to automatically create the derivative networks and assemble them in parallel to constitute the architecture of an INSP-Net. Given an input INR, we load its weights into our model following the weight-sharing scheme, obtaining an INSP-Net that implicitly and continuously represents the processed INR $\Psi(x)$. To effectively express high-order derivatives, we choose SIREN as the base model [28].

3.2 Theoretical Analysis

In this section, we provide a theoretical justification for the design of our INSP-Net, focusing on its latent invariance property and its expressive power.

Translation and Rotation Invariance. Group invariance has been shown to be a favorable inductive bias for image [36], video [37], and geometry processing [38]. It is also well known that group invariance is an intrinsic property of Partial Differential Equations (PDEs) [39, 40]. Since our INSP-Net is built from differential operators, we are motivated to reveal its hidden invariance properties to demonstrate its advantage in processing visual signals. Here we consider two transformation groups: the translation group $T(m)$ and the special orthogonal group $SO(m)$ (a.k.a. the rotation group). An element $T_v \in T(m)$ shifts the function $\Phi$ by an offset $v \in \mathbb{R}^m$; the shifted function can be denoted $\Phi \circ T_v(x) = \Phi(x + v)$. Similarly, an element of the rotation group performs a coordinate transformation on $\Phi$ by a rotation matrix $R \in SO(m)$; the transformed function can be written as $\Phi \circ R(x) = \Phi(Rx)$. Group invariance means that deforming the input space of a function first and then processing it with an operator is equivalent to directly applying the transformation to the processed function. More rigorously, $\mathcal{A}$ is said to be translation-invariant if $\forall T_v \in T(m)$, $\Psi(x + v) = \mathcal{A}[\Phi \circ T_v](x)$. Likewise, $\mathcal{A}$ is rotation-invariant if $\forall R \in SO(m)$, $\Psi(Rx) = \mathcal{A}[\Phi \circ R](x)$. Theorem 1 below characterizes the invariance properties of our model.

Theorem 1. Given a function $\Gamma : \mathbb{R}^M \to \mathbb{R}$, the composed operator $\mathcal{A}$ (Eq. 3) satisfies: 1. shift invariance, for every $\Gamma$; 2. rotation invariance, if $\Gamma$ has the form $\Gamma(y) = f(\|y\|_2)$ for some $f : \mathbb{R} \to \mathbb{R}$.

We prove Theorem 1 in Appendix A. Theorem 1 implies that the operator $\mathcal{A}$ is inherently shift-invariant, owing to the shift invariance intrinsic to differential operators, as we show in the proof. Rotation invariance is not guaranteed in general; however, with a carefully designed $\Gamma$ it can also be achieved within our framework, and Theorem 1 suggests a feasible construction of a rotation-invariant operator $\mathcal{A}$: $\Gamma$ first isotropically pools the squares of all directional derivatives, and then maps the summarized information through another scalar function $f$.
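As one concrete instance of this construction (a sketch under our assumptions, not the paper's exact parameterization), the rotation-invariant form $\Gamma(y) = f(\|y\|_2)$ from Theorem 1 can be implemented by pooling the squared derivative responses and feeding the resulting norm through a small learnable scalar map $f$:

```python
import torch
import torch.nn as nn

class RotationInvariantGamma(nn.Module):
    """Gamma(y) = f(||y||_2), the form required by Theorem 1, case 2."""
    def __init__(self, hidden=32):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, feats):          # feats: (N, M) derivative features
        # isotropic pooling over squared responses, then the scalar map f
        norm = feats.pow(2).sum(dim=1, keepdim=True).sqrt()
        return self.f(norm)
```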
We refer interested readers to [39] for more group invariances in differential form.

Universal Approximation. Convolution, formally known as the linear shift-invariant operator, is one of the most prevalent signal processing tools in the vision domain. Given two (real-valued) signals $f$ and $g$, we denote their convolution by $g * f = f * g$. We now examine the expressiveness of our INSP-Net (Eq. 3) by showing that it can represent any convolutional filter. We first present an informal version of our main result:

Theorem 2. (Informal statement) For every real-valued function $g : \mathbb{R}^m \to \mathbb{R}$, there exists a polynomial $p(x_1, \dots, x_m)$ with real coefficients such that $p(\nabla)f$ can uniformly approximate $g * f$ to arbitrary precision, for all real-valued signals $f$.

The formal statement and proof can be found in Appendix B. Theorem 2 involves the notion of polynomials in partial differential operators (see Appendix B); $p(\nabla)f$ can in turn be written as a linear combination of high-order derivatives of $f$ (a special case of Eq. 3 where $\Gamma$ is linear). The key step in proving Theorem 2 is applying the Stone-Weierstrass approximation theorem in the Fourier domain. However, functions obtained by the Fourier transform are generally complex-valued; the prominence of our proof is that we constrain the polynomial coefficients to the real domain, which makes the construction implementable with common deep learning infrastructure.

The implication of Theorem 2 is that the mapping between convolution and derivatives is as simple as a linear transformation. Recent works [41, 42, 43] show the converse: derivatives can be approximated by a linear combination of discrete convolutions. Theorem 2 thus establishes the equivalence between differential operators and convolution in the continuous regime. In our proof, $k$-th order derivatives correspond to $k$-th order monomials in the spectral domain, so fitting a convolution with derivatives amounts to approximating its spectrum with polynomials; a higher polynomial degree yields a closer approximation. Since $p(\nabla)$ is easily approximated by a neural network $\Gamma$, we obtain Corollary 3.

Corollary 3. For every real-valued function $g$, there exists a neural network $\Gamma$ such that $\Psi = \mathcal{A}\Phi$ (Eq. 3) can uniformly approximate $g * \Phi$ to arbitrary precision, for every real-valued signal $\Phi$.

As discussed in Theorem 1, $\mathcal{A}$ is always shift-invariant. This means that when approximating a convolutional kernel, the trajectory of $\mathcal{A}$ is restricted to the shift-invariant space. Moreover, we emphasize that INSP-Net is far more expressive than convolutional kernels, since $\Gamma$ can also fit any nonlinear continuous function by the universal approximation theorem [44, 45, 46].
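Theorem 2 can also be checked numerically. In the 1-D sketch below (ours; the band $[-4, 4]$, the width $\sigma$, and the order $K$ are arbitrary illustrative choices), $d^k/dx^k$ has Fourier symbol $(i\omega)^k$, so fitting real coefficients $c_k$ such that $\sum_k c_k (i\omega)^k \approx \hat g(\omega)$ on a frequency band realizes Gaussian blurring $g * f$ as a polynomial in derivatives. Only even orders are kept, because the real part of $(i\omega)^k$ vanishes for odd $k$ while a Gaussian spectrum is real and even.

```python
import torch

sigma, K = 0.5, 8                       # kernel width; highest (even) order
w = torch.linspace(-4.0, 4.0, 200)      # frequency band of interest
target = torch.exp(-0.5 * (sigma * w) ** 2)        # Gaussian spectrum g_hat
# columns are the real symbols of d^k/dx^k: (i w)^k = (-1)^(k/2) w^k, k even
A = torch.stack([(-1) ** (k // 2) * w ** k for k in range(0, K + 1, 2)], dim=1)
coef = torch.linalg.lstsq(A, target.unsqueeze(1)).solution    # real c_k
err = (A @ coef - target.unsqueeze(1)).abs().max()
print(f"max spectral error on the band: {err.item():.4f}")  # shrinks with K
```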
3.3 Building CNNs for Implicit Neural Representations

Convolutional Neural Networks (CNNs) are capable of extracting informative semantics by merely stacking basic signal processing operators. This motivates us to build CNNs based on INSP-Net that run directly on INRs for high-level downstream tasks. To simulate exact convolution, Theorem 2 suggests simplifying $\Gamma$ to a linear mapping, so the computational paradigm of Eq. 3 becomes:

$$\Psi(x) := p(\nabla)\Phi(x) = \alpha_0 \Phi(x) + \alpha_1^\top \nabla\Phi(x) + \alpha_2^\top \nabla^2\Phi(x) + \cdots + \alpha_k^\top \nabla^k\Phi(x) + \cdots, \quad (5)$$

where $\alpha_k \in \mathbb{R}^{M_k}$ are the parameters of the operator $p(\nabla)$. We name this special case of Eq. 3 INSP-Conv. One plausible implementation of INSP-Conv employs a one-layer MLP to represent $\Gamma$. When $\mathcal{A} = p(\nabla)$, INSP-Conv preserves both linearity and shift invariance as it evolves during training. We propose to repeatedly apply INSP-Conv with nonlinearities to INRs, mimicking a CNN-like architecture; we name this class of CNNs composed of multi-layer INSP-Conv (Eq. 5) INSP-ConvNet. Previous works [47, 48] extracting semantic features from INRs either lack local information, by point-wise mapping of an INR's intermediate representation to a semantic space, or explicitly rasterize the INR into regular grids. To the best of our knowledge, this is the first time a CNN can run directly on an implicit representation, thanks to the closed-form nature of INSP-Net. The overall architecture of INSP-ConvNet can be formulated as:

$$\mathrm{ConvNet}[\Phi](x) = \big(\mathcal{A}^{(L)} \circ \sigma \circ \mathcal{A}^{(L-1)} \circ \sigma \circ \cdots \circ \mathcal{A}^{(2)} \circ \sigma \circ \mathcal{A}^{(1)}\big)\,\Phi(x), \quad (6)$$

where $\sigma$ is an element-wise nonlinear activation, $L$ is the number of INSP-Net layers, and $\Phi$ is the input INR; $\circ$ denotes operator composition, and juxtaposition denotes an operator acting on a function. Due to the page limit, we defer a detailed introduction to INSP-ConvNet to Appendix C.
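The sketch below (ours, single channel and second-order truncation; real settings use more channels and orders) wires Eq. 5 and a two-layer composition of Eq. 6, reusing the `inr_derivatives` helper sketched in Sec. 3.1. Because each layer's output, viewed as a function of $x$, is again a computational-graph INR, the next layer can differentiate it directly:

```python
import torch
import torch.nn as nn

class INSPConv(nn.Module):
    """Psi = p(grad) Phi (Eq. 5): a linear map over derivative features."""
    def __init__(self, m=2):
        super().__init__()
        # one weight per feature: [alpha_0, alpha_1, alpha_2] concatenated
        self.alpha = nn.Linear(1 + m + m * m, 1, bias=False)

    def forward(self, phi, x):
        return self.alpha(inr_derivatives(phi, x))

conv1, conv2 = INSPConv(), INSPConv()
layer1 = lambda z: torch.sin(conv1(phi, z))  # sigma(A^(1) Phi), sigma = sin
out = conv2(layer1, x)                       # A^(2) sigma(A^(1) Phi): Eq. 6, L = 2
```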
4 Related Work

4.1 Implicit Neural Representation

Implicit Neural Representations (INRs) represent signals by continuous functions parameterized with multi-layer perceptrons (MLPs) [28, 27], in contrast to traditional discrete representations (e.g., pixels, meshes). Compared with other representations, continuous implicit representations can represent signals at infinite resolution and have become prevalent in image fitting [28], image compression [1, 49], and video compression [3]. In addition, INRs have been applied to more efficient and effective shape representation [4, 5, 6, 7, 8, 9, 10, 11], texture mapping [50, 51], inverse problems [12, 2, 13, 14], and generative models [15, 16, 17, 18, 19, 20, 21, 22]. There are also efforts to speed up the fitting of INRs [52] and improve their representation efficiency [53]. Editing and manipulating multimedia objects is gaining increasing interest and demand [54]; signal processing on implicit neural representations is therefore an important task worth investigating.

4.2 Editable Implicit Fields

Editing implicit fields has recently attracted much research interest. Several methods have been proposed to allow editing the reconstructed 3D scenes by rearranging objects or manipulating shape and appearance. One line of work alters the structure and color of objects by conditioning on latent codes for different characteristics of the scene [25, 20, 21, 26]. Another direction involves discretizing the continuous fields: by converting the implicit fields into pixels or voxels, traditional image and voxel editing techniques [55, 56] can be applied effortlessly. These approaches, however, cannot perform signal processing directly on continuous INRs. Functa [54] uses a latent code to control implicit functions. NID [57] represents neural fields as a linear combination of an implicit functional basis, which enables editing by changing sparse coefficients; however, such an editing scheme offers limited flexibility. Recently, NFGP [58] proposed to use neural fields for geometry processing by exploring various geometric regularizations. INS [59] distills stylized features into INRs via the neural style transfer framework [60]. Our INSP-Net, which makes smart use of closed-form differential operators, requires neither additional per-scene fine-tuning nor discretization to grids.

Figure 3: Edge detection. We fit natural images with SIREN and use our INSP-Net to process them implicitly into a new INR that can be decoded into edge maps. (Panels: input image, Sobel filter, Canny filter, Prewitt filter, INSP-Net.)

Figure 4: Image denoising. We fit noisy images with SIREN and train our INSP-Net to process them implicitly into a new INR that can be decoded into clean natural images. (PSNR/SSIM, in panel order: noisy input 20.14/0.60, mean filter 20.09/0.61, MPRNet 20.51/0.66, INSP-Net 24.02/0.76.)

4.3 PDE-Based Image Processing

Partial differential equations (PDEs) have been successfully applied to many tasks in image processing and computer vision, such as image enhancement [61, 62, 63], segmentation [64, 40], image registration [65], saliency detection [66], and optical flow computation [67]. Early traditional PDEs are written directly based on mathematical and physical understanding of the PDEs (e.g., anisotropic diffusion [61], shock filters [62], and curve-evolution-based equations [68, 69, 70]). Variational design methods [63, 71, 70] start from an energy function describing the desired properties of the output images and compute the Euler-Lagrange equation to derive the evolution equations. Learning-based attempts [40, 66] build PDEs from image pairs based on the assumption (without proof) that PDEs can be written as linear combinations of fundamental differential invariants. Although it might be feasible to let INRs solve such signal processing PDEs, one would need to re-fit an INR per case with an additional temporal axis, which is presumably sample-inefficient. The multi-layer structure of INSP-Net can be viewed as an unfolding network [72, 73] of the Euler method for solving time-variant PDEs [74]. We elaborate on this connection in Appendix D.
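To make the unfolding view concrete: one explicit Euler step of the heat equation $\partial_t\Phi = \Delta\Phi$ is a fixed-coefficient INSP-Conv layer (only the Laplacian entries of $\alpha_2$ are nonzero), so stacking $T$ layers unfolds $T$ Euler steps of a smoothing PDE. A sketch under our assumptions (`phi` is any differentiable INR callable, e.g., the SIREN above; the step size `tau` and the depth are illustrative, and the cost grows with each nested step):

```python
import torch

def laplacian(phi, x):
    """Trace of the Hessian of phi at x, via nested autograd."""
    if not x.requires_grad:
        x = x.clone().requires_grad_(True)
    grad = torch.autograd.grad(phi(x).sum(), x, create_graph=True)[0]
    lap = sum(torch.autograd.grad(grad[:, j].sum(), x, create_graph=True)[0][:, j]
              for j in range(x.shape[1]))
    return lap.unsqueeze(1)                                      # (N, 1)

def heat_step(phi, tau=0.05):
    # returns a new implicit function: Phi + tau * laplace(Phi)
    return lambda x: phi(x) + tau * laplacian(phi, x)

smoothed = phi
for _ in range(3):        # three unfolded Euler steps of the heat equation
    smoothed = heat_step(smoothed)
```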
5 Experiments

In this section, we evaluate the proposed INSP framework on several challenging tasks, using different combinations of $\Gamma$. First, we build low-level image processing filters using either hand-crafted or learnable $\Gamma$. Then, we construct convolutional neural networks with our INSP-ConvNet framework and validate their performance on image classification. More results and implementation details are provided in Appendices E and F.

5.1 Low-Level Vision for Implicit Neural Images

For low-level image processing, we operate on natural images from the Set5 [75], Set14 [76], and DIV2K [77] datasets. Originally collected for super-resolution, these images are diverse in style and content. Note that the unprocessed images presented in the figures are decoded from unprocessed INRs. Since our method operates directly on INRs, we first fit the images with INRs and then feed the INRs into our framework; the final output is another INR, which can be decoded into the desired images. The training set of our method consists of 90 INR examples, each built on the SIREN [28] architecture.

Figure 5: Image deblurring. We fit blurred images with SIREN and train our INSP-Net to process them implicitly into a new INR that can be decoded into clear natural images. (PSNR/SSIM, in panel order: blurred input 23.88/0.77, Wiener filter 21.72/0.52, MPRNet 26.73/0.83, INSP-Net 27.67/0.79.)

Figure 6: Image blurring. We fit natural images with SIREN and train our INSP-Net to process them implicitly into a new INR that can be decoded into blurred images. (Panels: original image, box filter, Gaussian filter, INSP-Net.)

Edge Detection. Since edges correspond to gradients in images, using the gradients of an INR to obtain edges is straightforward: $\alpha_1$ is set to 1 while all other coefficients are set to 0. We provide visual comparisons against the Sobel filter [78], the Canny detector [79], and the Prewitt operator [80] in Fig. 3.

Image Denoising. Among classical image denoising filters, we compare against the median filter and the mean filter, and we use MPRNet [81] as a learning-based baseline. The input noisy images are synthesized using additive Gaussian noise. Visual results are provided in Fig. 4.

Image Blurring. Image blurring is a low-pass filtering operation. We provide a visual comparison against classical filters, including a 3×3 box filter and a 3×3 Gaussian filter. The target images used for training our INSP-Net are the results of the Gaussian filter. Visual results are provided in Fig. 6.

Image Deblurring. We compare the proposed method with both a traditional algorithm (the Wiener filter [82]) and a learning-based algorithm (MPRNet [81]). We synthesize blurry images using Gaussian filters. As shown in Fig. 5, the Wiener filter produces severe artifacts, while MPRNet successfully reconstructs clear textures. INSP-Net generates results competitive with MPRNet and outperforms the Wiener filter.

Image Inpainting. We conduct two kinds of inpainting experiments: inpainting 30% randomly masked pixels, and removing overlaid text ("INSP-Net"). Comparison methods include the mean filter, the median filter, and LaMa [83], a learning-based method using Fourier convolutions for inpainting. As shown in Fig. 7, the mean and median filters partially restore the masked pixels but severely hurt the visual quality of the remaining regions, and they cannot handle the text region. LaMa successfully removes the text and inpaints the masked pixels. Our proposed method largely outperforms the filter-based algorithms and performs on par with LaMa.

5.2 Geometry Processing on Signed Distance Functions

We demonstrate that the proposed INSP framework can process not only images but also geometry. We adopt the Signed Distance Function (SDF) [25] to represent geometry in this section. We first fit an SDF from a point cloud following the training loss proposed in [28, 7]. Then we train our INSP-Net to simulate a Gaussian-like filter, analogous to image blurring, and apply the trained INSP-Net to process the specified INR. For visualization, we use the marching cubes algorithm to extract meshes from the SDF. We choose the Thai Statue, Armadillo, and Dragon from the Stanford 3D Scanning Repository [84, 85, 86, 87] to demonstrate our results.

Figure 7: Image inpainting. We fit the input images with SIREN and train our INSP-Net to process them implicitly into a new INR that can be decoded into natural images. Note that LaMa requires explicit masks to select the regions for inpainting, and the masks are only roughly provided. (PSNR/SSIM, in panel order of input, mean filter, median filter, LaMa, INSP-Net; first row: 12.60/0.43, 17.02/0.53, 18.99/0.63, 26.40/0.88, 23.07/0.76; second row: 26.98/0.96, 26.80/0.90, 26.41/0.88, 23.29/0.73, 33.44/0.95.)

Figure 8: Left: unprocessed geometry decoded from an unprocessed INR. Right: smoothed geometry decoded from the output INR of our INSP-Net. Best viewed in a zoomable electronic copy.

Fig. 8 exhibits our results on the Thai Statue. Our method smooths the surface of the geometry and erases high-frequency details, acting as a low-pass filter. We defer more results to Fig. 14.
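For reference, decoding a (processed) SDF INR for visualization can look like the sketch below (ours; `sdf` stands for any callable mapping (N, 3) coordinates to signed distances, e.g., the output INR of a trained INSP-Net). This explicit sampling is used only for visualization; the processing itself stays implicit.

```python
import torch
from skimage import measure   # marching cubes from scikit-image

res = 96                      # lattice resolution; evaluate in chunks if the
axes = [torch.linspace(-1.0, 1.0, res)] * 3   # INSP-Net derivatives are heavy
grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1)
vals = sdf(grid.reshape(-1, 3)).detach().reshape(res, res, res)
# extract the zero level set as a triangle mesh
verts, faces, normals, _ = measure.marching_cubes(vals.numpy(), level=0.0)
```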
5.3 Classification on Implicit Neural Representations

We demonstrate that the proposed INSP framework is capable not only of expressing low-level image processing filters but also of supporting high-level tasks such as image classification. To this end, we construct a 2-layer INSP-ConvNet consisting of two INSP-Net layers, each of which decomposes the INR via differential operators and combines them with a learnable $\Gamma$. As a baseline for fair comparison, we build a 2-layer depthwise ConvNet running on pixels, since it has expressiveness comparable to our INSP-ConvNet in theory. We also build a PCA + SVM method and an MLP classifier that classify INRs directly from their (vectorized) weight matrices. We evaluate the proposed INSP-ConvNet on the MNIST (28×28 resolution) and CIFAR-10 (32×32 resolution) datasets. For each dataset, we first fit each image into an implicit representation using SIREN [28]. Both experiments are optimized for 1000 epochs with the AdamW optimizer [88] and a learning rate of $10^{-4}$. Results are shown in Tab. 1.

| Accuracy | Depthwise CNN | PCA + SVM | MLP classifier | INSP-ConvNet |
| --- | --- | --- | --- | --- |
| MNIST | 87.6% | 11.3% | 9.8% | 88.1% |
| CIFAR-10 | 59.5% | 9.4% | 10.1% | 62.5% |

Table 1: Quantitative results of image classification. All methods except the depthwise CNN operate directly on the parameters of the INR, while the depthwise CNN operates on images decoded from the INR.

We categorize the depthwise CNN as an explicit method, since it requires extracting image grids from INRs before classification. PCA + SVM and the MLP classifier, which work in the network parameter space, can be regarded as two straightforward implicit baselines. We find that traditional classifiers can hardly classify INRs in weight space, owing to its high-dimensional, unstructured data distribution. Our method, however, effectively leverages the information implicitly encoded in INRs by exploiting their derivatives. As a consequence, INSP-ConvNet achieves classification accuracy on par with the CNN-based explicit method, which validates the representation learning power of INSP-ConvNet.
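A sketch of how such a classification pipeline can be wired up (our reconstruction under stated assumptions, not the authors' exact training code): two INSP-Conv layers (single channel here, using the `INSPConv` sketched in Sec. 3.3), evaluated on sampled coordinates, followed by average pooling over coordinates and a linear classification head, trained with cross-entropy and AdamW as in the paper.

```python
import torch
import torch.nn as nn

conv1, conv2 = INSPConv(m=2), INSPConv(m=2)
head = nn.Linear(1, 10)                     # 10 classes (MNIST / CIFAR-10)
params = (list(conv1.parameters()) + list(conv2.parameters())
          + list(head.parameters()))
opt = torch.optim.AdamW(params, lr=1e-4)    # optimizer and lr from Sec. 5.3

def logits_for_inr(phi, coords):
    feat = conv2(lambda z: torch.sin(conv1(phi, z)), coords)  # Eq. 6, L = 2
    return head(feat.mean(dim=0, keepdim=True))   # pool over coordinates

# one training step on a single (INR, label) pair with a dummy label
coords = torch.rand(256, 2) * 2 - 1
loss = nn.functional.cross_entropy(logits_for_inr(phi, coords),
                                   torch.tensor([3]))
opt.zero_grad()
loss.backward()
opt.step()
```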
6 Conclusion

Contribution. We present the INSP-Net framework, an implicit neural signal processing network that directly modifies an INR without explicit decoding. By incorporating differential operators on INRs, we instantiate the INR signal operator as a composition of computational graphs that can approximate any continuous convolution filter. Furthermore, we make the first effort to build a convolutional neural network that implicitly runs on INRs. While all other methods run on discrete grids, our experiments demonstrate that INSP-Net can achieve competitive results with entirely implicit operations.

Limitations. (Theory) Our theory only guarantees the expressiveness of convolution by allowing infinite-sequence approximation; constructing more expressive operators and more effective parameterizations of convolution remain widely open questions. (Practice) INSP-Net requires the computation of high-order derivatives, which is neither memory-efficient nor numerically stable; this hinders the scalability of our INSP-ConvNet, which requires recursive computation of derivatives. Addressing how to reconstruct INRs in a scalable manner is beyond the scope of this paper; all INRs used in our experiments are fitted by per-scene optimization.

Acknowledgement

Z. Wang is in part supported by an NSF Scale-MoDL grant (award number: 2133861).

References

[1] Emilien Dupont, Adam Goliński, Milad Alizadeh, Yee Whye Teh, and Arnaud Doucet. COIN: Compression with implicit neural representations. arXiv preprint arXiv:2103.03123, 2021.
[2] Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser Nam Lim, and Abhinav Shrivastava. NeRV: Neural representations for videos. Advances in Neural Information Processing Systems, 34, 2021.
[3] Yunfan Zhang, Ties van Rozendaal, Johann Brehmer, Markus Nagel, and Taco Cohen. Implicit neural video compression. arXiv preprint arXiv:2112.11312, 2021.
[4] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna, William T. Freeman, and Thomas Funkhouser. Learning shape templates with structured implicit functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7154–7164, 2019.
[5] Matan Atzmon, Niv Haim, Lior Yariv, Ofer Israelov, Haggai Maron, and Yaron Lipman. Controlling neural level sets. Advances in Neural Information Processing Systems, 32, 2019.
[6] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5939–5948, 2019.
[7] Amos Gropp, Lior Yariv, Niv Haim, Matan Atzmon, and Yaron Lipman. Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020.
[8] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.
[9] Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Occupancy flow: 4D reconstruction by learning particle dynamics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5379–5389, 2019.
[10] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 165–174, 2019.
[11] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In European Conference on Computer Vision, pages 523–540. Springer, 2020.
[12] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405–421. Springer, 2020.
[13] Vincent Sitzmann, Semon Rezchikov, William T. Freeman, Joshua B. Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. arXiv preprint arXiv:2106.02634, 2021.
[14] Ben Mildenhall, Peter Hedman, Ricardo Martin-Brualla, Pratul Srinivasan, and Jonathan T. Barron. NeRF in the dark: High dynamic range view synthesis from noisy raw images. arXiv preprint arXiv:2111.13679, 2021.
[15] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5799–5809, 2021.
[16] Terrance DeVries, Miguel Angel Bautista, Nitish Srivastava, Graham W. Taylor, and Joshua M. Susskind. Unconstrained scene generation with locally conditioned radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14304–14313, 2021.
[17] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
[18] Zekun Hao, Arun Mallya, Serge Belongie, and Ming-Yu Liu. GANcraft: Unsupervised 3D neural rendering of Minecraft worlds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14072–14082, 2021.
[19] Quan Meng, Anpei Chen, Haimin Luo, Minye Wu, Hao Su, Lan Xu, Xuming He, and Jingyi Yu. GNeRF: GAN-based neural radiance field without posed camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6351–6361, 2021.
[20] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021.
[21] Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. GRAF: Generative radiance fields for 3D-aware image synthesis. Advances in Neural Information Processing Systems, 33:20154–20166, 2020.
[22] Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. CIPS-3D: A 3D-aware generator of GANs based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788, 2021.
[23] Steven Liu, Xiuming Zhang, Zhoutong Zhang, Richard Zhang, Jun-Yan Zhu, and Bryan Russell. Editing conditional radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5773–5783, 2021.
[24] Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. CLIP-NeRF: Text-and-image driven manipulation of neural radiance fields. arXiv preprint arXiv:2112.05139, 2021.
[25] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 165–174, 2019.
[26] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5932–5941, 2019.
[27] Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, 2020.
[28] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems, 33:7462–7473, 2020.
[29] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[30] Ellen D. Zhong, Tristan Bepler, Bonnie Berger, and Joseph H. Davis. CryoDRGN: Reconstruction of heterogeneous cryo-EM structures using neural networks. Nature Methods, 18(2):176–185, 2021.
[31] Dejia Xu, Yihao Chu, and Qingyan Sun. Moiré pattern removal via attentive fractal network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 472–473, 2020.
[32] Yaqian Xu, Wenqing Zheng, Jingchen Qi, and Qi Li. Blind image blur assessment based on Markov-constrained FCM and blur entropy. In 2019 IEEE International Conference on Image Processing (ICIP), pages 4519–4523. IEEE, 2019.
[33] Peng-Shuai Wang, Xiao-Ming Fu, Yang Liu, Xin Tong, Shi-Lin Liu, and Baining Guo. Rolling guidance normal filter for geometric processing. ACM Transactions on Graphics (TOG), 34(6):1–9, 2015.
[34] David B. Lindell, Julien N.P. Martel, and Gordon Wetzstein. AutoInt: Automatic integration for fast neural volume rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14556–14565, 2021.
[35] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
[36] Jan J. Koenderink and Andrea J. van Doorn. Image processing done right. In European Conference on Computer Vision, pages 158–172. Springer, 2002.
[37] Hae-Kwang Kim and Jong-Deuk Kim. Region-based shape descriptor invariant to rotation, scale and translation. Signal Processing: Image Communication, 16(1-2):87–93, 2000.
[38] Niloy J. Mitra, Mark Pauly, Michael Wand, and Duygu Ceylan. Symmetry in 3D geometry: Extraction and applications. In Computer Graphics Forum, volume 32, pages 1–23. Wiley Online Library, 2013.
[39] Peter J. Olver. Applications of Lie Groups to Differential Equations, volume 107. Springer Science & Business Media, 2000.
[40] Risheng Liu, Zhouchen Lin, Wei Zhang, and Zhixun Su. Learning PDEs for image restoration via optimal control. In European Conference on Computer Vision, pages 115–128. Springer, 2010.
[41] Bin Dong, Qingtang Jiang, and Zuowei Shen. Image restoration: Wavelet frame shrinkage, nonlinear evolution PDEs, and beyond. Multiscale Modeling & Simulation, 15(1):606–660, 2017.
[42] Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. PDE-Net: Learning PDEs from data. In International Conference on Machine Learning, pages 3208–3216. PMLR, 2018.
[43] Zichao Long, Yiping Lu, and Bin Dong. PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. Journal of Computational Physics, 399:108925, 2019.
[44] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[45] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[46] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
[47] Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J. Davison. In-place scene labelling and understanding with implicit scene representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15838–15847, 2021.
[48] Suhani Vora, Noha Radwan, Klaus Greff, Henning Meyer, Kyle Genova, Mehdi S.M. Sajjadi, Etienne Pot, Andrea Tagliasacchi, and Daniel Duckworth. NeSF: Neural semantic fields for generalizable semantic segmentation of 3D scenes. arXiv preprint arXiv:2111.13260, 2021.
[49] Brandon Yushan Feng and Amitabh Varshney. SIGNET: Efficient neural representation for light fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14224–14233, 2021.
[50] Michael Oechsle, Lars Mescheder, Michael Niemeyer, Thilo Strauss, and Andreas Geiger. Texture fields: Learning texture representations in function space. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4531–4540, 2019.
[51] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2304–2314, 2019.
[52] Matthew Tancik, Ben Mildenhall, Terrance Wang, Divi Schmidt, Pratul P. Srinivasan, Jonathan T. Barron, and Ren Ng. Learned initializations for optimizing coordinate-based neural representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2846–2855, 2021.
[53] Jaeho Lee, Jihoon Tack, Namhoon Lee, and Jinwoo Shin. Meta-learning sparse implicit neural representations. Advances in Neural Information Processing Systems, 34:11769–11780, 2021.
[54] Emilien Dupont, Hyunjik Kim, S.M. Eslami, Danilo Rezende, and Dan Rosenbaum. From data to functa: Your data point is a function and you should treat it like one. arXiv preprint arXiv:2201.12204, 2022.
[55] James W. Hennessey, Wilmot Li, Bryan Russell, Eli Shechtman, and Niloy J. Mitra. Transferring image-based edits for multi-channel compositing. ACM Transactions on Graphics, 36(6), 2017.
[56] Jerry Liu, Fisher Yu, and Thomas Funkhouser. Interactive 3D modeling with a generative adversarial network. In 2017 International Conference on 3D Vision (3DV), pages 126–134. IEEE, 2017.
[57] Peihao Wang, Zhiwen Fan, Tianlong Chen, and Zhangyang Wang. Neural implicit dictionary learning via mixture-of-expert training. In International Conference on Machine Learning, pages 22613–22624. PMLR, 2022.
[58] Guandao Yang, Serge Belongie, Bharath Hariharan, and Vladlen Koltun. Geometry processing with neural fields. Advances in Neural Information Processing Systems, 34, 2021.
[59] Zhiwen Fan, Yifan Jiang, Peihao Wang, Xinyu Gong, Dejia Xu, and Zhangyang Wang. Unified implicit neural stylization. arXiv preprint arXiv:2204.01943, 2022.
[60] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[61] Pietro Perona and Jitendra Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–639, 1990.
[62] Stanley Osher and Leonid I. Rudin. Feature-oriented image enhancement using shock filters. SIAM Journal on Numerical Analysis, 27(4):919–940, 1990.
[63] Xue-Cheng Tai, Stanley Osher, and Randi Holm. Image inpainting using a TV-Stokes equation. In Image Processing Based on Partial Differential Equations, pages 3–22. Springer, 2007.
[64] Zhouchen Lin, Wei Zhang, and Xiaoou Tang. Designing partial differential equations for image processing by combining differential invariants, 2009.
[65] Lars Hömke, Claudia Frohn-Schauf, Stefan Henn, and Kristian Witsch. Total variation based image registration. In Image Processing Based on Partial Differential Equations, pages 343–361. Springer, 2007.
[66] Risheng Liu, Junjie Cao, Zhouchen Lin, and Shiguang Shan. Adaptive partial differential equation learning for visual saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3866–3873, 2014.
[67] Adam Rabcewicz. CLG method for optical flow estimation based on gradient constancy assumption. In Image Processing Based on Partial Differential Equations, pages 57–66. Springer, 2007.
[68] Guillermo Sapiro. Geometric Partial Differential Equations and Image Analysis. Cambridge University Press, 2006.
[69] Frédéric Cao. Geometric Curve Evolution and Image Processing. Springer Science & Business Media, 2003.
[70] Bart M. Haar Romeny. Geometry-Driven Diffusion in Computer Vision, volume 1. Springer Science & Business Media, 2013.
[71] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
[72] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, pages 399–406, 2010.
[73] Jialin Liu and Xiaohan Chen. ALISTA: Analytic weights are as good as learned weights in LISTA. In International Conference on Learning Representations (ICLR), 2019.
[74] Ricky T.Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.
[75] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie-Line Alberi Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference, pages 135.1–135.10. BMVA Press, 2012.
[76] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730. Springer, 2010.
[77] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[78] I. Sobel. An isotropic 3×3 image gradient operator. Presentation at the Stanford AI Project (1968), 2014.
[79] John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):679–698, 1986.
[80] Judith M.S. Prewitt et al. Object enhancement and extraction. Picture Processing and Psychopictorics, 10(1):15–19, 1970.
[81] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, and Ling Shao. Multi-stage progressive image restoration. In CVPR, 2021.
[82] Saeed V. Vaseghi. Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons, 2008.
[83] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with Fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
[84] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 303–312, 1996.
[85] Andrew Gardner, Chris Tchou, Tim Hawkins, and Paul Debevec. Linear light source reflectometry. ACM Transactions on Graphics (TOG), 22(3):749–758, 2003.
[86] Venkat Krishnamurthy and Marc Levoy. Fitting smooth surfaces to dense polygon meshes. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 313–324, 1996.
[87] Greg Turk and Marc Levoy. Zippered polygon meshes from range images. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, pages 311–318, 1994.
[88] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[89] Kathrin Schacke. On the Kronecker product. Master's thesis, University of Waterloo, 2004.
[90] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.
[91] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. MAXIM: Multi-axis MLP for image processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5769–5780, 2022.