# Compressible-composable NeRF via Rank-residual Decomposition

Jiaxiang Tang1, Xiaokang Chen1, Jingbo Wang2, Gang Zeng1,3
1School of Intelligence Science and Technology, Peking University
2Chinese University of Hong Kong
3Intelligent Terminal Key Laboratory of Sichuan Province
{tjx, pkucxk}@pku.edu.cn, wj020@ie.cuhk.edu.hk, zeng@pku.edu.cn

Neural Radiance Field (NeRF) has emerged as a compelling method to represent 3D objects and scenes for photo-realistic rendering. However, its implicit representation makes the models difficult to manipulate compared with explicit mesh representations. Several recent advances in NeRF manipulation are usually restricted by a shared renderer network, or suffer from large model size. To circumvent this hurdle, in this paper we present a neural field representation that enables efficient and convenient manipulation of models. To achieve this goal, we learn a hybrid tensor rank decomposition of the scene without neural networks. Motivated by the low-rank approximation property of the SVD algorithm, we propose a rank-residual learning strategy to encourage the preservation of primary information in lower ranks. The model size can then be dynamically adjusted by rank truncation to control the levels of detail, achieving near-optimal compression without extra optimization. Furthermore, different models can be arbitrarily transformed and composed into one scene by concatenating along the rank dimension. The growth of storage cost can also be mitigated by compressing the unimportant objects in the composed scene. We demonstrate that our method achieves rendering quality comparable to state-of-the-art methods, while enabling the extra capability of compression and composition. Code is available at https://github.com/ashawkey/CCNeRF.

## 1 Introduction

Photo-realistic rendering and manipulation of 3D scenes have been long-standing problems with numerous real-world applications, such as VR/AR, computer games, and video creation. Recently, volumetric Neural Radiance Field (NeRF) representations [26, 1, 8, 27] have shown impressive progress in rendering photo-realistic images with rich details. However, due to this implicit representation of geometry and appearance, manipulating the underlying scenes encoded by NeRF remains a challenging problem.

To address this problem, some works [21, 45, 19] introduce scene-specific features and a scene-agnostic rendering network, so that scenes trained with a shared rendering network can be composed together. However, the constrained and biased capability of these rendering networks makes it difficult to extend to various objects or scenes. New objects have to be trained with a fixed rendering network to be compatible with the old objects. Other works [36] discard the rendering network and adopt a no-neural-network NeRF representation, which is more convenient for manipulating the reconstructed scenes and is still able to render high-quality images. Nevertheless, the large storage requirement for each single model is detrimental to composing complex scenes with many objects.

We present a novel approach that allows efficient and convenient manipulation of scenes represented with our model. Two aspects should be fulfilled to achieve this goal. The first is that we can dynamically adjust the model size to support different levels of detail (LOD) in different scenarios.
Figure 1: Compressibility and Composability of our method (panels: transformation & composition; dynamic compression, PSNR 36.01 / 33.62). We present a tensor rank decomposition based neural field representation, which supports model compression through rank truncation, and arbitrary composition between different models through rank concatenation. Both of these operations require no extra optimization or any constraints in training (e.g., a shared renderer).

This functions similarly to mipmaps in graphics and requires no extra optimization step. The second is that all models can be transformed and composed arbitrarily for manipulation with no constraints in training. This promises that our models are always reusable and support the most basic operations in a 3D editor like Blender [9]. We name these two properties compressibility and composability.

For compressibility, we are motivated by the properties of Singular Value Decomposition (SVD) and High-order SVD (HOSVD) [10]. Our aim is to learn the decomposition of a 3D scene from only 2D observations like TensoRF [8], and further preserve the near-optimal low-rank approximation property. We propose a simple and flexible tensor rank decomposition based neural radiance field, and a rank-residual learning strategy. Each 3D scene is modeled by a 4D feature volume, which can be described with a set of rank components and a matrix storing the weights for each feature channel. The rank components are either vector- or matrix-based, corresponding to the CANDECOMP/PARAFAC (CP) decomposition [14, 4] and a less compact triple-plane variant. We introduce a rank-residual learning strategy to encourage the lower ranks to preserve more important information of the whole scene. Combined with an empirical sort-and-truncate strategy, the proposed method achieves near-optimal low-rank approximation at any targeted rank. Different LODs are represented with different low-rank truncations of the model, allowing a dynamic trade-off between model size and rendering quality without retraining.

Besides, our model contains no neural networks and thus naturally supports composability. Since there are no MLP renderers in our model, we can compose different objects by simply concatenating their rank components. A transformation matrix is recorded for each object to control its position and orientation in the scene. As demonstrated in Figure 1, we are able to control each model's LOD and size within a flexible range, and perform arbitrary transformation and composition of different models. Furthermore, these two properties are connected through the underlying concept of rank and can be combined in practical use. For example, we can mitigate the growth of model size of a complex scene composed of multiple objects by compressing the less important objects.

Our contributions can be summarized as follows: (1) We propose a simple radiance field representation based on two types of tensor rank decomposition, which allows flexible control of model size and naturally supports transformation and composition of different models. (2) We design a rank-residual learning strategy to enable near-optimal low-rank approximation. After training, our model can be dynamically adjusted to trade off between performance and model size without retraining. (3) The proposed method reaches comparable rendering quality with state-of-the-art methods, while additionally enabling both compressibility and composability.
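To make rank truncation and rank concatenation concrete, the following is a minimal NumPy sketch of a CP-style factorized feature volume. It is an illustration under simplifying assumptions rather than the released CCNeRF implementation: the names `CPScene`, `truncate`, and `compose` are invented for this example; continuous interpolation, the matrix-based components, per-object transformation matrices, and the rank-residual training itself are omitted; and composition assumes both objects share the same grid resolution.

```python
import numpy as np

class CPScene:
    """Toy CP-style scene: R rank-one components over an (X, Y, Z) grid,
    plus a (C, R) matrix of per-channel rank weights."""

    def __init__(self, ux, uy, uz, weights):
        self.ux, self.uy, self.uz = ux, uy, uz   # shapes (R, X), (R, Y), (R, Z)
        self.weights = weights                   # shape (C, R)

    def query(self, i, j, k):
        """C-dimensional feature at integer grid coordinate (i, j, k)."""
        per_rank = self.ux[:, i] * self.uy[:, j] * self.uz[:, k]  # (R,)
        return self.weights @ per_rank                             # (C,)

    def truncate(self, r):
        """Compression: keep only the first r (most important) ranks."""
        return CPScene(self.ux[:r], self.uy[:r], self.uz[:r], self.weights[:, :r])


def compose(a, b):
    """Composition: concatenate two scenes along the rank dimension.
    A query of the result equals the sum of the two scenes' feature vectors."""
    return CPScene(
        np.concatenate([a.ux, b.ux], axis=0),
        np.concatenate([a.uy, b.uy], axis=0),
        np.concatenate([a.uz, b.uz], axis=0),
        np.concatenate([a.weights, b.weights], axis=1),
    )


rng = np.random.default_rng(0)
scene_a = CPScene(rng.standard_normal((8, 64)), rng.standard_normal((8, 64)),
                  rng.standard_normal((8, 64)), rng.standard_normal((4, 8)))
scene_b = scene_a.truncate(4)        # lower level of detail, roughly half the storage
merged = compose(scene_a, scene_b)   # 12 ranks in total, no retraining needed
print(scene_a.query(1, 2, 3).shape, merged.query(1, 2, 3).shape)  # (4,) (4,)
```

As noted above, the actual method additionally records a transformation matrix per object to control its placement in the composed scene; that bookkeeping is omitted from this sketch.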
## 2 Related Work

### 2.1 Scene Representation with NeRF

3D scenes can be represented in various forms, including volumes, point clouds, meshes, and implicit representations [33, 34, 7, 30, 38, 25, 23, 28]. NeRF [26] proposes to use a 5D function to represent the scene and applies volumetric rendering for novel view synthesis, achieving photo-realistic results and detailed geometry reconstruction. This powerful representation has quickly received attention and has been extensively studied and applied in various fields [50, 24], such as generative settings [6, 37, 29, 5], dynamic scenes [20, 31], and texture mapping [44]. In particular, we categorize recent progress by the design of the underlying functions into three classes: neural network-based, hybrid, and no-neural-network.

Neural network-based representations typically apply an MLP as the implicit function to encode 3D scenes. The original NeRF [26] and most following works [1, 2, 51, 43, 46, 35] choose this representation for its simplicity. However, the training and inference speed of such a network is generally slow due to the relatively expensive MLP computation. Therefore, hybrid representations try to reduce the size of the MLP by storing the 3D features in an explicit data structure. Since a dense 3D representation is unaffordable, different methods are explored. For example, NSVF [21] adopts sparse voxel grids, PlenOctrees [49] adopts octrees, instant-ngp [27] adopts a multi-scale hashmap, and TensoRF [8] factorizes the scene into lower-rank components. Querying such a hybrid representation is much faster, reducing training and inference time and even reaching interactive FPS. Lastly, no-neural-network representations attempt to model the 3D scene without neural networks. Plenoxels [36] shows that an explicit sparse voxel representation alone is enough to model complex 3D scenes. Our method also belongs to this class, sharing a similar tensor rank decomposition idea with TensoRF [8], but we focus on two additional capabilities, i.e., compressibility and composability, which are important yet usually absent in previous work.

### 2.2 Tensor Decomposition and Low-rank Approximation

Decomposition of high-order tensors [18] can be considered a generalization of the matrix singular value decomposition. The Tucker decomposition [40] decomposes a tensor into a core tensor multiplied by a matrix along each mode. The CANDECOMP/PARAFAC (CP) decomposition [14, 4] factorizes a tensor into a sum of rank-one component tensors, and can be viewed as a special case of Tucker where the core tensor is superdiagonal. The high-order singular value decomposition [10] provides a method to compute a specific Tucker decomposition with an all-orthogonal core tensor. Low-rank approximation is a common problem that applies tensor decomposition, and has found various applications such as image compression. Although the truncated HOSVD, unlike the truncated SVD, is not optimal, it still yields a quasi-optimal solution [10, 41, 12], which is sufficiently good in practical use. Tensor rank decomposition and its variants [10, 11] have been used in various vision and learning tasks [47, 8, 48]. Specifically, TensoRF [8] first leverages the CP decomposition and a Vector-Matrix (VM) decomposition to factorize neural radiance fields, but its other designs (e.g., the use of an MLP) disturb the property of tensor rank decomposition and prevent it from achieving compression or composition.
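To make the low-rank approximation property concrete, the snippet below is a small, self-contained NumPy illustration of truncated SVD (the classic Eckart–Young best rank-r approximation); it is an analogy for the rank truncation used by our method, not code from the paper.

```python
import numpy as np

# Truncated SVD: keeping the r largest singular values gives the best rank-r
# approximation of a matrix in the Frobenius norm (Eckart-Young).
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 64)) @ rng.standard_normal((64, 256))  # rank <= 64

U, S, Vt = np.linalg.svd(A, full_matrices=False)

for r in (8, 32, 64):
    A_r = (U[:, :r] * S[:r]) @ Vt[:r]   # reconstruct from the top-r components
    err = np.linalg.norm(A - A_r) / np.linalg.norm(A)
    print(f"rank {r:2d}: relative error {err:.3e}")
# The error shrinks monotonically as r grows and vanishes (up to round-off) at r = 64.
```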
Instead, we focus on modeling neural radiance fields only with tensor rank decomposition, and aim to preserve the low-rank approximation property, enabling the compression of a learned neural radiance field similar to the SVD compression of an image.

### 2.3 Manipulation and Composition of NeRF

Manipulation and composition are important for the practical usage of a 3D representation. Explicit 3D representations, e.g., meshes, are natively editable and composable. However, such operations are difficult to perform on neural network-based implicit representations like a vanilla NeRF. NSVF [21] can compose separate objects together, but these objects have to be trained together using a shared MLP, which limits its flexibility and potential usage. Later works [45, 29, 13, 19] learn object-compositional NeRFs, but are usually scene-specific and do not allow cross-scene composition without retraining. Geometry and appearance editing [22, 42] of neural fields also requires an extra optimization step to modify the neural network-based representation. With its explicit sparse voxel representation, Plenoxels [36] naturally supports direct composition of different objects, but suffers from the large storage cost of its dense index matrix. Our method also supports arbitrary affine transformations and compositions without extra optimization. Further, we can efficiently mitigate the growth of model size thanks to the compact tensor rank decomposition and the compressibility.

Figure 2: Model structure (diagram: vector components $U_x, U_y, U_z$; matrix components $\mathcal{U}_{y,z}, \mathcal{U}_{x,z}, \mathcal{U}_{x,y}$; $R_{\mathrm{vec}} + R_{\mathrm{mat}}$ rank components). Our model is composed of a matrix storing rank weights for different feature channels, and a set of decomposed rank components. Each rank component can be either vector- or matrix-based, and the ratio can be controlled to trade off between model size and performance. To query any 3D coordinate, we first project it to the decomposed vectors or matrices as denoted by the black lines, and then perform weighted interpolation. || denotes concatenation along the rank dimension.

## 3 Methodology

### 3.1 Preliminaries on Neural Radiance Fields

Neural Radiance Fields (NeRF) [26] represents a 3D volumetric scene with a 5D function $f_\Theta$ that maps a 3D coordinate $\mathbf{x} = (x, y, z)$ and a 2D viewing direction $\mathbf{d} = (\theta, \phi)$ into a volume density $\sigma$ and an emitted color $\mathbf{c} = (r, g, b)$. Given a ray $\mathbf{r}$ originating at $\mathbf{o}$ with direction $\mathbf{d}$, we query $f_\Theta$ at points $\mathbf{x}_i = \mathbf{o} + t_i \mathbf{d}$ sequentially sampled along the ray to get densities $\{\sigma_i\}$ and colors $\{\mathbf{c}_i\}$. The color of the pixel corresponding to the ray is then estimated by numerical quadrature:
$$\hat{C}(\mathbf{r}) = \sum_i T_i \alpha_i \mathbf{c}_i, \qquad T_i = \prod_{j<i} (1 - \alpha_j),$$
where $\alpha_i = 1 - \exp(-\sigma_i \delta_i)$ is the opacity of the $i$-th sample and $\delta_i$ is the distance between adjacent samples.
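As a concrete reading of this quadrature, the following minimal NumPy sketch composites per-sample densities and colors along a single ray. The function name `composite` and the toy inputs are illustrative only; in practice the $\sigma_i$ and $\mathbf{c}_i$ come from querying the scene model at the sampled points.

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Estimate a pixel color from samples along one ray.

    sigmas: (N,) volume densities, colors: (N, 3) emitted colors,
    deltas: (N,) distances between adjacent samples.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                           # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]    # T_i = prod_{j<i}(1 - alpha_j)
    weights = trans * alphas                                          # contribution of each sample
    return weights @ colors                                           # (3,) pixel color

sigmas = np.array([0.0, 0.5, 2.0, 4.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
deltas = np.full(4, 0.25)
print(composite(sigmas, colors, deltas))
```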