# latent_radiance_fields_with_3daware_2d_representations__92438ad4.pdf

Published as a conference paper at ICLR 2025

LATENT RADIANCE FIELDS WITH 3D-AWARE 2D REPRESENTATIONS

Chaoyi Zhou, Xi Liu, Feng Luo, Siyu Huang
Visual Computing Division, School of Computing, Clemson University
{chaoyiz,xi9,luofeng,siyuh}@clemson.edu

Figure 1: This work enables radiance field representations in the latent space of a VAE, achieving photorealistic 3D reconstruction performance on unbounded outdoor scenes. (Panels: Groundtruth; 3DGS on Latent Space; Latent Radiance Fields (ours).)

Latent 3D reconstruction has shown great promise in empowering 3D semantic understanding and 3D generation by distilling 2D features into the 3D space. However, existing approaches struggle with the domain gap between 2D feature space and 3D representations, resulting in degraded rendering performance. To address this challenge, we propose a novel framework that integrates 3D awareness into the 2D latent space. The framework consists of three stages: (1) a correspondence-aware autoencoding method that enhances the 3D consistency of 2D latent representations, (2) a latent radiance field (LRF) that lifts these 3D-aware 2D representations into 3D space, and (3) a VAE-Radiance Field (VAE-RF) alignment strategy that improves image decoding from the rendered 2D representations. Extensive experiments demonstrate that our method outperforms state-of-the-art latent 3D reconstruction approaches in terms of synthesis performance and cross-dataset generalizability across diverse indoor and outdoor scenes. To our knowledge, this is the first work showing that radiance field representations constructed from 2D latent representations can yield photorealistic 3D reconstruction performance. The project page is latent-radiance-field.github.io.

1 INTRODUCTION

Recently, significant advances in radiance field representations, such as Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) and 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023), have enabled fast and high-quality 3D reconstruction and novel view synthesis (NVS). As demonstrated by Stable Diffusion Models (Rombach et al., 2021), optimizing in the 2D latent space instead of the image space can significantly boost generation efficiency. Meanwhile, a 3D-consistent latent space and photorealistic decoding capability can benefit many tasks such as text-to-3D generation, latent NVS, few-shot NVS, efficient NVS, 3D latent diffusion models, and 3D semantic understanding. To empower 3D semantic understanding, researchers have explored latent 3D reconstruction methods, such as Feature 3DGS (Zhou et al., 2024a), to distill 2D semantic features into 3D space for novel view semantic segmentation. However, there are significant domain gaps between the 2D feature space and 3D representations, arising from the lack of consistent 3D spatial structure information, which hinders the direct feeding of 2D features into the 3D representations. The 2D feature extractors cannot effectively perceive the 3D structures behind the input images, since the training images are presented to the network in an unstructured way and the training objective does not include 3D consistency. Therefore, the loss of 3D awareness is inevitable.

Equal contribution. Corresponding author: Siyu Huang.

arXiv:2502.09613v1 [cs.CV] 13 Feb 2025

Few previous works attempt to bridge the gap between 2D features and 3D representations. Focusing on better 3D scene understanding, Feature 3DGS (Zhou et al., 2024a) distills a feature field from 2D semantic features using a view-independent approach, and therefore cannot model view-dependent visual properties. Another line of work improves the latent field in the context of 3D generation. Latent-NeRF (Metzer et al., 2022) and ED-NeRF (Park et al., 2023) include an additional per-scene refinement layer to enhance the latent rendering quality, at the price of extra computation cost and reduced generalizability.

To bridge the gap between the 2D latent space and 3D representations, we observe two main challenges. First, the massive view-dependent high-frequency noise in the 2D latent space causes inconsistent geometry and unstable optimization. Second, the data distribution shift caused by applying RGB-based NVS methods to latent features prevents photorealistic rendering. To tackle these two issues, our key insight is to embed 3D awareness into the latent space while maximally preserving the representation ability of autoencoders, without introducing any additional layers. We propose a novel framework that builds a latent radiance field (LRF) on 3D-aware 2D representations. Specifically, it consists of three stages. First, we introduce a correspondence-aware autoencoding method to improve the 3D awareness of the VAE's latent space, making the 2D representations geometrically consistent. Then, we build the LRF to represent 3D scenes, lifting the 3D-aware 2D representations into 3D space. Finally, we introduce a VAE-Radiance Field (VAE-RF) alignment method to further mitigate the data distribution shift caused by NVS and boost the performance of image decoding from the rendered 2D representations. Together, the resulting 3D-aware latent space and LRFs can be smoothly injected into existing NVS or 3D generation pipelines without further fine-tuning, achieving high-quality and photorealistic synthesis results.

To the best of our knowledge, this is the first work demonstrating that radiance field representations constructed in the latent space, with the injection of 3D awareness, can achieve photorealistic 3D reconstruction performance across various settings, including indoor and unbounded outdoor scenes. Extensive NVS, 3D generation, and few-shot NVS experiments show that our method outperforms existing methods in synthesis quality and cross-dataset generalizability, as shown in Fig. 1 and the following sections. In summary, the main contributions of this work include:

• We introduce a novel framework to integrate 3D awareness into 2D representation learning, including a correspondence-aware autoencoding method and a VAE-Radiance Field (VAE-RF) alignment to enable high-quality 3D reconstruction in latent space.

• We propose the latent radiance field (LRF) to effectively elevate the 3D-aware 2D representations into 3D latent fields. It represents the first step towards constructing radiance field representations directly in the latent space for 3D reconstruction tasks.

• We conduct extensive experiments to show that our method achieves superior fidelity and cross-dataset generalizability across NVS, few-shot NVS, and 3D generation tasks.

2 RELATED WORK

Injecting 3D priors into 2D representations. While many existing works focus on incorporating 2D features into 3D representations, which improves performance in downstream tasks such as scene understanding (Zhi et al., 2021; Ha & Song, 2022; Qin et al., 2023; Shi et al., 2023; Zhou et al., 2024a; Cen et al., 2023; Gu et al., 2024; Guo et al., 2024), less attention has been paid to the opposite direction: leveraging 3D knowledge to enhance 2D features, which benefits challenging tasks that require 3D understanding from limited perceived information, such as monocular depth estimation (Stan et al., 2023; Bhat et al., 2023; Piccinelli et al., 2024; Chatterjee et al., 2024; Moon et al., 2023) and semantic segmentation (Wang et al., 2023; Sun et al., 2024). Studies such as (Bachmann et al., 2022; Zhou et al., 2024a) utilize 3D priors from multi-view and geometric information to improve Masked Autoencoders (He et al., 2021), achieving better performance on the downstream tasks of segmentation and detection. However, directly injecting geometry constraints into pre-trained feature extractors harms the self-supervised 2D representation, and relying heavily on pre-trained feature extractors limits performance and requires significant computational resources. In contrast, our method does not require any additional per-scene refinement module, serving as an efficient and generalizable approach for injecting 3D priors into 2D representations.

Radiance field representations on images and features. Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) and 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) are benchmark radiance field representation methods for the NVS task. NeRF represents 3D scenes and renders photorealistic novel views based on the representation capacity of neural networks. 3DGS employs a set of 3D Gaussian primitives to represent 3D scenes, and a fast differentiable rasterizer to enable more efficient rendering while keeping the photorealism of novel views. However, distilling 2D features into 3D representations remains challenging, mainly due to the significant geometric inconsistency in the feature maps caused by massive high-frequency information. Therefore, some recent works (Zhou et al., 2024a; Kobayashi et al., 2022; Siddiqui et al., 2023; Fan et al., 2022; Kerr et al., 2023) propose alternative solutions that leverage geometry information from the RGB space to help the 3D reconstruction of 2D features. Fit3D (Yue et al., 2024) builds a large dataset of 3D representations as supervision for fine-tuning a pre-trained feature extractor; however, because it does not consider the compatibility between the 3D representation and the 2D feature space, it also requires a customized decoder to ensure performance in downstream tasks. All the methods mentioned above rely on per-scene optimization with additional modules, while our method bridges the gap between the 2D feature space and the 3D representation with an efficient correspondence-aware method.

Text-to-3D generation with 2D priors. Despite the impressive 3D generation capabilities demonstrated by many existing works guided by 2D generative priors (Tang et al., 2023; Poole et al., 2022; Wu et al., 2023; Zhou et al., 2024b; Jain et al., 2022; Michel et al., 2021), back-propagating the Score Distillation Sampling (SDS) loss (Poole et al., 2022) on images is computationally intensive and time-consuming. Latent diffusion models (LDMs) offer more efficient solutions by operating in the latent space. However, the vastly different distribution of the latent space means that directly utilizing the latent representations for NVS leads to degraded rendering performance. To our knowledge, only a few works attempt to overcome this challenge. Latent-NeRF (Metzer et al., 2022) employs a per-scene refinement layer to map the rendered latent to RGB space as an additional constraint for training the NeRF representations. ED-NeRF (Park et al., 2023) introduces a more complex refinement module by initializing from a set of specific layers in a Variational Autoencoder (VAE). Although these per-scene refinement modules effectively mitigate the artifacts in the rendering results, they require resource-consuming optimization for each scene and lack generalization ability to novel views or scenes. Moreover, the smoothness introduced by the neural networks hinders the reconstruction of high-frequency signals in the 2D features. In contrast, our method requires no additional effort for lifting the 2D features to the 3D radiance field representations, so that it can be injected into any existing NVS or text-to-3D framework smoothly and efficiently.

3 PRELIMINARIES

Variational autoencoder. A variational autoencoder (VAE) (Kingma, 2013) is a generative model that represents high-dimensional data distributions in a lower-dimensional latent space. The encoder maps the input data $x$ to a latent variable $z$ by estimating the parameters of a posterior distribution $q_\phi(z|x)$. The posterior is typically assumed to be Gaussian, parameterized by a mean $\mu_\phi(x)$ and a variance $\sigma_\phi(x)^2$. The latent variable $z$ is sampled from this posterior distribution, i.e., $z \sim q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \sigma_\phi(x)^2)$. The decoder reconstructs the input $x$ by mapping $z$ back to the data space through the likelihood $p_\theta(x|z)$. The learning objective of the VAE is:

$$\mathcal{L}_{\text{VAE}}(\theta, \phi; X) = \mathbb{E}_{q_\phi(Z|X)}\left[\log p_\theta(X|Z)\right] - \mathrm{KL}\left(q_\phi(Z|X) \,\|\, p(Z)\right). \tag{1}$$
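As an illustrative sketch of Eq. 1 (not the authors' implementation), the NumPy code below evaluates the objective for a diagonal Gaussian posterior. It assumes a standard normal prior $p(Z) = \mathcal{N}(0, I)$ and a Gaussian likelihood with fixed standard deviation `recon_std`; both choices, and all function names, are assumptions made for illustration.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

def elbo(x, x_recon, mu, sigma, recon_std=1.0):
    """Eq. 1 under a fixed-variance Gaussian likelihood (assumption).

    The additive constant of log p(x|z) is dropped, so only differences
    between ELBO values are meaningful.
    """
    log_likelihood = -0.5 * np.sum((x - x_recon) ** 2) / recon_std**2
    return log_likelihood - kl_to_standard_normal(mu, sigma)
```

The objective is maximized; a perfect reconstruction with a posterior equal to the prior gives the largest attainable value of zero under these assumptions.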

3D Gaussian Splatting. 3DGS (Kerbl et al., 2023) is an efficient NVS framework that uses a set of 3D Gaussian primitives to represent a scene explicitly. Each Gaussian primitive has a position vector $\mu \in \mathbb{R}^3$, a 3D covariance matrix $\Sigma \in \mathbb{R}^{3 \times 3}$, an opacity $\alpha \in \mathbb{R}$, and spherical harmonics (SH) coefficients $c \in \mathbb{R}^k$ (Ramamoorthi & Hanrahan, 2001) representing the view-dependent colors. The Gaussian is defined as

$$G(x) = \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right), \tag{2}$$

where $\Sigma = R S S^\top R^\top$, $S$ denotes the scaling matrix, and $R$ is the rotation matrix. Then, rasterization (Zwicker et al., 2001) transforms the 3D Gaussians to the 2D camera plane, where the 2D covariance matrix in camera space is

$$\Sigma' = J W \Sigma W^\top J^\top, \tag{3}$$

where $W$ is the viewing transformation matrix and $J$ is the Jacobian of the affine approximation of the projective transformation. For every pixel, the Gaussians are traversed in depth order from the image plane, and their colors $c_i$ are combined through alpha compositing, forming the pixel color $C$ as

$$C = \sum_{i \in N} c_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j), \tag{4}$$

where $N$ is the ordered set of Gaussians overlapping the pixel.
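The covariance construction and the compositing rule of Eqs. 3 and 4 can be sketched in a few lines of NumPy. This is a minimal per-pixel illustration with hypothetical function names, not the tile-based CUDA rasterizer of 3DGS; the inputs to `composite` are assumed to be already sorted from near to far.

```python
import numpy as np

def covariance_3d(R, s):
    """Sigma = R S S^T R^T with S = diag(s)."""
    S = np.diag(s)
    return R @ S @ S.T @ R.T

def project_covariance(Sigma, W, J):
    """Sigma' = J W Sigma W^T J^T (Eq. 3); J is 2x3, W is 3x3."""
    return J @ W @ Sigma @ W.T @ J.T

def composite(colors, alphas):
    """Front-to-back alpha compositing of Eq. 4 for one pixel.

    colors: (N, C) per-Gaussian colors, sorted near to far.
    alphas: (N,) per-Gaussian opacities at this pixel.
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return weights @ colors
```

For two Gaussians with opacity 0.5 each, the blending weights are 0.5 and 0.25, so the nearer Gaussian dominates the pixel.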

4 METHOD

In this work, we propose a method to achieve 3D-aware 2D representations and enable 3D reconstruction in the latent space. We base our method on the widely used Variational Autoencoder (VAE) from Latent Diffusion models (Metzer et al., 2022). To enhance the 3D awareness of both the encoder and the decoder of the VAE, we present a three-stage pipeline as illustrated in Fig. 2. The first stage improves the 3D awareness of the VAE's encoder through a novel correspondence-aware constraint on the latent space, making the 2D representations geometrically consistent (Sec. 4.1). The second stage builds a latent radiance field (LRF) to represent 3D scenes from the 3D-aware 2D representations (Sec. 4.2). The third stage introduces a VAE-Radiance Field (VAE-RF) alignment method to boost the reconstruction performance (Sec. 4.3). Together, these stages let our LRF perform 3D reconstruction in the 2D latent space instead of the image space. It can render high-quality and photorealistic novel views, even for unbounded scenes (Sec. 5). More details of our method are discussed in the following sections.

4.1 CORRESPONDENCE-AWARE AUTOENCODING

The first stage of our method incorporates geometry awareness into the autoencoding process. Given $K$ multi-view images $\mathcal{I} = \{I_i\}_{i=1}^{K}$, $I_i \in \mathbb{R}^{H \times W \times 3}$, the VAE encoder extracts 2D representations $\mathcal{Z} = \{Z_i\}_{i=1}^{K}$, $Z_i \in \mathbb{R}^{H \times W \times 4}$, in a low-dimensional latent space while effectively preserving the semantic information. However, as shown in Fig. 4, most existing NVS frameworks fail to reconstruct photorealistic images from the rendered latents. This is mainly because the VAE encoding process significantly damages the multi-view consistency present in the original image space: compressing the RGB space into a compact latent space introduces massive high-frequency noise (see Fig. 3). This poses severe challenges for reconstructing the 2D latent representations in 3D space.

Correspondence consistency on the latent space. To address the above issue and enable effective latent 3D reconstruction, we are inspired by multi-view correspondence consistency, which serves as the foundational principle for modeling the natural 3D world. Specifically, points $x_i \in \mathbb{R}^2$ in image $I_i$ and points $x_j \in \mathbb{R}^2$ in another image $I_j$ are considered correspondences if they are related by the fundamental matrix $F_{ij} \in \mathbb{R}^{3 \times 3}$, satisfying the multi-view geometry constraint (Schönberger & Frahm, 2016):

$$x_j^\top F_{ij} x_i = 0. \tag{5}$$
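To make the constraint in Eq. 5 concrete, the sketch below builds a fundamental matrix from a known relative pose and checks that the projections of one 3D point satisfy it. It assumes pinhole cameras, points written in homogeneous pixel coordinates $(u, v, 1)$, and the convention $X_j = R X_i + t$ between the two camera frames; the function names are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(K_i, K_j, R, t):
    """F_ij = K_j^{-T} [t]_x R K_i^{-1}, for camera frames with X_j = R X_i + t."""
    essential = skew(t) @ R
    return np.linalg.inv(K_j).T @ essential @ np.linalg.inv(K_i)

def project(K, X):
    """Pinhole projection of a 3D point in camera coordinates to (u, v, 1)."""
    x = K @ X
    return x / x[2]

def epipolar_residual(F, x_i, x_j):
    """Left-hand side of Eq. 5; zero for an exact correspondence."""
    return float(x_j @ F @ x_i)
```

A true correspondence gives a residual of zero up to floating-point error, while moving one of the two points off its epipolar line gives a non-zero residual.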


Figure 2: An illustration of our pipeline for creating a latent radiance field in conjunction with 3D-aware 2D representation fine-tuning. First, in Stage-I, we inject 3D awareness into the VAE's encoder by applying a novel correspondence consistency constraint on the latent space, making the 2D representations geometrically consistent. Then, in Stage-II, we create the latent radiance field (LRF) to represent 3D scenes based on the 3D-aware 2D representations. Finally, in Stage-III, we introduce a VAE-Radiance Field alignment method to enhance the performance of image decoding from the rendered latent space.

Eq. 5 implies that a pair of corresponding points in image space should be consistent with each other, so that consistent geometry can be ensured during optimization in 3D space; otherwise, artifacts and redundant geometry caused by local optima will damage the quality of 3D reconstruction and novel view synthesis. Motivated by this, we propose a computationally efficient strategy that incorporates correspondence consistency into autoencoder training. Specifically, a set of multi-view images $\mathcal{I} = \{I_i\}_{i=1}^{K}$, $I_i \in \mathbb{R}^{H \times W \times 3}$, is fed into the autoencoder to extract the latent representations $\mathcal{Z} = \{Z_i\}_{i=1}^{K}$, $Z_i \in \mathbb{R}^{H \times W \times 4}$, and the correspondence consistency loss on the latent space is computed by

$$\mathcal{L}_{\text{corres}} = \sum_{i=1}^{K} \sum_{j \in \mathcal{K}(i)} \lambda_{ij} \left\| z_i - z_j \right\|_1, \tag{6}$$

where $z_i$ refers to a latent pixel in $Z_i$, $z_j$ is the corresponding latent pixel in the neighbouring latent $Z_j$, and $\mathcal{K}(i)$ indexes the neighbouring views of view $i$. $\mathcal{L}_{\text{corres}}$ ensures that the encoded features follow the correspondence consistency derived from the multi-view images. Here $\lambda_{ij}$ is a weight based on the average pose error (APE), computed from the Frobenius norm between the camera poses of images $I_i$ and $I_j$, so that the pose distance between views is taken into account when representing view-dependent latent codes. The details of calculating $\lambda_{ij}$ can be found in Appendix A.1. By injecting the latent correspondence consistency into standard VAE training, our VAE training objective is:

$$\mathcal{L}_{\text{Stage I}} = \mathcal{L}_{\text{VAE}} + \lambda_1 \mathcal{L}_{\text{corres}} + \lambda_2 \mathcal{L}_{\text{reg}}. \tag{7}$$

$\mathcal{L}_{\text{VAE}}$ is the original VAE training objective in Eq. 1. $\mathcal{L}_{\text{reg}} = \mathrm{KL}\left(q(Z|X) \,\|\, q_{\text{original}}(Z|X)\right)$ keeps the fine-tuned 2D representations close to those of the pre-trained VAE, preserving the representation capability of the fine-tuned autoencoder. This new learning objective ensures that the compact latent space of the VAE preserves multi-view geometric consistency, such that it is compatible with existing NVS frameworks such as 3DGS.
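A minimal sketch of the correspondence consistency loss in Eq. 6 and the combined objective in Eq. 7 is given below. It assumes that correspondences between latent pixels and the pose-based weights $\lambda_{ij}$ have already been computed elsewhere; the data layout (`matches` and `weights` dictionaries keyed by view pairs) and the function names are hypothetical, and the sketch is not the authors' training code.

```python
import numpy as np

def correspondence_loss(latents, matches, weights):
    """Weighted L1 distance between corresponding latent pixels (Eq. 6).

    latents: (K, h, w, C) array of encoded views.
    matches: dict mapping a view pair (i, j) to (pix_i, pix_j), two (M, 2)
             integer arrays of (row, col) positions of matched latent pixels.
    weights: dict mapping (i, j) to the pose-based weight lambda_ij.
    """
    total = 0.0
    for (i, j), (pix_i, pix_j) in matches.items():
        z_i = latents[i, pix_i[:, 0], pix_i[:, 1]]  # (M, C)
        z_j = latents[j, pix_j[:, 0], pix_j[:, 1]]  # (M, C)
        total += weights[(i, j)] * np.abs(z_i - z_j).sum()
    return total

def stage1_loss(l_vae, l_corres, l_reg, lam1, lam2):
    """Eq. 7: L_VAE + lambda_1 * L_corres + lambda_2 * L_reg."""
    return l_vae + lam1 * l_corres + lam2 * l_reg
```

In an actual fine-tuning loop the same computation would be written in a differentiable framework so that gradients reach the encoder; NumPy is used here only to show the arithmetic.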

Insight into latent correspondence consistency. The maximum degree of the spherical harmonics is typically set to 3 in NVS methods for efficiency and robustness in modeling view-dependent information. More specifically, the lower-degree terms mostly capture low-frequency information such as the albedo of the scene, while the higher degrees tend to model high-frequency, view-dependent information such as lighting. In the latent space, the latent code can be considered a combination of a base value and high-frequency noise. Because the representation is so compact, the amount of noise can increase greatly compared to the RGB space, making it more difficult for the SH coefficients to model the information from different views. With the maximum degree fixed, it is easier for the SH coefficients to reach the global optimum rather than over-fit locally. Fortunately, with our $\mathcal{L}_{\text{corres}}$, the high-frequency noise can be effectively removed while the high-quality image generation ability is preserved, leading to more stable optimization and a consistent geometry representation. Fig. 3 shows that the correspondence-aware encoding significantly removes the high-frequency noise in the 2D latent space; the Fast Fourier Transform (FFT) visualization likewise shows less high-frequency noise in the latent space produced by our encoder, which makes lifting the 2D features into 3D latent fields effective.

Figure 3: A visualization of the latent spaces of the original and our fine-tuned VAEs. Our method ensures accurate geometry in the latent space while removing a certain amount of high-frequency noise. (Panels: Image; VAE latent; Fine-tuned latent; VAE latent FFT; Fine-tuned latent FFT.)
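The kind of frequency analysis visualized in Fig. 3 can be approximated with a simple spectral statistic. The sketch below measures the fraction of a latent channel's spectral energy above a radial frequency cutoff; the cutoff value and the function name are assumptions chosen for illustration, and the statistic is not taken from the paper.

```python
import numpy as np

def high_freq_ratio(channel, cutoff=0.25):
    """Fraction of spectral energy above `cutoff` (in cycles per pixel).

    channel: 2D array holding one channel of a latent map.
    The mean is removed first so the DC component does not dominate.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(channel - channel.mean()))
    power = np.abs(spectrum) ** 2
    # Frequency coordinates aligned with the shifted spectrum.
    fy = np.fft.fftshift(np.fft.fftfreq(channel.shape[0]))
    fx = np.fft.fftshift(np.fft.fftfreq(channel.shape[1]))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    total = power.sum()
    if total == 0.0:
        return 0.0
    return float(power[radius > cutoff].sum() / total)
```

Comparing this ratio for latents from the original and the fine-tuned encoder on the same image would give a scalar counterpart to the FFT panels, with a lower value indicating less high-frequency content.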

4.2 LATENT RADIANCE FIELD

Based on the 3D-aware 2D representation fine-tuning discussed in Sec. 4.1, we create 3D representations directly in the 2D latent space of VAE, namely the latent radiance field (LRF). We take 3DGS (Kerbl et al., 2023) as an example of radiance field representations to discuss our LRF.

Following 3DGS, a set of latent 3D Gaussians is formulated as

$$\mathcal{G} = \left\{(\mu, s, R, \alpha, SH_f)_j\right\}_{1 \le j \le M}, \tag{8}$$

where $\mu \in \mathbb{R}^3$ is the 3D mean of the Gaussian, $S = \mathrm{diag}(s) \in \mathbb{R}^{3 \times 3}$ is the Gaussian scale, $R \in \mathbb{R}^{3 \times 3}$ its orientation, $\alpha \in \mathbb{R}$ a per-Gaussian opacity, and $SH_f$ models the view-dependent latent in the 3D latent space. Following the differentiable rasterization process of 3DGS, we rasterize the 2D latent representations using point-based $\alpha$-blending, so that the rendered latent value $\hat{z}$ at a pixel is:

$$\hat{z} = \sum_{i \in N} z_i \alpha_i \prod_{j=1}^{i-1} (1 - \alpha_j), \tag{9}$$

where $N$ is the set of ordered Gaussians overlapping the pixel, $z_i \in \mathbb{R}^{dim}$ is the view-dependent latent code of each Gaussian, $dim$ is the number of latent dimensions of the feature, and $\alpha_i$ is given by evaluating a 2D Gaussian with covariance $\Sigma'$ multiplied by a learned per-point opacity. Let $\mathcal{I} = \{I_i\}_{i=1}^{K}$, $I_i \in \mathbb{R}^{H \times W \times 3}$, be a set of multi-view images of a scene with corresponding camera parameters, and let $\mathcal{Z} = \{Z_i\}_{i=1}^{K}$, $Z_i \in \mathbb{R}^{H \times W \times 4}$, be the corresponding set of latents from the VAE encoder. The rasterization function $r$ renders a set of latent Gaussians into a 2D latent representation according to the camera pose $P_i$. We then optimize the latent Gaussian parameters to best represent the latents $\mathcal{Z}$:

$$\hat{\mathcal{G}} = \arg\min_{\{(\mu, s, R, \alpha, SH_f)\}} \sum_{i=1}^{K} \mathcal{L}_f\left(r(\mathcal{G}, P_i), Z_i\right), \tag{10}$$

where $\mathcal{L}_f$ is a pixel-wise $\ell_1$ loss combined with a D-SSIM term. Notably, we do not need to impose the additional geometric consistency constraints introduced by previous literature (Yue et al., 2024; Kobayashi et al., 2022; Zhou et al., 2024a), as our correspondence-aware autoencoder fine-tuning ensures geometrically consistent 2D representations in the 3D space. Therefore, our LRF reconstructs the 2D latent representations as a radiance field representation directly, and enables efficient rendering of the 2D latent representations for novel views.
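A rough sketch of a loss of the form used for $\mathcal{L}_f$ is shown below. It assumes the weighting $(1 - \lambda)\,\ell_1 + \lambda\,\text{D-SSIM}$ with $\lambda = 0.2$ and D-SSIM $= 1 - \text{SSIM}$, which are the defaults of the public 3DGS implementation rather than values stated in this section. It also replaces the usual windowed SSIM with a single global-statistics SSIM to stay short, and `data_range` is an assumed dynamic range of the latents.

```python
import numpy as np

def global_ssim(a, b, data_range=1.0):
    """SSIM computed from whole-array statistics (a simplification of the
    usual sliding-window SSIM, used here only for illustration)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2.0 * mu_a * mu_b + c1) * (2.0 * cov + c2)
    denominator = (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

def latent_render_loss(rendered, target, lam=0.2, data_range=1.0):
    """(1 - lam) * L1 + lam * (1 - SSIM) between a rendered and a target latent."""
    l1 = np.abs(rendered - target).mean()
    d_ssim = 1.0 - global_ssim(rendered, target, data_range)
    return (1.0 - lam) * l1 + lam * d_ssim
```

The loss is zero when the rendered latent equals the encoder latent and grows as either the per-pixel error or the structural mismatch increases.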

4.3 VAE-RADIANCE FIELD ALIGNMENT

Although the correspondence-aware autoencoding introduced in Sec. 4.1 improves the 3D consistency of the VAE latent space, the LRF distribution $p(z_{\text{NVS}})$ is still shifted from the VAE latent distribution $p(z_{\text{VAE}})$ due to the non-linearity in neural rendering, resulting in a performance decrease when we decode LRF rendering results back to images through the VAE decoder.

Table 1: Our method outperforms the image and latent space NVS baselines on most settings and metrics, from object-level to unbounded outdoor scenes. Latent-NeRF denotes our adaptation of it to NVS. 3DGS/8 and Mip-Splatting/8 operate in image space; the remaining methods operate in latent space.

| Dataset | Metric | 3DGS/8 | Mip-Splatting/8 | 3DGS-VAE | Latent-NeRF | Feature-GS | 3DGS-LRF (Ours) |
|---|---|---|---|---|---|---|---|
| MVImgNet | PSNR | 16.93 | 24.89 | 25.04 | 18.50 | 21.09 | 26.26 |
| MVImgNet | SSIM | 0.561 | 0.799 | 0.824 | 0.709 | 0.772 | 0.863 |
| MVImgNet | LPIPS | 0.466 | 0.328 | 0.250 | 0.403 | 0.372 | 0.178 |
| NeRF-LLFF | PSNR | 9.98 | 19.68 | 19.07 | 18.31 | 16.48 | 20.00 |
| NeRF-LLFF | SSIM | 0.110 | 0.484 | 0.493 | 0.457 | 0.415 | 0.541 |
| NeRF-LLFF | LPIPS | 0.631 | 0.513 | 0.364 | 0.387 | 0.539 | 0.289 |
| DL3DV-10K | PSNR | 14.03 | 21.81 | 20.57 | 18.16 | 16.60 | 22.45 |
| DL3DV-10K | SSIM | 0.352 | 0.609 | 0.595 | 0.530 | 0.449 | 0.667 |
| DL3DV-10K | LPIPS | 0.541 | 0.451 | 0.346 | 0.432 | 0.602 | 0.197 |
| Mip-NeRF360 | PSNR | 14.79 | 22.38 | 19.44 | 15.93 | 17.13 | 20.83 |
| Mip-NeRF360 | SSIM | 0.273 | 0.502 | 0.404 | 0.312 | 0.337 | 0.469 |
| Mip-NeRF360 | LPIPS | 0.586 | 0.521 | 0.432 | 0.537 | 0.642 | 0.328 |

We further propose to fine-tune the VAE decoder under radiance field guidance to address this issue. With the LRF built in Sec. 4.2, we can reconstruct LRFs from a large number of scenes to generate a latent-image paired dataset. This dataset consists of the 2D latent representations $\mathcal{Z} = \{Z_i\}_{i=1}^{K}$, $Z_i \in \mathbb{R}^{H \times W \times 4}$, rendered by LRFs and the corresponding ground truth images $\mathcal{I} = \{I_i\}_{i=1}^{K}$, $I_i \in \mathbb{R}^{H \times W \times 3}$. Notably, we also include the training views of LRFs in this dataset, since a key feature of existing NVS methods is that they overfit the training views. The training objective of our VAE-RF alignment decoder fine-tuning is:

$$\mathcal{L}_{\text{Stage III}} = \lambda_{\text{train}} \left\| D(Z_{\text{train}}) - I_{\text{train}} \right\|_1 + \lambda_{\text{novel}} \left\| D(Z_{\text{novel}}) - I_{\text{novel}} \right\|_1, \tag{11}$$

where $D(\cdot)$ is the decoder, $Z_{\text{train}}$ and $Z_{\text{novel}}$ are the latent codes of the training views and novel views, respectively, and $I_{\text{train}}$ and $I_{\text{novel}}$ refer to the corresponding ground truth images. $\lambda_{\text{train}}$ and $\lambda_{\text{novel}}$ are the weighting coefficients that balance the contributions of the training and novel views. Both weights are set to 0.5 so that the decoder learns not only to decode effectively from the training views but also to generalize to novel views. Eq. 11 effectively minimizes the distribution mismatch between the VAE latent space and the LRF rendering space. After decoder fine-tuning, high-quality images can be reconstructed from the LRF rendering of either training or novel views. The fine-tuned autoencoder enhances 3D reconstruction and generation by providing a geometry-aware 2D latent space as well as a radiance field-compatible autoencoder.
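The decoder fine-tuning objective of Eq. 11 can be sketched as follows. The `decode` argument stands in for the VAE decoder and is a hypothetical callable; the $\ell_1$ terms are averaged over pixels here, which is an assumption, since Eq. 11 writes the un-normalized norm.

```python
import numpy as np

def stage3_loss(decode, z_train, i_train, z_novel, i_novel,
                lam_train=0.5, lam_novel=0.5):
    """Weighted L1 reconstruction loss over training and novel views (Eq. 11).

    decode:  callable mapping rendered latents to images (the VAE decoder).
    z_*:     latents rendered by the latent radiance field.
    i_*:     ground-truth images for the same views.
    """
    l_train = np.abs(decode(z_train) - i_train).mean()
    l_novel = np.abs(decode(z_novel) - i_novel).mean()
    return lam_train * l_train + lam_novel * l_novel
```

With both weights at 0.5, as stated above, an error on novel views is penalized exactly as much as the same error on training views.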

5 EXPERIMENTS

5.1 LATENT 3D RECONSTRUCTION

We first evaluate LRF on four real-world datasets, MVImgNet (Yu et al., 2023), NeRF-LLFF (Mildenhall et al., 2019), Mip-NeRF360 (Barron et al., 2022), and DL3DV-10K (Ling et al., 2024), to demonstrate the effectiveness of our approach for latent 3D reconstruction. Among these datasets, DL3DV-10K serves as the in-distribution dataset: its training set is used for model training and its test set for evaluation. In contrast, MVImgNet, NeRF-LLFF, and Mip-NeRF360 are out-of-distribution datasets, as they are never used in the training process. We follow the standard train and test split in 3DGS and Mip-Splatting (Kerbl et al., 2023; Yu et al., 2024).

Fig. 4 shows that our method significantly improves the capability of the 2D latent representations for the 3D reconstruction task. Our approach mitigates artifacts such as ghosting, color distortion, blurring, and texture warping caused by 3D inconsistency. While the latent and image space approaches share the same input resolution, our rendering results present clearer visual details, richer textures, and more high-frequency information.

Figure 4: A visual comparison of rendering results. Our method not only renders high-quality images for the in-distribution dataset (DL3DV-10K), but also shows strong generalization ability across different datasets. (Datasets: Mip-NeRF360, NeRF-LLFF, DL3DV-10K. Panels per scene: Groundtruth, Ours, 3DGS-VAE, Mip-Splatting/8, Feature-GS, 3DGS/8.)

Figure 5: Visual comparison of different text-to-3D generation methods. Our model enables the generation of more view-consistent results. (Prompts and comparisons: "A lego man", Ours vs. Dreamfusion; "A vase with pink flowers", Ours vs. Dreamfusion; "A DSLR photo of a tray of sushi containing pugs", Ours vs. GSGEN.)

Table 2: A comparison of different methods on the LLFF dataset using 3 views.

| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| 3DGS-VAE | 13.06 | 0.283 | 0.570 |
| 3DGS | 13.79 | 0.331 | 0.468 |
| Mip-Splatting | 13.70 | 0.315 | 0.486 |
| Ours | 15.51 | 0.379 | 0.465 |

As shown in Table 1, LRF achieves the best performance among latent-space methods across all datasets in terms of PSNR, SSIM, and LPIPS. These results underscore the effectiveness of our approach in fine-tuning latent space representations to support novel view synthesis: it not only reduces the geometry information loss caused by 3D-inconsistent 2D representations but also preserves perceptual and textural information in NVS outputs. Compared to the original VAE model, our fine-tuning approach significantly enhances 3D consistency in the 2D latent representations by enforcing consistency between corresponding points, resulting in superior latent NVS performance across all metrics. Image Space means that we input images to 3DGS at the same resolution as the latent representations, then render output images at the same resolution as before latent encoding. Since we render high-resolution images from low-resolution training images, to avoid unfair comparisons caused by aliasing, we also compare our method with Mip-Splatting (Yu et al., 2024), which specializes in super-resolution rendering. Compared with these image space methods, our latent reconstruction method still achieves better performance on most of the datasets, highlighting its potential for future work in efficient 3D representation learning.
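For reference, the PSNR values reported in the tables follow the standard definition $10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The sketch below assumes images scaled to $[0, 1]$, so that `max_val` is 1; the evaluation code used for the paper is not shown in this section, so this is only the textbook formula.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val**2 / mse))
```

Under this definition, a uniform error of 0.1 on a unit-range image corresponds to 20 dB, and each additional 10 dB corresponds to a tenfold reduction in mean squared error.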

We also demonstrate generalizability through few-shot novel view synthesis on the NeRF-LLFF dataset. We follow the same experimental configurations as previous work (Li et al., 2024; Liu et al., 2024) and keep the same input resolution for all methods. As shown in Table 2, our method outperforms the compared approaches in the sparse-view setting.

5.2 TEXT-TO-3D GENERATION

We evaluate our method within state-of-the-art text-to-3D generation frameworks in both latent and image space. We use GSGEN (Chen et al., 2024) and Dreamfusion (Poole et al., 2022) as the image-space generation frameworks, and Latent-NeRF (Metzer et al., 2022) as the latent-space method. GSGEN is optimized in the 512 × 512 image space, and Dreamfusion in the 800 × 800 image space. Latent-NeRF is optimized in the 128 × 128 latent space, and images are then reconstructed at a resolution of 1024 × 1024. Following the prompts evaluated in these two works, we generate 3D objects and render them from multiple views. The text prompts fed into GSGEN are more complicated, since it is the state-of-the-art generation method.

As shown in Fig. 5, our method boosts performance under extremely complicated text prompts, achieving complex geometry while preserving multi-view consistency. Moreover, our encoder model significantly enhances high-frequency details such as the texture of the fried chicken. In addition, our approach is compatible with the diffusion model operating within the original VAE latent space. Without any fine-tuning of the diffusion U-Net parameters, the diffusion process remains capable of accurately denoising the 2D latent representations provided by our fine-tuned VAE, according to the text guidance. Furthermore, the VAE-RF alignment in decoder fine-tuning also facilitates the reconstruction of rendered latent representations, improving the image quality after VAE decoding.

Table 3: We ablate correspondence-aware autoencoding and VAE-radiance field aligned decoder fine-tuning on the DL3DV-10K dataset to reveal their necessity in latent 3D reconstruction.

| VAE encoder fine-tuned | Decoder fine-tuned | PSNR | SSIM | LPIPS |
|---|---|---|---|---|
| - | - | 20.57 | 0.595 | 0.346 |
| ✓ | - | 21.16 | 0.620 | 0.282 |
| - | ✓ | 21.73 | 0.645 | 0.208 |
| ✓ | ✓ | 22.45 | 0.667 | 0.197 |

Figure 6: A qualitative study of the effect of different fine-tuning stages on view synthesis results. (Panels: Groundtruth; No fine-tuning; Encoder fine-tuning; Decoder fine-tuning; Ours.)

5.3 ABLATION STUDY

We conduct ablation studies on two major components of our three-stage framework, the correspondence-aware autoencoding and the VAE-RF aligned decoder fine-tuning, to assess their contributions to overall performance. The quantitative results shown in Table 3 indicate that both components contribute to performance improvement. Notably, the decoder presents a more significant impact on the results, as it directly influences the reconstruction of images from the latent space, thereby leading to stronger performance gains. Although the encoder does not directly act on image reconstruction, it enhances geometric consistency of 2D representations, which also contributes to the performance improvement in 3D reconstruction.

The qualitative results are shown in Fig. 6. The encoder fine-tuning allows the 3D latent space to capture more precise geometry, reduce blurriness in the synthesized images, and recover finer details. Additionally, the decoder fine-tuning further refines the results by rectifying inaccuracies and preserving perceptual and textural fidelity. Together, these modules synergistically contribute to significant improvements in the overall pipeline.

6 CONCLUSION

This paper introduces the Latent Radiance Field (LRF), which, to our knowledge, is the first work to construct radiance field representations directly in the 2D latent space for 3D reconstruction. We present a novel framework for incorporating 3D awareness into 2D representation learning, featuring a correspondence-aware autoencoding method and a VAE-Radiance Field (VAE-RF) alignment strategy to bridge the domain gap between the 2D latent space and the natural 3D space, thereby significantly enhancing the visual quality of our LRF. Future work will focus on combining our method with more compact 3D representations, efficient NVS, and few-shot NVS in latent space, as well as exploring its application to potential 3D latent diffusion models.


ACKNOWLEDGMENT

The authors thank Minghui Xu for beneficial discussions. This work is partially supported by the AIM for Composites, an Energy Frontier Research Center funded by the U.S. Department of Energy (DOE), Office of Science, Basic Energy Sciences (BES), under Award # DE-SC0023389 and by the US National Science Foundation (NSF; Grant Number MTM2-2025541, OIA-2242812). The authors acknowledge research support from Clemson University with a generous allotment of computation time on the Palmetto cluster.

REFERENCES

Roman Bachmann, David Mizrahi, Andrei Atanov, and Amir Zamir. MultiMAE: Multi-modal multi-task masked autoencoders. 2022.

Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. CVPR, 2022.

Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.

Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. Segment any 3d gaussians. arXiv preprint arXiv:2312.00860, 2023.

Agneet Chatterjee, Tejas Gokhale, Chitta Baral, and Yezhou Yang. On the robustness of language guidance for low-level vision tasks: Findings from depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2794–2803, June 2024.

Zilong Chen, Feng Wang, Yikai Wang, and Huaping Liu. Text-to-3d using gaussian splatting, 2024. URL https://arxiv.org/abs/2309.16585.

Zhiwen Fan, Peihao Wang, Xinyu Gong, Yifan Jiang, Dejia Xu, and Zhangyang Wang. Nerf-sos: Any-view self-supervised object segmentation from complex real-world scenes. arXiv e-prints, pp. arXiv–2209, 2022.

Michael Grupp. evo: Python package for the evaluation of odometry and slam. https://github.com/MichaelGrupp/evo, 2017.

Qiao Gu, Zhaoyang Lv, Duncan Frost, Simon Green, Julian Straub, and Chris Sweeney. Egolifter: Open-world 3d segmentation for egocentric perception. arXiv preprint arXiv:2403.18118, 2024.

Jun Guo, Xiaojian Ma, Yue Fan, Huaping Liu, and Qing Li. Semantic gaussians: Open-vocabulary scene understanding with 3d gaussian splatting, 2024.

Huy Ha and Shuran Song. Semantic abstraction: Open-world 3D scene understanding from 2D vision-language models. In Proceedings of the 2022 Conference on Robot Learning, 2022.

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.

Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. Zero-shot text-guided object generation with dream fields. 2022.

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023. URL https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/.

Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In International Conference on Computer Vision (ICCV), 2023.

Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.


Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitzmann. Decomposing nerf for editing via feature field distillation. In Advances in Neural Information Processing Systems, volume 35, 2022. URL https://arxiv.org/pdf/2205.15585.pdf.

Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. arXiv preprint arXiv:2403.06912, 2024.

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22160–22169, 2024.

Xi Liu, Chaoyi Zhou, and Siyu Huang. 3dgs-enhancer: Enhancing unbounded 3d gaussian splatting with view-consistent 2d diffusion priors. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. arXiv preprint arXiv:2211.07600, 2022.

Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. arXiv preprint arXiv:2112.03221, 2021.

Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM Transactions on Graphics (TOG), 2019.

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

Jaeho Moon, Juan Luis Gonzalez Bello, Byeongjun Kwon, and Munchurl Kim. From-ground-to-objects: Coarse-to-fine self-supervised monocular depth estimation of dynamic objects with ground contact prior. arXiv preprint arXiv:2312.10118, 2023.

Jangho Park, Gihyun Kwon, and Jong Chul Ye. Ed-nerf: Efficient text-guided editing of 3d scene using latent space nerf. arXiv preprint arXiv:2310.02712, 2023.

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv, 2022.

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. arXiv preprint arXiv:2312.16084, 2023.

Ravi Ramamoorthi and Pat Hanrahan. An efficient representation for irradiance environment maps. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 497–500, 2001.

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. arXiv preprint arXiv:2311.18482, 2023.

Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Norman Müller, Matthias Nießner, Angela Dai, and Peter Kontschieder. Panoptic lifting for 3d scene understanding with neural fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9043–9052, June 2023.


Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, et al. Ldm3d: Latent diffusion model for 3d. arXiv preprint arXiv:2305.10853, 2023.

Boyuan Sun, Yuqi Yang, Le Zhang, Ming-Ming Cheng, and Qibin Hou. Corrmatch: Label propagation via correlation matching for semi-supervised semantic segmentation. IEEE Computer Vision and Pattern Recognition (CVPR), 2024.

Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023.

Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, and Chang Zhou. One-peace: Exploring one general representation model toward unlimited modalities. arXiv preprint arXiv:2305.11172, 2023.

Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors. arXiv, 2023.

Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Tianyou Liang, Guanying Chen, Shuguang Cui, and Xiaoguang Han. Mvimgnet: A large-scale dataset of multi-view images. In CVPR, 2023.

Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19447–19456, June 2024.

Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, and Jan Eric Lenssen. Improving 2D Feature Representations by 3D-Aware Fine-Tuning. In European Conference on Computer Vision (ECCV), 2024.

Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J. Davison. In-place scene labelling and understanding with implicit scene representation. 2021.

Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21676–21685, 2024a.

Shijie Zhou, Zhiwen Fan, Dejia Xu, Haoran Chang, Pradyumna Chari, Suya Bharadwaj, Tejas You, Zhangyang Wang, and Achuta Kadambi. Dreamscene360: Unconstrained text-to-3d scene generation with panoramic gaussian splatting. arXiv preprint arXiv:2404.06903, 2024b.

Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and Markus Gross. Surface splatting. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 371–378, 2001.


A.1 DETAILS OF CALCULATING THE WEIGHT

To compute $\lambda_{ij}$, we first calculate the relative transformation for each pose pair as $E_{ij} = P_i^{-1} P_j$, where $P_i$ and $P_j$ are the two camera poses. After obtaining $E_{ij}$, the Absolute Pose Error (APE) is calculated as:

$$\mathrm{APE}_{ij} = \left\| E_{ij} - I_{4 \times 4} \right\|_F, \tag{12}$$

where $I_{4 \times 4}$ is the identity matrix and $\|\cdot\|_F$ represents the Frobenius norm. In each iteration, the APE values are normalized across all image pairs to derive the weights $\lambda_{ij}$:

$$\lambda_{ij} = \frac{\mathrm{APE}_{ij}}{\sum_{k} \mathrm{APE}_{k}},$$

where $k$ indexes the image pairs within one iteration. This normalization ensures that the weights reflect the relative contribution of each pose error in a consistent manner. This method is implemented based on the APE computation approach in the evo library (Grupp, 2017).
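The weight computation above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes camera poses are given as 4 × 4 homogeneous matrices, and the helper names (`ape`, `pair_weights`, `translation_pose`) as well as the uniform fallback for identical poses are hypothetical.

```python
import numpy as np


def ape(pose_i, pose_j):
    """APE between two 4x4 camera poses, following Eq. (12)."""
    # Relative transformation E_ij = P_i^{-1} P_j.
    e_ij = np.linalg.inv(pose_i) @ pose_j
    # Frobenius norm of the deviation from the identity.
    return np.linalg.norm(e_ij - np.eye(4), ord="fro")


def pair_weights(pose_pairs):
    """Normalize the APE values over all pairs of one iteration."""
    apes = np.array([ape(p_i, p_j) for p_i, p_j in pose_pairs])
    total = apes.sum()
    if total == 0.0:
        # Assumption: fall back to uniform weights if all poses coincide.
        return np.full(len(apes), 1.0 / len(apes))
    return apes / total


def translation_pose(tx):
    """Hypothetical helper: a pose translated by tx along the x-axis."""
    pose = np.eye(4)
    pose[0, 3] = tx
    return pose


# Two pairs: the second is three times farther apart than the first.
pairs = [
    (translation_pose(0.0), translation_pose(1.0)),
    (translation_pose(0.0), translation_pose(3.0)),
]
weights = pair_weights(pairs)
print(weights)
```

In this toy case the pair with the larger pose difference receives the larger weight, and the weights of one iteration sum to one.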

A.2 DETAILS OF DATASET

We create a correspondence pair dataset from the training set of the DL3DV-10K dataset (Ling et al., 2024) to fine-tune our VAE encoder. We randomly sample 784 scenes and extract correspondence pairs from the multi-view images using COLMAP. The correspondence points for each scene are pre-computed before model fine-tuning. We use a sequential matcher with the number of overlapping images set to 10 and the number of quadratic overlaps set to 1. This overlap search strategy ensures that our model learns not only from easy, dense correspondences but also from challenging cases among far-view image pairs, which adds robustness to our model. The ability to remain consistent under large view differences is particularly necessary for unbounded outdoor reconstruction. Moreover, we enable loop detection and set the minimum number of inliers and the minimum inlier ratio to 15 and 0.25, respectively, to make sure the extracted correspondences are accurate enough. We also train the same number of latent 3D Gaussian splatting scenes from the DL3DV-10K dataset to create a paired dataset of images and rendered latents, which is used for Stage-III decoder fine-tuning.
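For readers who want to reproduce the matching settings, the sketch below assembles a COLMAP sequential-matcher invocation from the values stated above. The option names are an assumption based on COLMAP's command-line interface and have changed between releases (for example, the inlier thresholds are grouped under a different prefix in newer versions), so they should be checked against `colmap sequential_matcher -h`. The function name and both file paths are hypothetical placeholders.

```python
def sequential_matcher_command(database_path, vocab_tree_path):
    """Build an argument list for COLMAP's sequential matcher.

    Assumption: option names follow an older COLMAP 3.x release; verify
    them against the installed version before running.
    """
    return [
        "colmap", "sequential_matcher",
        "--database_path", database_path,
        # Match each image against its 10 sequential neighbors.
        "--SequentialMatching.overlap", "10",
        # Also match against neighbors at quadratically growing offsets.
        "--SequentialMatching.quadratic_overlap", "1",
        # Loop detection relies on a vocabulary tree file.
        "--SequentialMatching.loop_detection", "1",
        "--SequentialMatching.vocab_tree_path", vocab_tree_path,
        # Geometric verification thresholds stated in the text.
        "--SiftMatching.min_num_inliers", "15",
        "--SiftMatching.min_inlier_ratio", "0.25",
    ]


# Placeholder paths for illustration only.
command = sequential_matcher_command("scene/database.db", "vocab_tree.bin")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once feature extraction has populated the database.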

A.3 IMPLEMENTATION DETAILS

For Stage-I, we employ the pre-trained VAE model (f = 8, KL) from the LDM model zoo as the backbone VAE. We fine-tune the VAE on 2 NVIDIA A100-80GB GPUs for around one day, using the correspondence pair dataset with an image resolution of 512 × 512, a base learning rate of 4.5e-06, and the default optimizer. For Stage-III, we fine-tune the decoder on the image-latent dataset with 2 NVIDIA A100-80GB GPUs for around one day.

In the implementation of LRF, we normalize the latent input to the radiance field using the scale of all input views to stabilize radiance field optimization, and apply denormalization during rendering. During the VAE encoder fine-tuning stage, we start the discriminator at step 501 for better image quality, and we set $\mathrm{KL}_{\text{weight}} = 1.0 \times 10^{-6}$ and $D_{\text{weight}} = 0.5$. For the decoder training, we use the same configuration as the original VAE, except that $\mathrm{KL}_{\text{weight}} = 0$, to ensure that only the decoder is optimized.
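The normalization step can be illustrated with a small NumPy sketch. The text does not specify how the scale is computed, so the code assumes a per-channel min-max range taken over all input views. Both that choice and the function names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def fit_latent_scale(latents):
    """Per-channel range over all views; latents has shape (V, C, H, W)."""
    lo = latents.min(axis=(0, 2, 3), keepdims=True)
    hi = latents.max(axis=(0, 2, 3), keepdims=True)
    return lo, hi


def normalize(latents, lo, hi, eps=1e-8):
    """Map latents into [0, 1] before radiance field optimization."""
    return (latents - lo) / (hi - lo + eps)


def denormalize(normed, lo, hi, eps=1e-8):
    """Invert normalize() on rendered latents before VAE decoding."""
    return normed * (hi - lo + eps) + lo


# Stand-in for encoded views: 4 views, 4 latent channels, 8 x 8 latents.
rng = np.random.default_rng(0)
latents = rng.normal(scale=5.0, size=(4, 4, 8, 8))

lo, hi = fit_latent_scale(latents)
normed = normalize(latents, lo, hi)
restored = denormalize(normed, lo, hi)
```

Because the same `lo` and `hi` are shared by all views of a scene, the radiance field is fitted to targets in a fixed range, and denormalization restores the original latent scale expected by the decoder.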

A.4 IMAGE RECONSTRUCTION PERFORMANCE

To verify that our approach does not degrade the performance on downstream tasks, we evaluate the image reconstruction performance of our fine-tuned VAE by calculating the PSNR between the original images and the reconstructed images. As shown in Table 4, adding the correspondence consistency constraint to inject 3D awareness and applying a regularization loss to keep the latent space close to the original latent space have minimal impact on the VAE's reconstruction performance. This ensures that our VAE model can still be effectively used in conjunction with other pre-trained models, such as the Stable Diffusion model, without any fine-tuning.


Table 4: Evaluation of PSNR for images reconstructed by VAEs on NeRF-LLFF, DL3DV-10K, and Mip-NeRF360 datasets.

| Method | Metric | NeRF-LLFF | DL3DV-10K | Mip-NeRF360 |
|---|---|---|---|---|
| VAE | PSNR | 23.47 | 24.59 | 24.54 |
| Our-VAE | PSNR | 23.59 | 23.25 | 24.24 |
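For reference, the PSNR metric reported in Table 4 follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The snippet below is a generic sketch of that definition, assuming images stored as floating-point arrays in [0, 1]; it is not the evaluation script used for the table, and the function name is illustrative.

```python
import numpy as np


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy example: a constant image and a copy offset by 0.1 everywhere,
# which gives an MSE of 0.01.
image = np.full((8, 8, 3), 0.5)
shifted = image + 0.1
value = psnr(image, shifted)
print(f"{value:.2f} dB")
```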

A.5 MORE IMAGE GENERATION RESULTS

Fig. 7 demonstrates that our VAE model can generate 3D objects guided by text prompts without any fine-tuning of the diffusion model. Moreover, Fig. 8 shows that our VAE can also improve GSGEN (Chen et al., 2024), yielding better 3D generation under complicated text prompts.

A.6 EFFICIENCY ANALYSIS

Table 5 demonstrates that our method reduces the input resolution, model storage, and GPU memory usage required for photorealistic NVS, which is particularly useful when communication bandwidth and storage are limited. For instance, users without large-memory GPUs can still run photorealistic NVS with our method.

Table 5: Efficiency comparison of different image-space and latent-space NVS methods.

| Method | Input resolution | Training time | GPU usage | Storage | Rendering FPS | Decoding FPS | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|---|---|---|---|
| 3DGS | 512 × 512 | 5.9 min | 3 GB | 200.41 MB | 100 | - | 26.17 | 0.778 | 0.009 |
| 3DGS/8 | 64 × 64 | 3.1 min | 1 GB | 59.15 MB | 200 | - | 14.03 | 0.352 | 0.541 |
| 3DGS-VAE | 64 × 64 | 4.8 min | 2 GB | 250.97 MB | 80 | 20 | 20.57 | 0.595 | 0.346 |
| Latent-NeRF | 64 × 64 | 27.2 min | 10 GB | 350.50 MB | 0.09 | 20 | 18.16 | 0.530 | 0.432 |
| Ours | 64 × 64 | 3.9 min | 1 GB | 96.42 MB | 180 | 20 | 22.45 | 0.667 | 0.197 |

A.7 MORE EXPERIMENTAL RESULTS

To demonstrate the effectiveness and generalizability of our method for latent 3D reconstruction, we show more NVS and 3D generation results on four datasets covering indoor scenes, outdoor scenes, and object-level scenes. As shown in Figs. 9, 10, 11, and 12, our method yields a significant improvement in image quality.


Figure 7: Samples for text-to-3D generation on the image and latent space.
Panels: Ours vs. Dreamfusion and Ours vs. Latent-NeRF. Prompts: "A small saguaro cactus planted in a clay pot", "A hamburger", "A stack of pancakes covered in maple syrup", "An ice cream", "A temple", "A lego man".

Figure 8: More samples for text-to-3D generation on the image space.
Panels: Ours vs. GSGEN. Prompts: "A DSLR photo of a tray of sushi containing pugs", "A zoomed out DSLR photo of a cake in the shape of a train", "A zoomed out DSLR photo of a plate of fried chicken and waffles".


Figure 9: More NVS results on the DL3DV-10K dataset. Panels: Groundtruth, Ours, 3DGS-VAE, Mip-Splatting, Feature-GS, 3DGS.


Figure 10: More NVS results on the NeRF-LLFF dataset. Panels: Groundtruth, Ours, 3DGS-VAE, Mip-Splatting, Feature-GS, 3DGS.


Figure 11: More NVS results on the Mip-NeRF360 dataset. Panels: Groundtruth, Ours, 3DGS-VAE, Mip-Splatting, Feature-GS, 3DGS.


Figure 12: More NVS results on the MVImgNet dataset. Panels: Groundtruth, Ours, 3DGS-VAE, Mip-Splatting, Feature-GS, 3DGS.