# rematching_dynamic_reconstruction_flow__45339150.pdf

Published as a conference paper at ICLR 2025

REMATCHING DYNAMIC RECONSTRUCTION FLOW

Sara Oblak 1 Despoina Paschalidou 1 Sanja Fidler1 2 3 Matan Atzmon 1

1 NVIDIA 2 University of Toronto 3 Vector Institute {soblak,dpaschalidou,sfidler,matzmon}@nvidia.com

Reconstructing a dynamic scene from image inputs is a fundamental computer vision task with many downstream applications. Despite recent advancements, existing approaches still struggle to achieve high-quality reconstructions from unseen viewpoints and timestamps. This work introduces the ReMatching framework, designed to improve reconstruction quality by incorporating deformation priors into dynamic reconstruction models. Our approach advocates for velocity-field-based priors, for which we suggest a matching procedure that can seamlessly supplement existing dynamic reconstruction pipelines. The framework is highly adaptable and can be applied to various dynamic representations. Moreover, it supports integrating multiple types of model priors and enables combining simpler ones to create more complex classes. Our evaluations on popular benchmarks involving both synthetic and real-world dynamic scenes demonstrate that augmenting current state-of-the-art methods with our approach leads to a clear improvement in reconstruction accuracy.

1 INTRODUCTION

This work addresses the challenging task of novel-view dynamic reconstruction. That is, given a set of images of a dynamic scene evolving over time, the task objective is to render images from any novel view or intermediate point in time. Despite significant progress in recent years (Lombardi et al., 2021; Fridovich-Keil et al., 2023; Yunus et al., 2024), effectively learning representations of dynamic scenes still remains an open challenge. The main hurdle arises from the typically sparse nature of multi-view inputs, both temporally and spatially. To address this, prior knowledge is often incorporated - either from a physical prior such as rigidity (Sorkine & Alexa, 2007), or data-driven priors derived from large foundation models (Ling et al., 2024; Wang et al., 2024). However, existing approaches struggle with an inherent trade-off: while priors improve generalization, they often compromise reconstruction fidelity by failing to match exactly the given input. In turn, designing a method that effectively integrates priors without sacrificing high-fidelity reconstructions remains unresolved.

To address this issue, we introduce the ReMatching framework, a novel approach for designing and integrating deformation priors into dynamic reconstruction models. The ReMatching framework has three core features: i) an optimization objective that aligns reconstruction solutions with prior regularization as closely as possible, without sacrificing fidelity; ii) applicability to various model functions, including time-dependent rendered pixels or particles representing scene geometry; and iii) a flexible design of deformation prior classes, allowing more complex classes to be built from simpler ones.

To support the usage of rich deformation prior classes, we advocate for priors expressed through velocity fields. A velocity field is a mathematical object that describes the instantaneous change in time the deformation induces. As such, a velocity field can potentially provide a simpler characterization of the underlying flow deformation. For example, the complex class of volume-preserving flow deformations is characterized by the condition of being generated by divergence-free velocity fields (Eisenberger et al., 2019). However, representing a deformation through its generating velocity field typically necessitates numerical simulation for integration, a procedure that can be computationally expensive and time-consuming. Nevertheless, recent progress in flow-based generative models (Ben Hamu et al., 2022; Lipman et al., 2022; Albergo et al., 2023) supports simulation-free flow training, inspiring this work to explore simulation-free training for flow-based dynamic reconstruction models. Therefore, our framework is specifically designed to integrate with dynamic reconstruction models that represent dynamic scenes directly through time-dependent reconstruction functions (Pumarola et al., 2021; Yang et al., 2023).


Exploiting the simplicity offered by velocity-field-based deformation prior classes, we observe that the projection of a time-dependent reconstruction function onto a velocity-field prior class can be framed as a flow-matching problem. The opportunity to access the projected flow is reminiscent of the Alternating Projections Method (APM) (Deutsch, 1992), a greedy algorithm guaranteed to find the closest points between two sets. Therefore, we suggest an optimization objective aimed at re-projecting back onto the set of reconstruction flows. This corresponds to a flow-matching loss that we term the ReMatching loss. Our hypothesis is that by mimicking the APM, this optimization would converge to solutions that not only meet the reconstruction objective, but also reach the closest possible alignment to the required prior class. By doing so, we achieve the desired goal of improving generalization to unseen timestamps without compromising the fidelity of the solutions.

We instantiate our framework with a dynamic model based on the popular Gaussian Splats (Kerbl et al., 2023) rendering model. We explore several constructions for deformation prior classes, including piece-wise rigid and volume-preserving deformations. Additionally, we demonstrate our framework's usability for two different types of time-dependent functions: rendered image pixel colors, and particle positions representing scene geometry. Lastly, we evaluate our framework on standard dynamic reconstruction benchmarks, involving both synthetic and real-world scenes, and showcase a clear improvement in reconstruction quality.

Our contributions. In summary, the main contributions of this paper are:

1. We propose the ReMatching framework, which controls the optimization of dynamic reconstruction models to converge to solutions that closely align with a predefined prior class of deformations. In turn, the framework balances achieving high-fidelity reconstructions with leveraging the benefits of adhering to prior assumptions.

2. The framework unifies various types of model functions, including geometry representations and image rendering, under a single cohesive approach, ensuring broad applicability and making future advancements within it relevant to a wide range of models.

3. The framework allows for the combination of multiple prior classes, enabling users to design and adapt the method for their specific reconstruction settings.

2 RELATED WORK

Flow-based 3D dynamics. There is an extensive body of work utilizing flow-based deformations for 3D-related problems. For shape interpolation, Eisenberger et al. (2019) consider volume-preserving flows. For dynamic geometry reconstruction, Niemeyer et al. (2019) suggest learning neural parametrizations of velocity fields. This representation is further improved by augmenting it with a canonicalized object space parameterization (Rempe et al., 2020; Ren et al., 2021) or by simultaneously optimizing for 3D reconstruction and motion flow estimation (Vu et al., 2022). Similarly to Niemeyer et al. (2019), Du et al. (2021) suggest a flow-based representation of a dynamic rendering model based on a neural radiance field (Mildenhall et al., 2020). More recently, Chu et al. (2022) and Yu et al. (2023) explore combining a time-aware neural radiance field with a velocity field for modelling fluid dynamics. In contrast to our framework, they focus exclusively on recovering the deformation of specific fluids, i.e., smoke, and not on reconstructing generic non-rigid objects.

Dynamic novel-view rendering models. Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) is a popular image rendering model combining an implicit neural network with volumetric rendering. Several follow-up works (Pumarola et al., 2021; Park et al., 2021a; Tretschk et al., 2021) explore using NeRF for non-rigid reconstruction, by optimizing for time-dependent deformations. More recently, several works (Fridovich-Keil et al., 2023; Cao & Johnson, 2023; Wu et al., 2023; Song et al., 2023; Guo et al., 2023) try to address the training and inference inefficiencies of continuous volumetric representations by incorporating planes and grids into a spatio-temporal NeRF. An alternative to NeRF, suggesting an explicit scene representation, is the Gaussian Splatting (Kerbl et al., 2023) rendering model. Several works incorporate dynamics with Gaussian Splatting. Yang et al. (2023) introduce a time-conditioned local deformation network. Similarly, Wu et al. (2023) also rely on a canonical representation of a scene but further improve efficiency by considering a deformation model based on k-planes (Fridovich-Keil et al., 2023). Lastly, Lu et al. (2024) propose the integration of a global deformation model.


Given a collection of images, $F_t = \{I_t^i\}_{i=1}^{M}$, captured at $T$ time steps from $M \geq 1$ viewing directions, we seek to develop an image-based model for novel-view synthesis that can effectively render new images from unseen viewpoints in any direction $d \in S^2$ and any time $t \in [t_1, t_T]$. Since we aim to support several time-dependent elements in a dynamic reconstruction model, we employ a general notation for a dynamic image model. That is,

$$t \mapsto \Psi_t = \{\psi(t) \mid \psi: \mathbb{R}_+ \to V\}, \qquad (1)$$

with $\Psi_t$ representing the evaluation at time $t$ of all of the model components. Each element function $\psi: \mathbb{R}_+ \to V$, where $V$ is a vector space, can represent any time-dependent quantity specified by the model. $V$ denotes a different vector space depending on what $\psi$ models. For instance, if $\psi$ models the time-dependent RGB color of image pixels, $V = C^1(\mathbb{R}^d) = \{f \mid f: \mathbb{R}^d \to \mathbb{R}^3, \nabla f \text{ exists and is continuous}\}$ with $d = 2$. Whereas, if $\psi$ models the time-dependent position of $n$ particles representing the underlying scene geometry, $V = \mathbb{R}^{n \times d}$ with $d = 3$. Lastly, in what follows, we interchangeably switch between the notations $\psi(t)$ and $\psi_t$.

The common scheme for learning $\Psi_t$ involves supervising the model's image predictions at the given timestamps to reconstruct the input images $F_t$. The specific details of the time-dependent reconstruction model $\Psi_t$ and the reconstruction loss are deferred to Section 5. We begin first by introducing our proposed framework for incorporating priors through velocity fields.

3.1 VELOCITY FIELDS

We consider a velocity field to be a time-dependent function of the form:

$$v: \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^d, \qquad (2)$$

where usually $d = 3$ or $d = 2$. A velocity field defines a time-dependent deformation in space $\phi_t: \mathbb{R}^d \to \mathbb{R}^d$, also known as a flow, via an Ordinary Differential Equation (ODE):

$$\frac{\partial}{\partial t}\phi_t(x) = v(\phi_t(x), t), \qquad \phi_0(x) = x. \qquad (3)$$

Flow-based deformations are a ubiquitous modeling tool (Rezende & Mohamed, 2015; Chen et al., 2018) that has been extensively used in various dynamic reconstruction tasks (Niemeyer et al., 2019; Du et al., 2021). In a dynamic reconstruction model, a flow deformation can be incorporated by defining a time-dependent function $\psi_t: \mathbb{R}^d \to \mathbb{R}$ as a push-forward of some reference function $\psi_0$, i.e., $\psi_t = (\phi_t)_{*}\psi_0$. A key advantage of a flow-based deformation model is that its generating velocity field often admits simple characterizations, facilitating the integration of priors into the model. For example, restricting $\phi_t$ to be volume-preserving can be achieved by imposing the constraint $\operatorname{div}(v) = 0$ (Eisenberger et al., 2019).
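To make the flow ODE of equation 3 concrete, the following minimal sketch integrates a hypothetical 2D velocity field numerically. The rotation field, the midpoint integrator, the step count, and all function names are illustrative assumptions rather than part of the framework; the field is chosen to be divergence-free, so the resulting flow should preserve area.

```python
import numpy as np

def velocity(x, t):
    """Hypothetical divergence-free field: rigid rotation about the origin in 2D."""
    return np.stack([-x[..., 1], x[..., 0]], axis=-1)

def integrate_flow(x0, t_end, n_steps=1000):
    """Approximate phi_t(x0) with explicit midpoint steps of d/dt phi = v(phi, t)."""
    x = np.array(x0, dtype=float)
    dt = t_end / n_steps
    for k in range(n_steps):
        t = k * dt
        x_mid = x + 0.5 * dt * velocity(x, t)
        x = x + dt * velocity(x_mid, t + 0.5 * dt)
    return x

def triangle_area(p):
    """Area of a triangle given as a (3, 2) array of vertices."""
    a, b, c = p
    return 0.5 * abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))

# Flow a triangle for a quarter turn; div(v) = 0, so its area should not change.
tri0 = np.array([[1.0, 0.0], [2.0, 0.0], [1.0, 1.0]])
tri1 = integrate_flow(tri0, t_end=np.pi / 2)
print(triangle_area(tri0), triangle_area(tri1))
```

Each evaluation of the flow in this sketch costs `n_steps` velocity evaluations, which illustrates the simulation overhead that a simulation-free model avoids.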

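However, recovering the values of $\psi_t$ in the case $\psi_t = (\phi_t)_{*}\psi_0$ is not explicit. Typically, this is achieved by solving the continuity equation¹

$$\frac{\partial}{\partial t}\psi_t(x) + \operatorname{div}\left(\psi_t(x)\, v_t(x)\right) = 0, \quad \forall x \in \mathbb{R}^d, \qquad (4)$$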

which necessitates a numerical simulation. This introduces computational challenges for training flow-based models, as errors in the numerical simulation can destabilize the optimization process. Therefore, to overcome this hurdle, our framework assumes a reconstruction model consisting of functions ψt that are simulation-free, i.e., each evaluation of ψt requires only a single step. However, since we advocate for a deformation prior class formulation based on velocity fields, a key challenge lies in controlling a simulation-free ψt to adhere to such a prior. Addressing this challenge is a central aspect of our framework, as outlined in the following section.

¹ Assuming $\psi_t$ obeys a conservation law, where $v$ continuously deforms $\psi_t$.


3.2 FLOW REMATCHING

We assume that for a time-dependent reconstruction function, $\psi_t \in \Psi_t$, there exists an underlying flow $\phi_t$ such that $\psi_t$ can be described as a push-forward by $\phi_t$. We refer to $\phi_t$ as the reconstruction flow and denote its generating velocity field by $v_t$. Under our assumption that $\psi_t$ is simulation-free, neither $\phi_t$ nor $v_t$ is directly accessible. Nevertheless, for now, we assume access to an element $v_t \in \mathcal{V}$, where $\mathcal{V}$ represents the set of all the possible reconstruction-generating velocity fields. This assumption will be relaxed later. Let $\mathcal{P} \subset \{u_t \mid u: \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}^d\}$ be a prior class of velocity fields to which $v_t$ should belong. In Section 4, we discuss various choices for the prior class $\mathcal{P}$.

In some of the choices for $\mathcal{P}$, requiring $v_t \in \mathcal{P}$ could be over-restrictive, conflicting with the fact that $v_t$ must also generate the reconstruction flow. Hence, an appealing objective would be to optimize $v_t$ so that it is the closest element to $\mathcal{P}$ out of the set $\mathcal{V}$. We suggest an optimization procedure mimicking the alternating projections method (APM) (Deutsch, 1992). The APM is an iterative procedure where alternating orthogonal projections are performed between two closed Hilbert sub-spaces $\mathcal{V}$ and $\mathcal{P}$. Specifically, the iteration $v_{k+1} = \operatorname{proj}_{\mathcal{V}}(\operatorname{proj}_{\mathcal{P}}(v_k))$ guarantees the convergence of $v_k$ to a point attaining $\operatorname{dist}(\mathcal{V}, \mathcal{P})$. Following this concept, our next step is to find a suitable notion for defining the projection operator for reconstruction-generating velocity fields.
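As a toy illustration of the alternating projections idea, the sketch below alternates orthogonal projections between two non-intersecting lines in $\mathbb{R}^3$. The two lines are hypothetical finite-dimensional stand-ins for the sets of reconstruction-generating and prior velocity fields, which in the framework are function spaces; the names and the specific geometry are assumptions made only for this example.

```python
import numpy as np

def project_onto_line(x, origin, direction):
    """Orthogonal projection of x onto the line {origin + s * direction}."""
    d = direction / np.linalg.norm(direction)
    return origin + np.dot(x - origin, d) * d

# Two skew lines: the first plays the role of V, the second the role of P.
line_V = (np.zeros(3), np.array([1.0, 0.0, 0.0]))
line_P = (np.array([0.0, 0.0, 1.0]), np.array([0.5, np.sqrt(3) / 2, 0.0]))

v = np.array([5.0, 0.0, 0.0])           # arbitrary starting point on line_V
for _ in range(50):
    u = project_onto_line(v, *line_P)   # closest point of P to the current v
    v = project_onto_line(u, *line_V)   # project back onto V
gap = np.linalg.norm(u - v)             # approaches the distance between the lines
print(v, gap)
```

In this configuration the lines are separated only along the third coordinate, so the iterates settle at the pair of closest points and `gap` approaches the distance between the two lines.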

Since vt is unknown in our case, we suggest leveraging the continuity equation (4), which provides both a sufficient and a necessary condition for the generating velocity field of ϕt in terms of ψt and its partial derivatives. Specifically, we introduce a projection procedure to recover the closest element in the prior class to the reconstruction flow by solving the following matching optimization problem:

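$$u(\cdot, t) = \arg\min_{u_t \in \mathcal{P}} \rho(u_t, \psi_t), \qquad (5)$$

where

$$\rho(u_t, \psi_t) = \int_{\mathbb{R}^d} \left| \frac{\partial}{\partial t}\psi_t(x) + \operatorname{div}\left(\psi_t(x)\, u_t(x)\right) \right|^2 dx. \qquad (6)$$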

This procedure is illustrated in the right inset, where ut (red dot) is the closest point to vt on P. Notably, neither ϕt nor vt appear in equation 5, aligning with our objective of relying solely on ψt.

Following the alternating projections concept, the matched $u_t$ should be projected back onto $\mathcal{V}$ to propose a better candidate for $v$. This corresponds to a flow matching problem in $u_t$. We refer to this procedure as ReMatching and introduce the flow ReMatching loss, $\mathcal{L}_{\text{RM}}$, a matching loss striving for the reconstruction flow to match $u_t$. That is,

$$\mathcal{L}_{\text{RM}}(\theta) = \mathbb{E}_{t \sim U[0,1]}\, \rho(u_t, \psi_t), \qquad (7)$$

where $u_t$ is a solution of equation 5 and $\theta$ denotes solely the parameters of $\psi_t$.

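Algorithm 1 ReMatching loss

Require: Solver for equation 5, times $\{t_l\}$

$\mathcal{L}_{\text{RM}} = 0$

for $t \in \{t_l\}$ do

  $u_t(\cdot) \leftarrow \operatorname{solve}(\rho, \psi_t(\cdot))$

  $\mathcal{L}_{\text{RM}} \leftarrow \mathcal{L}_{\text{RM}} + \rho(u_t(\cdot), \psi_t)$

end for

Return: $\mathcal{L}_{\text{RM}}$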

The ReMatching loss is designed to supplement a reconstruction loss $\mathcal{L}_{\text{REC}}$ on the model parameters $\theta$. Thus, our framework's final loss for dynamic reconstruction training is

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{REC}}(\theta) + \lambda \mathcal{L}_{\text{RM}}(\theta), \qquad (8)$$

where $\lambda > 0$ is a hyper-parameter. Note that in practice, the integral in equation 7 is approximated by a sum using random samples $\{t_l\} \sim U[0,1]$. In addition, for the ReMatching procedure to be seamlessly incorporated into a reconstruction training process, it is essential that problem 5 can be solved efficiently. To support this, in Section 4 we provide linear constructions for $\mathcal{P}$ that are sufficiently expressive to encompass the considered prior classes while allowing efficient solutions to equation 5. Algorithm 1 summarizes the details of computing equation 7. Note that calculating $\nabla_\theta \mathcal{L}_{\text{RM}}$ does not necessarily require the cumbersome calculation of $\nabla_\theta \arg\min_{u_t \in \mathcal{P}} \rho(u(\cdot, t), \psi_t)$, since according to Danskin's theorem (Madry, 2017), $\nabla_\theta \rho(u_t, \psi_t) = \nabla_\theta \min_{u_t \in \mathcal{P}} \rho(u_t, \psi_t)$. Additional implementation details regarding the losses can be found in the Appendix.
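The loop of Algorithm 1 can be sketched for the particle setting as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the prior class is taken to be the linear span of a few basis velocity fields, the matching problem is solved with a generic least-squares routine, and the names `rematching_loss`, `positions`, `velocities`, and `translation_basis` are hypothetical. Gradients with respect to $\theta$ are not shown; in an autodiff framework the solved coefficients would be treated as constants, in line with the Danskin argument above.

```python
import numpy as np

def rematching_loss(positions, velocities, basis, times):
    """Algorithm-1-style loop for particle models.

    positions(t)  -> (n, d) particle positions at time t
    velocities(t) -> (n, d) time derivatives of the positions
    basis(x)      -> (k, n, d) values of k basis velocity fields at points x
    The prior class is the linear span of the basis fields (an assumption).
    """
    total = 0.0
    for t in times:
        x, xdot = positions(t), velocities(t)
        B = basis(x)                                    # (k, n, d)
        M = B.reshape(B.shape[0], -1).T                 # (n*d, k) design matrix
        beta, *_ = np.linalg.lstsq(M, xdot.ravel(), rcond=None)  # matching step
        residual = M @ beta - xdot.ravel()              # matched field minus velocity
        total += float(residual @ residual)             # rho(u_t, psi_t)
    return total / len(times)                           # sample mean over times

def translation_basis(x):
    """Constant fields e_1, ..., e_d: the prior class of pure translations."""
    n, d = x.shape
    return np.stack([np.tile(np.eye(d)[j], (n, 1)) for j in range(d)])

rng = np.random.default_rng(0)
x0 = rng.normal(size=(50, 2))
shift = np.array([0.3, -0.1])
times = rng.uniform(0.0, 1.0, size=8)

# A translating point set lies in the prior class; an expanding one does not.
loss_translate = rematching_loss(lambda t: x0 + t * shift,
                                 lambda t: np.tile(shift, (len(x0), 1)),
                                 translation_basis, times)
loss_expand = rematching_loss(lambda t: x0 * np.exp(t),
                              lambda t: x0 * np.exp(t),
                              translation_basis, times)
print(loss_translate, loss_expand)
```

The translating motion yields a loss that is numerically zero, while the expanding motion is penalized, which is the behavior the loss is meant to have for a motion outside the prior class.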


4 FRAMEWORK INSTANCES

This section presents several instances of the ReMatching framework discussed in this work. One notable setting is when $V = \mathbb{R}^{n \times d}$, i.e., $\psi_t = (\gamma_t^1, \dots, \gamma_t^n)^T$, where each $\gamma^i: \mathbb{R}_+ \to \mathbb{R}^d$. In this case, equation 6 becomes:

$$\rho(u_t, \psi_t) = \sum_{i=1}^{n} \left\| u_t(\gamma_t^i) - \frac{d}{dt}\gamma_t^i \right\|^2. \qquad (9)$$

Details of this derivation can be found in Section 8.1.2. For the settings where $V = C^1(\mathbb{R}^d)$, equation 6 involves the computation of a spatial integral, which can be approximated by sampling a set of points $\{x_i\}_{i=1}^{n}$. Moreover, taking into account that all prior classes incorporated in this work are divergence-free, equation 6 becomes:

$$\rho(u_t, \psi_t) = \sum_{i=1}^{n} \left| \frac{\partial}{\partial t}\psi_t(x_i) + \left\langle \nabla\psi_t(x_i), u_t(x_i) \right\rangle \right|^2, \qquad (10)$$

since $\operatorname{div}\left(\psi_t(x) u_t(x)\right) = \left\langle \nabla_x \psi_t(x), u_t(x) \right\rangle + \psi_t(x) \operatorname{div} u_t(x)$.

We now formulate several useful prior classes of velocity fields $\mathcal{P}$. A key feature of all the following constructions is their reliance on linear parameterizations, capitalizing on the fact that linear subspaces are sufficiently expressive to represent the velocity-based prior classes considered. This approach enables the use of efficient solvers for problem 5, reducing the computational task to solving a system of $d$ linear equations, with a run-time complexity of at most $O(n)$.

4.1 PRIOR DESIGN

Directional restricted deformation. In certain scenarios, it is safe to assume that the reconstruction flow can only deform along specific directions. For example, in an indoor scene, where furniture is placed on the floor, deformations would typically occur only in directions parallel to the floor plane. Let $\operatorname{span}\{v_1, \dots, v_l\}$, $1 \leq l \leq d$, be the subspace of directions in which the flow remains static, where $\{v_1, \dots, v_l\}$ is a predefined orthonormal basis. Then, the prior class becomes:

$$\mathcal{P}_{I} = \{u_t \mid \langle u(x,t), v_m \rangle = 0, \ \forall m \in [l]\}. \qquad (11)$$

When considering the matching minimization problem 5 in the settings of equation 9, we get:

$$\min_{u_t \in \mathcal{P}_I} \sum_{i=1}^{n} \left\| u_t(\gamma_t^i) - \frac{d}{dt}\gamma_t^i \right\|^2 = \sum_{i=1}^{n} \left\| V^T \frac{d}{dt}\gamma_t^i \right\|^2, \qquad (12)$$

where $V = [v_1, \dots, v_l]$. For the settings involving equation 10, the matching minimization problem is solved by:

tψt(xi) + ψt(xi),ut(xi)

tψt(xi)2 (1 ψt(xi),V ψt(xi)

(13) where V = (I V V T ).
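For the particle setting of equation 9, the matching step for the directional prior can be checked numerically: removing the component of each particle velocity along the static directions gives the closest admissible field, and the remaining residual is the energy of the velocities along those directions. The sketch below assumes a single static direction (the z axis) and random velocities; both choices, and all variable names, are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
gamma_dot = rng.normal(size=(n, d))     # particle velocities, one row per particle

# Hypothetical static direction: the z axis, stored as a (d, l) matrix with l = 1.
V = np.array([[0.0], [0.0], [1.0]])

# Closest admissible velocity per particle: apply (I - V V^T) to each row.
u = gamma_dot - gamma_dot @ V @ V.T

matched_residual = np.sum((u - gamma_dot) ** 2)   # residual of the matched field
closed_form = np.sum((gamma_dot @ V) ** 2)        # sum_i ||V^T gamma_dot_i||^2
print(matched_residual, closed_form)
```

Both quantities agree, and `u @ V` vanishes, confirming that the matched field has no component along the static direction.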

Rigid deformation. One widely used prior in the dynamic reconstruction literature is rigidity, i.e., objects in a scene can only be deformed by a rigid transformation consisting of a translation and an orthogonal transformation. In a simple case, where it is assumed that the underlying dynamics consists of one rigid motion, the reconstruction flow would be of the form

$$\gamma(t) = R(t)x_0 + b(t), \qquad (14)$$

with $R(t) \in O(3)$ and $b(t) \in \mathbb{R}^3$. Differentiating $\gamma$ and solving for $x_0$ yields that

$$\frac{d}{dt}\gamma(t) = \dot{R}(t)R^T(t)\left(\gamma(t) - b(t)\right) + \dot{b}(t). \qquad (15)$$

Since $\dot{R}(t)R^T(t)$ is a skew-symmetric matrix, we suggest the following natural parameterization for the prior class

$$\mathcal{P}_{II} = \{u_t \mid u(x,t) = A_t x + b_t, \ A_t \in \mathbb{R}^{d \times d}, \ A_t = -A_t^T, \ b_t \in \mathbb{R}^d\}. \qquad (16)$$
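The two facts used in this derivation, that the product of the time derivative of a rotation with its transpose is skew-symmetric and that the velocity of a rigidly moving point is affine in its current position, can be verified numerically. The sketch below uses a hypothetical rotation about the z axis and a hypothetical translation, with derivatives approximated by central differences; none of these choices come from the paper.

```python
import numpy as np

def rotation_z(t):
    """Hypothetical rotation about the z axis by the angle t**2."""
    a = t ** 2
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def translation(t):
    """Hypothetical time-dependent translation b(t)."""
    return np.array([t, 0.5 * t, -t ** 2])

def gamma(t, x0):
    """Rigid trajectory of the point x0: R(t) x0 + b(t)."""
    return rotation_z(t) @ x0 + translation(t)

t, h = 0.7, 1e-5
x0 = np.array([0.2, -1.0, 0.4])

# Central-difference approximations of the time derivatives.
R_dot = (rotation_z(t + h) - rotation_z(t - h)) / (2 * h)
b_dot = (translation(t + h) - translation(t - h)) / (2 * h)
gamma_dot = (gamma(t + h, x0) - gamma(t - h, x0)) / (2 * h)

A = R_dot @ rotation_z(t).T                        # expected to be skew-symmetric
affine_velocity = A @ (gamma(t, x0) - translation(t)) + b_dot
print(np.abs(A + A.T).max(), np.abs(gamma_dot - affine_velocity).max())
```

Both printed deviations are at the level of the finite-difference error, which supports parameterizing the rigid prior by a skew-symmetric matrix and a translation vector.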


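Substituting $\mathcal{P}_{II}$ in problem 5 using equation 9 yields the following minimization problem:

$$\min_{(A_t, b_t)} \sum_{i=1}^{n} \left\| A_t \gamma_t^i + b_t - \frac{d}{dt}\gamma_t^i \right\|^2 \quad \text{s.t.} \quad A_t = -A_t^T. \qquad (17)$$

For the settings involving equation 10, the minimization problem 5 becomes:

$$\min_{(A_t, b_t)} \sum_{i=1}^{n} \left| \frac{\partial}{\partial t}\psi_t(x_i) + \left\langle \nabla\psi_t(x_i), A_t x_i + b_t \right\rangle \right|^2 \quad \text{s.t.} \quad A_t = -A_t^T. \qquad (18)$$

Importantly, both 17 and 18 are constrained least-squares problems. Thus, as detailed in Lemma 1, they enjoy an analytic solution that can be computed efficiently.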
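One way to see why problem 17 reduces to ordinary least squares is to remove the constraint by reparameterization. In three dimensions a skew-symmetric matrix acts as a cross product with a vector $\omega$, so the unknowns $(\omega, b_t)$ enter linearly. The sketch below follows this route with a generic solver; it is an illustrative alternative to the closed form of Lemma 1, and the function names and synthetic data are assumptions.

```python
import numpy as np

def skew(w):
    """Matrix [w]_x such that skew(w) @ x equals np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def fit_rigid_velocity(gamma, gamma_dot):
    """Least-squares fit of u(x) = A x + b with A skew-symmetric, for d = 3."""
    # A x = w x x = -[x]_x w, so each particle contributes the rows [-[x]_x, I].
    M = np.concatenate(
        [np.concatenate([-skew(x), np.eye(3)], axis=1) for x in gamma])
    sol, *_ = np.linalg.lstsq(M, gamma_dot.ravel(), rcond=None)
    A, b = skew(sol[:3]), sol[3:]
    residual = float(np.sum((gamma @ A.T + b - gamma_dot) ** 2))
    return A, b, residual

# Synthetic check: velocities generated by a known rigid motion are recovered.
rng = np.random.default_rng(0)
gamma = rng.normal(size=(30, 3))
w_true = np.array([0.3, -0.2, 0.5])
b_true = np.array([1.0, 0.0, -0.5])
gamma_dot = np.cross(w_true, gamma) + b_true

A, b, residual = fit_rigid_velocity(gamma, gamma_dot)
print(A, b, residual)
```

For velocities that are exactly rigid the fitted residual is numerically zero; for non-rigid motion the residual is the value of the matching objective for this prior class.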

Volume-preserving deformation. So far we have only covered prior classes that may be too simplistic for capturing complex real-world dynamics. To address this, a reasonable assumption would be to include deformations that preserve the volume of any subset of the space. Notably, the rigid deformations prior class discussed earlier strictly falls within this class as well. Interestingly, volume-preserving flows are characterized by being generated via a divergence-free velocity field, i.e., $\operatorname{div} u = 0$. To this end, we propose the following prior class:

$$\mathcal{P}_{III} = \left\{ u_t \ \middle|\ u_t(x) = \sum_{j=1}^{k} \beta_j b_j(x), \ \beta = [\beta_1, \dots, \beta_k]^T \in \mathbb{R}^k \right\}, \qquad (19)$$

where for each basis function $b_j: \mathbb{R}^d \to \mathbb{R}^d$, we assume that $\operatorname{div}(b_j) = 0$. Clearly, $\operatorname{div}(u_t) = 0$ for any choice of $\beta \in \mathbb{R}^k$. Taking into account that $\operatorname{div} \operatorname{curl} u = 0$, we follow Eisenberger et al. (2019), and incorporate the following basis functions:

$$b_j(x) \in \{\operatorname{curl}(\phi_j(x) e_1^T), \dots, \operatorname{curl}(\phi_j(x) e_d^T)\}, \qquad (20)$$

where $\phi_j: \mathbb{R}^d \to \mathbb{R}$, $\phi_j(x) = \prod_{l=1}^{d} \sin(j_l \pi e_l^T x)$ with $j_l \in \mathbb{N}$ denoting the frequency for the $l$th coordinate of the $j$th basis function. Combining this prior with equation 9 yields the following minimization problem:

$$\min_{\beta \in \mathbb{R}^k} \sum_{i=1}^{n} \left\| \sum_{j=1}^{k} \beta_j b_j(\gamma_t^i) - \frac{d}{dt}\gamma_t^i \right\|^2. \qquad (21)$$

Similarly, for the case of equation 10, we get:

$$\min_{\beta \in \mathbb{R}^k} \sum_{i=1}^{n} \left| \frac{\partial}{\partial t}\psi_t(x_i) + \left\langle \nabla\psi_t(x_i), \sum_{j=1}^{k} \beta_j b_j(x_i) \right\rangle \right|^2. \qquad (22)$$

In particular, both minimization problems of 21 and 22 correspond to a standard least-squares problem and have an analytic solution that can be efficiently computed. A key decision involved in using the prior class $\mathcal{P}_{III}$ is to select the number of basis functions $k$. However, setting $k$ equal to a large value would make $\mathcal{P}_{III}$ overly permissive, effectively neutralizing the ReMatching loss. To address this, in what follows, we propose an additional procedure for constructing more complex prior classes, based on an adaptive choice of complexity level.
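The construction of a divergence-free basis and the resulting least-squares fit can be sketched in three dimensions as follows. The sketch assumes that each scalar potential is a product of sines, uses the identity $\operatorname{curl}(\phi e) = \nabla\phi \times e$ for a constant vector $e$, and checks the divergence by central differences. The chosen frequencies, the three basis fields, and all function names are hypothetical.

```python
import numpy as np

def grad_phi(x, freq):
    """Gradient of phi(x) = prod_l sin(freq_l * pi * x_l) at points x of shape (n, 3)."""
    s = np.sin(np.pi * freq * x)
    c = np.cos(np.pi * freq * x) * np.pi * freq   # derivative of each factor
    g = np.empty_like(x)
    for l in range(3):
        m1, m2 = [m for m in range(3) if m != l]
        g[:, l] = c[:, l] * s[:, m1] * s[:, m2]
    return g

def curl_basis(x, freq, axis):
    """b(x) = curl(phi(x) e_axis) = grad phi(x) x e_axis."""
    return np.cross(grad_phi(x, freq), np.eye(3)[axis])

def divergence(field, x, h=1e-5):
    """Central-difference divergence of a field mapping (n, 3) -> (n, 3)."""
    div = np.zeros(len(x))
    for l in range(3):
        step = np.zeros(3)
        step[l] = h
        div += (field(x + step)[:, l] - field(x - step)[:, l]) / (2 * h)
    return div

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(200, 3))
specs = [(np.array([1.0, 1.0, 1.0]), 0),
         (np.array([1.0, 1.0, 1.0]), 1),
         (np.array([1.0, 2.0, 1.0]), 2)]

max_div = max(np.abs(divergence(lambda p, f=f, a=a: curl_basis(p, f, a), x)).max()
              for f, a in specs)

# Least-squares fit of the coefficients beta from particle velocities.
M = np.stack([curl_basis(x, f, a).ravel() for f, a in specs], axis=1)
beta_true = np.array([0.5, -1.0, 2.0])
gamma_dot = (M @ beta_true).reshape(-1, 3)
beta, *_ = np.linalg.lstsq(M, gamma_dot.ravel(), rcond=None)
print(max_div, beta)
```

The size of the least-squares system grows with the number of basis functions, which is the complexity knob discussed in the surrounding text.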

Figure 1: A vector field in $\mathcal{P}_{V}$.

Adaptive-combination of prior classes. To address the challenge of setting the complexity level of the prior class, we introduce an adaptive (learnable) construction scheme for a prior class. Let $w_j(x,t): \mathbb{R}^d \times \mathbb{R}_+ \to \mathbb{R}$, $1 \leq j \leq k$, be learnable functions, which are part of the reconstruction model, i.e., $w_j(\cdot, t) \in \Psi_t$, and which are normalized, i.e., $\sum_{j=1}^{k} w_j(x,t) = 1$. The details of the architecture of $w_j$ are left to Section 5. We can construct a complex prior class by assigning simpler prior classes to different parts of the space, according to the weights $w_j$. For example, let us consider a piece-wise rigid deformation prior class defined as:

$$\mathcal{P}_{IV} = \left\{ u_t \ \middle|\ u(x,t) = \sum_{j=1}^{k} w_j(x,t) u_j(x,t), \ u_j \in \mathcal{P}_{II} \text{ for } 1 \leq j \leq k \right\}. \qquad (23)$$


In a similar manner, we can also combine $\mathcal{P}_{I}$ with rigid deformations and derive a prior class defined as:

$$\mathcal{P}_{V} = \left\{ u_t \ \middle|\ u(x,t) = \sum_{j=1}^{k} w_j(x,t) u_j(x,t), \ u_1 \in \mathcal{P}_{I}, \ u_j \in \mathcal{P}_{II} \text{ for } 2 \leq j \leq k \right\}. \qquad (24)$$

Figure 1 illustrates an element in $\mathcal{P}_{V}$, with weights $w_j$ dividing the plane into a restricted up-direction deformation above the diagonal and a rigid deformation below the diagonal. Note that directly substituting an adaptive-combination prior class in problem 5 would no longer yield a linear problem. Therefore, we propose to use a linear problem that upper bounds the matching optimization problem 5. For example, in the case of equation 9 with $\mathcal{P}_{IV}$, we can solve:

$$\min_{\{(A_{jt}, b_{jt})\}} \sum_{i=1}^{n} \sum_{j=1}^{k} w_j(\gamma_t^i, t) \left\| A_{jt}\gamma_t^i + b_{jt} - \frac{d}{dt}\gamma_t^i \right\|^2 \quad \text{s.t.} \quad A_{jt} = -A_{jt}^T. \qquad (25)$$

Using Jensen's inequality, it can be seen that problem 25 upper bounds the matching optimization from problem 5. Now, problem 25 can be solved efficiently, as it corresponds to a weighted least-squares problem that is solvable in parallel for each $j \in [k]$, similarly to problem 17.
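The Jensen bound behind this relaxation can be checked numerically. Because the weights at each particle are non-negative and sum to one, the squared residual of the blended field is never larger than the weighted sum of the per-part squared residuals. In the sketch below the per-part fields are random arrays rather than fitted rigid motions, and the weights come from a softmax over random logits; these are assumptions made only to exercise the inequality.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 200, 4, 3
gamma_dot = rng.normal(size=(n, d))      # target particle velocities
u_parts = rng.normal(size=(k, n, d))     # each part's field evaluated at the particles
logits = rng.normal(size=(n, k))
w = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)   # rows sum to 1

# Residual of the blended field sum_j w_j u_j, which couples all k parts.
blended = np.einsum("nk,knd->nd", w, u_parts)
lhs = np.sum((blended - gamma_dot) ** 2)

# Weighted sum of per-part residuals, which decouples over the parts.
per_part = np.sum((u_parts - gamma_dot[None]) ** 2, axis=-1)     # shape (k, n)
rhs = np.sum(w.T * per_part)
print(lhs, rhs)
```

Since the right-hand side separates over the parts, each part can be fitted independently, which is what makes the relaxed problem a set of weighted least-squares problems.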

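Lastly, incorporating $\mathcal{P}_{II}$ into equation 5 results in a non-standard least-squares problem with a constraint. The following lemma, with its proof provided in the Appendix, formulates the solutions for using $\mathcal{P}_{IV}$ in equation 5, covering problems 17 and 18 as a special case.

Lemma 1. For the prior class $\mathcal{P}_{IV}$, the solutions $(A_{jt}, b_{jt})$ to the minimization problem 25 are given by

$$\begin{bmatrix} \operatorname{vech}(A_{jt}) \\ b_{jt} \end{bmatrix} = P_{jt}^{-1} \begin{bmatrix} \operatorname{vec}\left( \dot{\Gamma}_t^T W_{jt} \Gamma_{jt} - \Gamma_{jt}^T W_{jt} \dot{\Gamma}_t \right) \\ \frac{1}{\mathbf{1}^T W_{jt} \mathbf{1}} \mathbf{1}^T W_{jt} \dot{\Gamma}_t \end{bmatrix},$$

where $\Gamma_{jt} = [\gamma_t^1, \dots, \gamma_t^n]^T \in \mathbb{R}^{n \times d}$, $\dot{\Gamma}_t = \left[ \frac{d}{dt}\gamma_t^1, \dots, \frac{d}{dt}\gamma_t^n \right]^T \in \mathbb{R}^{n \times d}$, $W_{jt} = \sum_{i=1}^{n} w_j(\gamma_t^i, t) e_i e_i^T$ with $\{e_i\}$ as the standard basis in $\mathbb{R}^n$, $\operatorname{vech}(A_{jt}) \in \mathbb{R}^{\frac{d(d-1)}{2}}$ denotes the half-vectorization of the anti-symmetric matrix $A_{jt}$, and the matrix $P_{jt}^{-1}$ depends solely on $\Gamma_{jt}$, $\dot{\Gamma}_t$, and $W_{jt}$.

The solutions $(A_{jt}, b_{jt})$ to the minimization problem 5 with equation 10 are given by

$$\begin{bmatrix} \operatorname{vech}(A_{jt}) \\ b_{jt} \end{bmatrix} = P_{jt}^{-1} \begin{bmatrix} \sum_{i=1}^{n} w_j(x_i, t) \operatorname{vec}\left( s_t^i \left( x_i [g_t^i]^T - g_t^i x_i^T \right) \right) \\ G_t^T W_{jt} s \end{bmatrix},$$

where $g_t^i = [\nabla\psi_t(x_i)]^T$, $G_t = [g_t^1, \dots, g_t^n]^T \in \mathbb{R}^{n \times d}$, $s_t^i = \frac{\partial}{\partial t}\psi_t(x_i)$, and $s = [s_t^1, \dots, s_t^n]^T \in \mathbb{R}^n$.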

5 IMPLEMENTATION DETAILS

In this section, we provide additional details about the dynamic image model Ψt employed in this work, based on Gaussian Splatting (Kerbl et al., 2023). We provide an overview of this image model, followed by details about the dynamic model used in the experiments.

Gaussian Splatting image model. The Gaussian Splatting image model is parameterized by a collection of $n$ 3D Gaussians augmented with color and opacity parameters. That is, $\theta = \{\mu^i, \Sigma^i, c^i, \alpha^i\}_{i=1}^{n}$ with $\mu^i \in \mathbb{R}^3$ denoting the $i$th Gaussian mean, $\Sigma^i \in \mathbb{R}^{3 \times 3}$ its covariance matrix, $c^i \in \mathbb{R}^3$ its color, and $\alpha^i \in \mathbb{R}$ its opacity. To render an image, the 3D Gaussians are projected to the image plane to form a collection of 2D Gaussians parameterized by $\{\mu^i_{2D}, \Sigma^i_{2D}\}$. Given $K, E$ denoting the intrinsic and extrinsic camera transformations, the image plane Gaussian parameters are calculated using the point rendering formula:

$$\mu^i_{2D} = K \frac{E\mu^i}{(E\mu^i)_z}, \qquad \Sigma^i_{2D} = J E \Sigma^i E^T J^T, \qquad (26)$$

where $J$ denotes the Jacobian of the affine transformation of equation 26. Lastly, an image pixel $I(p)$ is obtained by alpha-blending the visible Gaussians ordered by depth:

$$I(p) = \sum_{i=1}^{n} c^i \alpha^i \sigma^i(p) \prod_{j=1}^{i-1} \left(1 - \alpha^j \sigma^j(p)\right), \qquad (27)$$

where $\sigma^i(p) = \exp\left(-\frac{1}{2} (p - \mu^i_{2D})^T (\Sigma^i_{2D})^{-1} (p - \mu^i_{2D})\right)$.
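The blending rule of equation 27 can be sketched for a single pixel as a front-to-back accumulation. The code below is a plain NumPy illustration, not the tile-based rasterizer used in practice; it assumes the 2D Gaussians are already projected and sorted by depth, and the function names and the two-Gaussian example are hypothetical.

```python
import numpy as np

def gaussian_weight(p, mu_2d, sigma_2d):
    """exp(-0.5 (p - mu)^T Sigma^{-1} (p - mu)) for one projected 2D Gaussian."""
    diff = p - mu_2d
    return float(np.exp(-0.5 * diff @ np.linalg.solve(sigma_2d, diff)))

def render_pixel(p, mus, sigmas, colors, alphas):
    """Front-to-back alpha blending of Gaussians that are already sorted by depth."""
    pixel = np.zeros(3)
    transmittance = 1.0            # running product of (1 - alpha_j * sigma_j(p))
    for mu, sigma, c, a in zip(mus, sigmas, colors, alphas):
        contrib = a * gaussian_weight(p, mu, sigma)
        pixel += c * contrib * transmittance
        transmittance *= 1.0 - contrib
    return pixel

# Two half-opaque Gaussians centred on the pixel: red in front of blue.
p = np.array([4.0, 2.0])
pixel = render_pixel(
    p,
    mus=[p, p],
    sigmas=[np.eye(2), np.eye(2)],
    colors=[np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])],
    alphas=[0.5, 0.5],
)
print(pixel)
```

In this example the front Gaussian contributes half of its color, and the back Gaussian is attenuated by the remaining transmittance, so it contributes a quarter of its color.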


Dynamic image model. We utilize the Gaussian Splatting image model to construct our dynamic model as:

$$\Psi_t = \{\mu^i + \mu^i(t), \ \Sigma^i + \Sigma^i(t), \ c^i, \ \alpha^i, \ w_{ij}(t)\}_{i=1}^{n}, \qquad (28)$$

where $\mu^i(t) = f_\mu(\mu^i, t)$, $\Sigma^i(t) = f_\Sigma(\mu^i, t)$, and $w_{ij}(t) = e_j^T \operatorname{softmax}(f_w(\mu^i + \mu^i(t), \mu^i, t))$. Following Yang et al. (2023), each of the functions $f_\mu: \mathbb{R}^3 \times \mathbb{R} \to \mathbb{R}^3$, $f_\Sigma: \mathbb{R}^3 \times \mathbb{R} \to \mathbb{R}^6$, and $f_w: \mathbb{R}^3 \times \mathbb{R}^3 \times \mathbb{R} \to \mathbb{R}^k$ is a multilayer perceptron (MLP). For more details regarding the MLP architectures, we refer the reader to the Appendix. Note that the model element $w_{ij}(t)$ is only relevant to instances where the adaptive-combination prior class is assumed. Lastly, in our experiments we apply the ReMatching loss for $\mu^i + \mu^i(t)$, and for time-dependent rendered images $I_t$.

Training details. We follow the training protocol of Yang et al. (2023). We initialize the model using $n = 100$K 3D Gaussians. Training is done for 40K iterations, where for the first 3K iterations, only $\{\mu^i, \Sigma^i, c^i, \alpha^i\}_{i=1}^{n}$ are optimized. In instances where the adaptive-combination prior class is applied, we supplement the ReMatching optimization objective with an entropy loss on the weights $w_{ij}$ as follows:

$$\mathcal{L}_{\text{entropy}} = \frac{1}{n}\sum_{i=1}^{n} w_{ij} \log\left( \frac{1}{n}\sum_{i=1}^{n} w_{ij} \right). \qquad (29)$$

Lastly, for all the experiments considered in this work, we set the ReMatching loss weight $\lambda = 0.001$.

6 EXPERIMENTS

We evaluate the ReMatching framework on benchmarks involving synthetic and real-world video captures of deforming scenes. For quantitative analysis in both cases, we report the PSNR, SSIM (Wang et al., 2004), and LPIPS (Zhang et al., 2018) metrics. Additional evaluations, including experiments on more synthetic and real-world datasets, hyperparameter ablation, and the framework's applicability to an alternative dynamic image model, are provided in the Appendix.

D-NeRF synthetic. The D-NeRF dataset (Pumarola et al., 2021) comprises 8 scenes, each consisting of 100 to 200 frames, hence providing a dense multi-view coverage of the scene. We follow D-NeRF's evaluation protocol and use the same train/validation/test split at 800 × 800 image resolution with a black background. In terms of baseline methods, we consider recent state-of-the-art dynamic models, including Deformable 3D Gaussians (D3G) (Yang et al., 2023), 3D Geometry-aware Deformable Gaussians (GA3D) (Lu et al., 2024), Neural Parametric Gaussians (NPG) (Das et al., 2024), and K-Planes (Fridovich-Keil et al., 2023). Note that some of these baselines incorporate prior regularization losses such as local rigidity and smoothness into their optimization procedure. Table 1 summarizes the average image quality results for unseen frames in each scene. We include two variants of our framework: i) using the divergence-free prior $\mathcal{P}_{III}$; and ii) using the adaptive-combination prior class $\mathcal{P}_{IV}$, or the class $\mathcal{P}_{V}$ specifically for scenes that include a floor component. Figure 2 provides a qualitative comparison of rendered test frames, highlighting the improvements of our approach, which: i) produces plausible reconstructions that avoid unrealistic distortions, e.g., the human fingers in the Jumping Jacks scene; ii) reduces rendering artifacts of extraneous parts, especially in moving parts such as the leg in the T-Rex scene.

| Method | Bouncing Balls (LPIPS / PSNR / SSIM) | Hell Warrior (LPIPS / PSNR / SSIM) | Hook (LPIPS / PSNR / SSIM) | Jumping Jacks (LPIPS / PSNR / SSIM) |
|---|---|---|---|---|
| K-Planes (Fridovich-Keil et al., 2023) | 0.0242 / 37.78 / 0.9929 | 0.1074 / 32.57 / 0.9316 | 0.0655 / 29.46 / 0.9481 | 0.0417 / 31.73 / 0.9715 |
| GA3D (Lu et al., 2024) | 0.0093 / 40.76 / 0.9950 | 0.0210 / 41.30 / 0.9871 | 0.0124 / 37.78 / 0.9887 | 0.0121 / 37.00 / 0.9887 |
| D3G (Yang et al., 2023) | 0.0089 / 41.52 / 0.9978 | 0.0261 / 41.28 / 0.9928 | 0.0165 / 37.03 / 0.9906 | 0.0137 / 37.59 / 0.9930 |
| Ours ($\mathcal{P}_{III}$) | 0.0087 / 41.84 / 0.9979 | 0.0244 / 41.59 / 0.9932 | 0.0161 / 37.19 / 0.9909 | 0.0134 / 37.72 / 0.9931 |
| Ours ($\mathcal{P}_{IV}$ or $\mathcal{P}_{V}$) | 0.0089 / 41.61 / 0.9978 | 0.0245 / 41.69 / 0.9977 | 0.0158 / 37.39 / 0.9911 | 0.0131 / 38.01 / 0.9934 |

NPG (Das et al., 2024): 0.0537 38.68 0.9780 0.0460 33.39 0.9735 0.0345 33.97 0.9828

| Method | Lego (LPIPS / PSNR / SSIM) | Mutant (LPIPS / PSNR / SSIM) | Stand Up (LPIPS / PSNR / SSIM) | T-Rex (LPIPS / PSNR / SSIM) |
|---|---|---|---|---|
| K-Planes (Fridovich-Keil et al., 2023) | 0.0472 / 25.15 / 0.9431 | 0.0215 / 35.30 / 0.9825 | 0.0211 / 36.55 / 0.9831 | 0.0284 / 30.41 / 0.9778 |
| NPG (Das et al., 2024) | 0.0716 / 24.63 / 0.9312 | 0.0311 / 36.02 / 0.9840 | 0.0257 / 38.20 / 0.9889 | 0.0310 / 32.10 / 0.9959 |
| GA3D (Lu et al., 2024) | 0.0446 / 24.87 / 0.9420 | 0.0050 / 42.39 / 0.9951 | 0.0062 / 43.96 / 0.9948 | 0.0100 / 37.70 / 0.9929 |
| D3G (Yang et al., 2023) | 0.0453 / 24.93 / 0.9537 | 0.0066 / 42.09 / 0.9966 | 0.0083 / 43.85 / 0.9970 | 0.0105 / 37.89 / 0.9956 |
| Ours ($\mathcal{P}_{III}$) | 0.0503 / 24.89 / 0.9522 | 0.0067 / 42.13 / 0.9966 | 0.0085 / 43.99 / 0.9969 | 0.0105 / 38.07 / 0.9958 |
| Ours ($\mathcal{P}_{IV}$ or $\mathcal{P}_{V}$) | 0.0456 / 24.95 / 0.9537 | 0.0065 / 42.40 / 0.9968 | 0.0081 / 44.31 / 0.9971 | 0.0103 / 38.38 / 0.9961 |

Table 1: Image quality evaluation on unseen frames for the D-NeRF dataset (Pumarola et al., 2021).


Figure 2: Qualitative comparison of baselines and our model on the D-NeRF dataset (Pumarola et al., 2021). We note that our framework consistently produces high-fidelity reconstructions, accurately capturing fine-grained details, as highlighted in the blue boxes.

HyperNeRF real-world. The HyperNeRF dataset (Park et al., 2021b) consists of real-world videos capturing a diverse set of human activities involving interactions with common objects. We follow the evaluation protocol provided with the dataset, and use the same train/test split. In Table 2 we report image quality results for unseen frames on 5 scenes from the dataset: Slice Banana, Chicken, Lemon, Torch, and Split Cookie. Figure 3 shows a qualitative comparison to the baseline D3G (Yang et al., 2023). Our approach demonstrates similar types of improvements as noticed in the synthetic case, providing more realistic reconstructions, especially in areas involving deforming parts.

| Scene | Method | LPIPS | PSNR | SSIM |
|---|---|---|---|---|
| Slice Banana | GA3D | 0.4160 | 25.34 | 0.6722 |
| Slice Banana | D3G | 0.3692 | 24.87 | 0.7935 |
| Slice Banana | Ours ($\mathcal{P}_{III}$) | 0.3829 | 25.08 | 0.7992 |
| Slice Banana | Ours ($\mathcal{P}_{IV}$) | 0.3673 | 25.28 | 0.8025 |
| Chicken | GA3D | 0.4721 | 25.13 | 0.7555 |
| Chicken | D3G | 0.3030 | 26.66 | 0.8813 |
| Chicken | Ours ($\mathcal{P}_{III}$) | 0.2987 | 26.74 | 0.8836 |
| Chicken | Ours ($\mathcal{P}_{IV}$) | 0.3044 | 26.80 | 0.8835 |
| Lemon | GA3D | 0.3252 | 28.37 | 0.7596 |
| Lemon | D3G | 0.2858 | 28.65 | 0.8873 |
| Lemon | Ours ($\mathcal{P}_{III}$) | 0.2760 | 27.91 | 0.8842 |
| Lemon | Ours ($\mathcal{P}_{IV}$) | 0.2675 | 28.30 | 0.8883 |
| Torch | GA3D | 0.3278 | 23.79 | 0.8174 |
| Torch | D3G | 0.2340 | 25.41 | 0.9207 |
| Torch | Ours ($\mathcal{P}_{III}$) | 0.2221 | 26.00 | 0.9251 |
| Torch | Ours ($\mathcal{P}_{IV}$) | 0.2260 | 25.62 | 0.9229 |
| Split Cookie | GA3D | 0.1144 | 32.28 | 0.9290 |
| Split Cookie | D3G | 0.0971 | 32.61 | 0.9657 |
| Split Cookie | Ours ($\mathcal{P}_{III}$) | 0.1097 | 31.31 | 0.9600 |
| Split Cookie | Ours ($\mathcal{P}_{IV}$) | 0.0937 | 32.67 | 0.9667 |

Table 2: Unseen frames evaluation for the HyperNeRF dataset (Park et al., 2021b).

ReMatching time-dependent image. In this experiment we validate the applicability of the ReMatching loss for controlling model solutions via rendered images. To that end, we apply our framework with the $\mathcal{P}_{III}$ prior class to the Jumping Jacks scene from D-NeRF on a single specific front view through time. The qualitative comparison to D3G (Yang et al., 2023), as shown in the Appendix, supports the benefits of prior integration in this case as well, demonstrating more plausible reconstructions in areas involving moving parts.

Figure 4: Part assignments for the adaptive-combination prior class.

Adaptive-combination prior class. Employing the adaptive-combination prior classes PIV and PV with learnable part assignments {wij} raises the question of whether the learning process produces assignments {wij} that align with a segmentation of the scene into its deforming parts. Figure 4 shows our results for test frames from the Bouncing-Balls and Lego synthetic scenes (left), and the Chicken real-world scene (right). For comparison, we include the results of the Segment Anything Model (SAM) (Kirillov et al., 2023), which are mostly influenced by color variations. Consequently, SAM often over-segments the scene or incorrectly merges independently moving parts. See the supplementary material for additional segmentation results.

Figure 3: Qualitative comparison of our method to D3G (Yang et al., 2023) on the HyperNeRF dataset (Park et al., 2021b). Our framework yields more accurate reconstructions, in particular around moving parts.

7 CONCLUSIONS

We presented the ReMatching framework for integrating prior deformation classes into dynamic reconstruction models. In addition to offering useful constructions for velocity-field-based prior classes, a key focus of this work was the development of the ReMatching loss. This loss function optimizes for solutions that remain as close as possible to the desired prior class, rather than strictly enforcing membership, thereby minimizing the tradeoff between fidelity and the advantages of using priors. Our experimental results validate this approach, showing that ReMatching solutions adhere to the desired prior while achieving high-fidelity reconstruction. We believe that the generality with which the framework was formulated will enable broader applicability and that future advancements in this framework could be relevant to a wide range of dynamic reconstruction models. An interesting avenue for future research is the development of velocity-field-based prior classes emerging from video generative models, potentially using our ReMatching formulation for time-dependent image pixel colors. Another promising direction is the design of richer prior classes to handle more complex physical phenomena, such as those involving liquids and gases.

ACKNOWLEDGMENTS

The authors would like to thank Jonathan Lorraine and Heli Ben-Hamu for their insightful discussions and valuable comments.

REFERENCES

Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions, 2023. URL https://arxiv.org/abs/2303.08797.

Heli Ben-Hamu, Samuel Cohen, Joey Bose, Brandon Amos, Aditya Grover, Maximilian Nickel, Ricky TQ Chen, and Yaron Lipman. Matching normalizing flows and probability paths on manifolds. arXiv preprint arXiv:2207.04711, 2022.

Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 130-141, 2023.

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.

Mengyu Chu, Lingjie Liu, Quan Zheng, Erik Franz, Hans-Peter Seidel, Christian Theobalt, and Rhaleb Zayer. Physics informed neural fields for smoke reconstruction with sparse data. ACM Trans. Graph., 2022.

Devikalyan Das, Christopher Wewer, Raza Yunus, Eddy Ilg, and Jan Eric Lenssen. Neural parametric gaussians for monocular non-rigid object reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10715-10725, 2024.

Frank Deutsch. The method of alternating orthogonal projections. In Approximation theory, spline functions and applications, pp. 105-121. Springer, 1992.

Yilun Du, Yinan Zhang, Hong-Xing Yu, Joshua B Tenenbaum, and Jiajun Wu. Neural radiance flow for 4d view synthesis and video processing. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14304-14314. IEEE Computer Society, 2021.

Marvin Eisenberger, Zorah Lähner, and Daniel Cremers. Divergence-free shape correspondence by deformation. In Computer Graphics Forum, volume 38, pp. 1-12. Wiley Online Library, 2019.

Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp. 12479-12488. IEEE, 2023.

Haoyu Guo, Sida Peng, Yunzhi Yan, Linzhan Mou, Yujun Shen, Hujun Bao, and Xiaowei Zhou. Compact neural volumetric video representations with dynamic codebooks. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), July 2023. URL https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/.

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023.

Huan Ling, Seung Wook Kim, Antonio Torralba, Sanja Fidler, and Karsten Kreis. Align your gaussians: Text-to-4d with dynamic 3d gaussians and composed diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 2024.

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhöfer, Yaser Sheikh, and Jason M. Saragih. Mixture of volumetric primitives for efficient neural rendering. ACM Trans. Graph., 40(4):59:1-59:13, 2021.

Zhicheng Lu, Xiang Guo, Le Hui, Tianrui Chen, Min Yang, Xiao Tang, Feng Zhu, and Yuchao Dai. 3d geometry-aware deformable gaussian splatting for dynamic view synthesis. arXiv preprint arXiv:2404.06270, 2024.

Aleksander Madry. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Occupancy flow: 4d reconstruction by learning particle dynamics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5379-5389, 2019.

Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, 2021a.

Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Ricardo Martin-Brualla, and Steven M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. CoRR, abs/2106.13228, 2021b. URL https://arxiv.org/abs/2106.13228.

Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10318-10327, 2021.

Davis Rempe, Tolga Birdal, Yongheng Zhao, Zan Gojcic, Srinath Sridhar, and Leonidas J. Guibas. Caspr: Learning canonical spatiotemporal point cloud representations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

Zhongzheng Ren, Xiaoming Zhao, and Alex Schwing. Class-agnostic reconstruction of dynamic objects from videos. Advances in Neural Information Processing Systems, 34:509-522, 2021.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pp. 1530-1538. PMLR, 2015.

Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields. IEEE Transactions on Visualization and Computer Graphics, 29(5):2732-2742, 2023.

Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Symposium on Geometry processing, volume 4, pp. 109-116. Citeseer, 2007.

Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, 2021.

Tuan-Anh Vu, Duc Thanh Nguyen, Binh-Son Hua, Quang-Hieu Pham, and Sai-Kit Yeung. Rfnet-4d: Joint object reconstruction and flow estimation from 4d point clouds. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXIII, 2022.

Chaoyang Wang, Peiye Zhuang, Aliaksandr Siarohin, Junli Cao, Guocheng Qian, Hsin-Ying Lee, and Sergey Tulyakov. Diffusion priors for dynamic view synthesis from monocular videos. CoRR, abs/2401.05583, 2024.

Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600-612, 2004.

Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023.

Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023.

Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. CoRR, abs/2004.01294, 2020. URL https://arxiv.org/abs/2004.01294.

Hong-Xing Yu, Yang Zheng, Yuan Gao, Yitong Deng, Bo Zhu, and Jiajun Wu. Inferring hybrid neural fluid fields from videos. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.

Raza Yunus, Jan Eric Lenssen, Michael Niemeyer, Yiyi Liao, Christian Rupprecht, Christian Theobalt, Gerard Pons-Moll, Jia-Bin Huang, Vladislav Golyanik, and Eddy Ilg. Recent trends in 3d reconstruction of general non-rigid scenes. In Computer Graphics Forum, pp. e15062. Wiley Online Library, 2024.

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586-595, 2018.

8.1.1 PROOF OF LEMMA 1

Proof. (Lemma 1)

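Let $\Gamma_t$, $\dot{\Gamma}_t$, $W_{jt}$ be given. Without loss of generality, we show the proof for an individual $j \in [k]$ and $t$. To ease the notation, in what follows we omit the subscripts $t$ and $j$. Let $A \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^{d}$, and define $[w_1, \dots, w_n]^T = W\mathbf{1}$. First, we show that $\sum_{i=1}^{n} w_i \|A\gamma_i + b - \dot{\gamma}_i\|^2$ can be reformulated as a weighted norm-squared minimization problem in $A$ and $b$. That is,

$$\sum_{i=1}^{n} w_i \|A\gamma_i + b - \dot{\gamma}_i\|^2 = \sum_{i=1}^{n} \Big[ \operatorname{tr}\big(A \sqrt{w_i}\gamma_i (\sqrt{w_i}\gamma_i)^T A^T\big) \tag{30}$$

$$\qquad - 2\operatorname{tr}\big(\sqrt{w_i}(\dot{\gamma}_i - b)(\sqrt{w_i}\gamma_i)^T A^T\big) + \operatorname{tr}\big(\sqrt{w_i}(\dot{\gamma}_i - b)(\sqrt{w_i}(\dot{\gamma}_i - b))^T\big) \Big] \tag{31}$$

$$= \operatorname{tr}\big(A\Gamma^T W \Gamma A^T\big) - 2\operatorname{tr}\big((\dot{\Gamma} - \mathbf{1}b^T)^T W \Gamma A^T\big) \; + \tag{32}$$

$$\qquad \operatorname{tr}\big((\dot{\Gamma} - \mathbf{1}b^T)^T W (\dot{\Gamma} - \mathbf{1}b^T)\big) \tag{33}$$

$$= \big\|\sqrt{W}\big(\Gamma A^T - (\dot{\Gamma} - \mathbf{1}b^T)\big)\big\|^2. \tag{34}$$

Next, we consider the following optimization problem:

$$\min_{A,\,b} \big\|\sqrt{W}\big(\Gamma A^T - (\dot{\Gamma} - \mathbf{1}b^T)\big)\big\|^2 \quad \text{s.t. } A = -A^T. \tag{35}$$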

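Using the fact that $A = -A^T$, we define the following Lagrangian,

$$\mathcal{L}(A, b, \Lambda) = \big\|\sqrt{W}\big(\Gamma A - \mathbf{1}b^T + \dot{\Gamma}\big)\big\|^2 + \operatorname{tr}\Lambda^T (A + A^T). \tag{36}$$

Then,

$$\frac{\partial \mathcal{L}}{\partial A} = 2\Gamma^T W \big(\Gamma A - \mathbf{1}b^T + \dot{\Gamma}\big) + \Lambda + \Lambda^T.$$

Setting $\frac{\partial \mathcal{L}}{\partial A} = 0$ yields that $\Gamma^T W (\Gamma A - \mathbf{1}b^T + \dot{\Gamma})$ is symmetric. Then, using again the fact that $A = -A^T$, we get that,

$$\Gamma^T W \Gamma A + A\Gamma^T W \Gamma + b\mathbf{1}^T W \Gamma - \Gamma^T W \mathbf{1}b^T = \dot{\Gamma}^T W \Gamma - \Gamma^T W \dot{\Gamma}. \tag{37}$$

Now, taking the derivative w.r.t. $b$ gives,

$$\frac{\partial \mathcal{L}}{\partial b} = -2\,\mathbf{1}^T W \big(\Gamma A - \mathbf{1}b^T + \dot{\Gamma}\big).$$

Setting $\frac{\partial \mathcal{L}}{\partial b} = 0$ yields that,

$$-\bar{w}^T \Gamma A + b^T = \bar{w}^T \dot{\Gamma}, \tag{38}$$

where $\bar{w} = \frac{W\mathbf{1}}{\mathbf{1}^T W \mathbf{1}}$. Vectorizing the LHS of 37 gives,

$$\big(I_d \otimes \Gamma^T W \Gamma + \Gamma^T W \Gamma \otimes I_d\big) D_d \operatorname{vech}(A) + \big(\Gamma^T W \mathbf{1} \otimes I_d - I_d \otimes \Gamma^T W \mathbf{1}\big) b \tag{39}$$

where $D_d$ is the duplication matrix transforming $\operatorname{vech}(A)$ to $\operatorname{vec}(A)$, with $\operatorname{vech}(A)$ denoting the half-vectorization of the anti-symmetric matrix $A$. Similarly, vectorizing the LHS of 38 yields,

$$-\frac{1}{\mathbf{1}^T W \mathbf{1}} \big(I_d \otimes \mathbf{1}^T W \Gamma\big) D_d \operatorname{vech}(A) + b. \tag{40}$$

Based on 39 and 40, we can define the following block matrix:

$$P = \begin{bmatrix} Q = \bar{Q} D_d & R = \Gamma^T W \mathbf{1} \otimes I_d - I_d \otimes \Gamma^T W \mathbf{1} \\ S = \bar{S} D_d & T = I_d \end{bmatrix} \tag{41}$$

where $\bar{Q} = I_d \otimes \Gamma^T W \Gamma + \Gamma^T W \Gamma \otimes I_d$, and $\bar{S} = -\frac{1}{\mathbf{1}^T W \mathbf{1}} (I_d \otimes \mathbf{1}^T W \Gamma)$. Then, let,

$$U = (Q - RT^{-1}S)^{-1} = L_d (\bar{Q} - R\bar{S})^{-1} \tag{42}$$

where $L_d$ is the matrix satisfying $D_d L_d = I_{d^2}$. Consequently,

$$P^{-1} = \begin{bmatrix} U & -UR \\ -\bar{S}(\bar{Q} - R\bar{S})^{-1} & I_d + \bar{S}(\bar{Q} - R\bar{S})^{-1} R \end{bmatrix} \tag{43}$$

and

$$\begin{bmatrix} \operatorname{vech}(A) \\ b \end{bmatrix} = P^{-1} \begin{bmatrix} \operatorname{vec}\big(\dot{\Gamma}^T W \Gamma - \Gamma^T W \dot{\Gamma}\big) \\ (\bar{w}^T \dot{\Gamma})^T \end{bmatrix}. \tag{44}$$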
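As a numerical sanity check of the first part of the lemma, the following sketch solves the same weighted problem, $\min \sum_i w_i \|A\gamma_i + b - \dot{\gamma}_i\|^2$ subject to $A = -A^T$, for $d = 3$. It does not implement the block-inverse formula above. Instead, it assumes the standard parametrization $A\gamma = \omega \times \gamma$ of a 3D anti-symmetric matrix and solves the resulting linear least-squares system with NumPy. The function name `fit_rigid_velocity` and the synthetic test data are illustrative assumptions, not part of the original method.

```python
import numpy as np

def fit_rigid_velocity(gamma, gamma_dot, w):
    """Weighted least-squares fit of u(x) = A x + b with A = -A^T (d = 3).

    gamma, gamma_dot: (n, 3) arrays of positions and velocities.
    w: (n,) array of positive weights.
    """
    sqrt_w = np.sqrt(w)
    rows, rhs = [], []
    for i in range(gamma.shape[0]):
        x, y, z = gamma[i]
        # A gamma = omega x gamma = M_i omega, with M_i = -[gamma_i]_x
        minus_cross = np.array([[0.0, z, -y],
                                [-z, 0.0, x],
                                [y, -x, 0.0]])
        rows.append(sqrt_w[i] * np.hstack([minus_cross, np.eye(3)]))
        rhs.append(sqrt_w[i] * gamma_dot[i])
    sol, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    omega, b = sol[:3], sol[3:]
    A = np.array([[0.0, -omega[2], omega[1]],
                  [omega[2], 0.0, -omega[0]],
                  [-omega[1], omega[0], 0.0]])
    return A, b

# Synthetic check: velocities generated by a known rigid field are recovered.
rng = np.random.default_rng(0)
gamma = rng.normal(size=(50, 3))
A_true = np.array([[0.0, -0.3, 0.2], [0.3, 0.0, -0.1], [-0.2, 0.1, 0.0]])
b_true = np.array([0.5, -1.0, 0.25])
gamma_dot = gamma @ A_true.T + b_true
A_fit, b_fit = fit_rigid_velocity(gamma, gamma_dot, rng.uniform(0.1, 1.0, size=50))
```

Because the data here are exactly rigid, the fit reproduces `A_true` and `b_true` up to floating-point error; with non-rigid velocities the residual of the fit plays the role of the matching error.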

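Now, for the second part of the lemma. Let $g_i = [\nabla \psi_t(x_i)]^T$, $s_i = \frac{\partial}{\partial t}\psi_t(x_i)$. Consider the following energy,

$$\mathcal{L} = \sum_{i=1}^{n} w_i \big(g_i^T (Ax_i + b) + s_i\big)^2. \tag{45}$$

Note that,

$$g_i^T A x_i = y_i^T a \tag{46}$$

where $a = \operatorname{vec}(A)$, and $y_i = x_i \otimes g_i$. Then,

$$\mathcal{L} = \sum_{i=1}^{n} w_i \big(a^T y_i y_i^T a + b^T g_i g_i^T b + 2a^T y_i g_i^T b + 2 s_i g_i^T b + 2 s_i a^T y_i + s_i^2\big). \tag{47}$$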

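Define the Lagrangian,

$$\mathcal{L}(a, b, \lambda) = a^T \Big(\sum_i y_i w_i y_i^T\Big) a + b^T G^T W G b + 2a^T \Big(\sum_i y_i w_i g_i^T\Big) b \; + \tag{48}$$

$$\qquad 2 s^T W G b + 2a^T \sum_i w_i y_i s_i + s^T W s + \lambda^T (a + Pa) \tag{49}$$

where $G = [g_1, \dots, g_n]^T$, $s = [s_1, \dots, s_n]^T$, and $P$ is the permutation matrix s.t. $\operatorname{vec}(A^T) = Pa$.

Then,

$$\frac{\partial \mathcal{L}}{\partial a} = 2\sum_i w_i y_i y_i^T a + 2\sum_i w_i y_i g_i^T b + 2\sum_i w_i y_i s_i + \lambda + P\lambda. \tag{50}$$

Equating the above to 0 and unvectorizing it yields the following matrix equation,

$$\sum_{i=1}^{n} w_i \big(g_i g_i^T A x_i x_i^T + g_i g_i^T b x_i^T + s_i g_i x_i^T\big) = -\frac{1}{2}(\Lambda + \Lambda^T), \tag{51}$$

yielding that the LHS is a symmetric matrix. Therefore,

$$\sum_{i=1}^{n} w_i \big(g_i g_i^T A x_i x_i^T + g_i g_i^T b x_i^T + s_i g_i x_i^T\big) = \sum_{i=1}^{n} w_i \big(x_i x_i^T A^T g_i g_i^T + x_i b^T g_i g_i^T + s_i x_i g_i^T\big). \tag{52}$$

Rearranging the above, using $A = -A^T$, and vectorizing both sides yields that,

$$\sum_{i=1}^{n} \Big[ w_i \big(x_i x_i^T \otimes g_i g_i^T + g_i g_i^T \otimes x_i x_i^T\big) D_d \operatorname{vech}(A) + w_i \big(x_i \otimes g_i g_i^T - g_i g_i^T \otimes x_i\big) b \Big] = \tag{53}$$

$$\sum_{i=1}^{n} w_i \operatorname{vec}\big(s_i (x_i g_i^T - g_i x_i^T)\big). \tag{54}$$

Similarly, equating the derivative of the Lagrangian w.r.t. $b$ to 0 yields that,

$$G^T W G b + \sum_i w_i g_i y_i^T D_d \operatorname{vech}(A) = -G^T W s. \tag{56}$$

Based on 53 and 56, we can define the following block matrix:

$$P = \begin{bmatrix} Q = \bar{Q} D_d & R = \sum_{i=1}^{n} w_i \big(x_i \otimes g_i g_i^T - g_i g_i^T \otimes x_i\big) \\ S = \bar{S} D_d & T = G^T W G \end{bmatrix}, \tag{57}$$

where $\bar{S} = \sum_{i=1}^{n} w_i g_i y_i^T$, and $\bar{Q} = \sum_{i=1}^{n} w_i \big(x_i x_i^T \otimes g_i g_i^T + g_i g_i^T \otimes x_i x_i^T\big)$. Then, let,

$$U = (Q - RT^{-1}S)^{-1} = L_d (\bar{Q} - RT^{-1}\bar{S})^{-1} \tag{58}$$

where $L_d$ is the matrix that satisfies $D_d L_d = I_{d^2}$. Consequently,

$$P^{-1} = \begin{bmatrix} U & -URT^{-1} \\ -T^{-1}\bar{S}(\bar{Q} - RT^{-1}\bar{S})^{-1} & T^{-1} + T^{-1}\bar{S}(\bar{Q} - RT^{-1}\bar{S})^{-1} RT^{-1} \end{bmatrix}. \tag{59}$$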
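The second part of the lemma can be checked numerically in the same spirit. The sketch below minimizes the energy $\sum_i w_i (g_i^T(Ax_i + b) + s_i)^2$ for $d = 2$, where every anti-symmetric matrix has the form $A = \theta J$ with $J$ the 90-degree rotation generator. This reduces the problem to an ordinary weighted least-squares fit in $(\theta, b)$, which is a different route from the block-inverse expression above. The function name and the random test data are assumptions made for illustration.

```python
import numpy as np

J = np.array([[0.0, -1.0], [1.0, 0.0]])  # basis of 2x2 anti-symmetric matrices

def fit_rigid_velocity_from_gradients(x, g, s, w):
    """Minimize sum_i w_i (g_i^T (A x_i + b) + s_i)^2 over A = theta * J and b.

    x: (n, 2) sample points, g: (n, 2) spatial gradients of psi_t,
    s: (n,) time derivatives of psi_t, w: (n,) positive weights.
    """
    sqrt_w = np.sqrt(w)
    col_theta = np.einsum("ij,jk,ik->i", g, J, x)        # g_i^T J x_i
    design = sqrt_w[:, None] * np.column_stack([col_theta, g])
    target = -sqrt_w * s
    sol, *_ = np.linalg.lstsq(design, target, rcond=None)
    return sol[0] * J, sol[1:]

# Synthetic check: s_i is generated so that a known rigid field has zero energy.
rng = np.random.default_rng(1)
x = rng.normal(size=(40, 2))
g = rng.normal(size=(40, 2))
A_true, b_true = 0.7 * J, np.array([0.2, -0.4])
s = -np.einsum("ij,ij->i", g, x @ A_true.T + b_true)
A_fit, b_fit = fit_rigid_velocity_from_gradients(x, g, s, np.ones(40))
```

Note that the fit only constrains the velocity component along each gradient $g_i$, so the system can become ill-conditioned when the gradients at the sampled points are nearly parallel.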

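8.1.2 CONTINUITY EQUATION CONSTRAINT DERIVATION FOR $V = \mathbb{R}^{n \times d}$

In the main text, we stated that in the case when $V = \mathbb{R}^{n \times d}$, i.e., $\psi_t = (\gamma^1_t, \dots, \gamma^n_t)^T$, equation 6 becomes:

$$\rho(u_t, \psi_t) = \sum_{i=1}^{n} \Big\| u_t(\gamma^i_t) - \frac{d}{dt}\gamma^i_t \Big\|^2. \tag{60}$$

To see this formally, let $\delta(x - a)$ denote the Dirac delta generalized function concentrated around $a$, satisfying

$$\delta(x - a) = 0, \quad \forall x \neq a, \tag{61}$$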

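$$\int \phi(x)\,\delta(x - a)\,dx = \phi(a), \tag{62}$$

for any test function $\phi$. Consider $\psi_t(x) = \sum_{i=1}^{n} \psi^i_t(x)$, where $\psi^i_t(x) = \delta(x - \gamma^i_t)$. Note that under this definition of $\psi_t$, $V$ is in fact the space of generalized functions. Then,

$$\frac{\partial}{\partial t}\psi^i_t = -\Big\langle \nabla\delta(x - \gamma^i_t), \frac{d}{dt}\gamma^i_t \Big\rangle, \tag{63}$$

and, using the chain rule as applied in the simplification of equation 10, we have that,

$$\operatorname{div}\big(\psi^i_t u\big)(x) = \big\langle \nabla\delta(x - \gamma^i_t), u(x) \big\rangle + \delta(x - \gamma^i_t)\operatorname{div}(u(x)). \tag{64}$$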

Substituting these computations in the continuity equation 4, yields that,

tψt(x) + div (ψt(x)vt (x)) dx (65)

tψt(x) + div (ψt(x)vt (x))dx = (66)

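$$\int \Big( -\Big\langle \nabla\delta(x - \gamma^i_t), \frac{d}{dt}\gamma^i_t \Big\rangle + \big\langle \nabla\delta(x - \gamma^i_t), u(x) \big\rangle + \delta(x - \gamma^i_t)\operatorname{div}(u(x)) \Big)\,dx = \tag{67}$$

$$\int \Big\langle \nabla\delta(x - \gamma^i_t), u(x) - \frac{d}{dt}\gamma^i_t \Big\rangle\,dx + \int \delta(x - \gamma^i_t)\operatorname{div}(u(x))\,dx. \tag{68}$$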

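Now, under the assumption that $\operatorname{div}(u) = 0$ almost everywhere, using 62 yields that the second term in the last equation vanishes. Therefore,

$$0 = \int \Big\langle \nabla\delta(x - \gamma^i_t), u(x) - \frac{d}{dt}\gamma^i_t \Big\rangle\,dx = \tag{69}$$

$$\int \delta(x - \gamma^i_t) \Big\langle \nabla\log\delta(x - \gamma^i_t), u(x) - \frac{d}{dt}\gamma^i_t \Big\rangle\,dx \tag{70}$$

$$\leq \int \delta(x - \gamma^i_t) \big\|\nabla\log\delta(x - \gamma^i_t)\big\| \, \Big\| u(x) - \frac{d}{dt}\gamma^i_t \Big\|\,dx, \tag{71}$$

where we applied the Cauchy-Schwarz inequality in the final step. Therefore,

$$\int \delta(x - \gamma^i_t) \big\|\nabla\log\delta(x - \gamma^i_t)\big\| \, \Big\| u(x) - \frac{d}{dt}\gamma^i_t \Big\|\,dx = 0. \tag{72}$$

Applying property 62 yields that equation 72 can hold only if, when $x = \gamma^i_t$, we have that,

$$u(x) - \frac{d}{dt}\gamma^i_t = 0. \tag{73}$$

Utilizing this constraint for each $i$, we can derive equation 9.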

8.2 ADDITIONAL IMPLEMENTATION DETAILS

8.2.1 ARCHITECTURE

We first describe the construction of the Gaussian Splatting dynamic image model referenced in section 5. An illustration of this model is presented in Figure 5. Figure 6 illustrates the applied losses. The time-invariant base of the model is optimized throughout training and consists of the following set of parameters $\theta = \{\mu_i, S_i, R_i, c_i, \alpha_i\}_{i=1}^{n}$ with Gaussian mean $\mu_i \in \mathbb{R}^3$, scaling $S_i \in \mathbb{R}^3$, rotation quaternion $R_i \in \mathbb{R}^4$, color $c_i \in \mathbb{R}^3$ and opacity $\alpha_i \in \mathbb{R}$. The covariance matrix $\Sigma_i$ is calculated during the rendering process from the temporally augmented scaling and rotation parameters.

The time-dependent deformation model maps the time-invariant Gaussian mean $\mu_i$ and a selected time $t$ to the deformation of the mean, scaling, and rotation, and to the model element $w$ in the case of the adaptive-combination prior.

We generate positional embeddings (Mildenhall et al., 2020) of the time and mean inputs, which we pass to the multilayer perceptrons of the deformation model.

$$\mathrm{Emb}_{\text{time}}(t): \mathbb{R} \to \mathbb{R}^{d_{\text{time emb}}}$$

Figure 5: Illustration of the architecture for $\Psi_t$ used in the experiments, based on (Yang et al., 2023). Reference Gaussian parameters are propagated to time $t$ through a shared function, $\psi^1_t$, implemented as an MLP with positional encoding features to compute time-varying point features of dimension $d_f$. These features are then processed by a second shared function, $\psi^2_t$, to generate time-varying Gaussian parameters. Finally, given a chosen viewing direction, the Gaussian Splatting rendering model is used to produce a rendered image.

Figure 6: Illustration of the losses applied for learning $\Psi_t$. The reconstruction loss, $L_{Rec}$, is applied on a rendered image to recover a corresponding given image of the scene. The ReMatching loss, $L_{RM}$, can be applied either on a rendered image or on time-varying Gaussian positions.

$$\mathrm{Emb}_{\text{mean}}(\mu_i): \mathbb{R}^3 \to \mathbb{R}^{d_{\text{mean emb}}}$$

The deformation model is made up of layers of the form:

$$\psi(n, d_{in}, d_{out}): X \mapsto \nu(XW + \mathbf{1}b^T)$$

where $\nu = \mathrm{Softplus}_\beta$, with $\beta = 100$.
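A single layer of this form can be sketched in NumPy as follows. The sketch assumes the common definition $\mathrm{Softplus}_\beta(x) = \frac{1}{\beta}\log(1 + e^{\beta x})$, which the text does not spell out, and the function names and random weights are illustrative only.

```python
import numpy as np

def softplus(x, beta=100.0):
    # (1 / beta) * log(1 + exp(beta * x)), evaluated stably with logaddexp
    return np.logaddexp(0.0, beta * x) / beta

def layer(X, W, b, beta=100.0):
    """One layer X -> nu(X W + 1 b^T).

    X: (n, d_in) batch of inputs, W: (d_in, d_out), b: (d_out,).
    """
    return softplus(X @ W + b[None, :], beta)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
W = rng.normal(size=(5, 3))
b = rng.normal(size=3)
out = layer(X, W, b)
```

With $\beta = 100$ the activation stays within $\ln(2)/100 \approx 0.007$ of a ReLU while remaining smooth, which keeps the derivatives of $\psi_t$ well defined.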

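For the deformation of the mean, scaling and rotation, the model takes the same form with minor differences in the final layer depending on the deforming parameter.

$$\mathrm{Emb}_{\text{time}}(t) \to \psi(n, d_{\text{time emb}}, 256) \to \psi(n, 256, d_\tau) \to \tau$$

$$[\tau, \mathrm{Emb}_{\text{mean}}(\mu)] \to \psi(n, d_\tau + d_{\text{mean emb}}, 256) \to \psi(n, 256, 256) \to \psi(n, 256, 256) \to \psi(n, 256, 256) \to [\tau, \mathrm{Emb}_{\text{mean}}(\mu), \psi(n, 256, 256)] \to \psi(n, d_\tau + d_{\text{mean emb}} + 256, 256) \to \psi(n, 256, 256) \to \psi(n, 256, 256) \to \omega$$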

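Mean: $\omega \to \psi(n, 256, 3) \to \Delta\mu(t)$
Scaling: $\omega \to \psi(n, 256, 3) \to \Delta S(t)$
Rotation: $\omega \to \psi(n, 256, 4) \to \Delta R(t)$

To predict $w$ we use a shallower multilayer perceptron.

$$[\tau, \mathrm{Emb}_{\text{mean}}(\mu + \Delta\mu(t)), \mathrm{Emb}_{\text{mean}}(\mu)] \to \psi(n, d_\tau + 2 d_{\text{mean emb}}, 256) \to \psi(n, 256, K) \to \mathrm{Softmax} \to w(t)$$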

8.2.2 HYPER-PARAMETERS AND TRAINING DETAILS

We set $d_{\text{mean emb}} = 63$, $d_{\text{time emb}} = 13$ and $d_\tau = 30$. For optimization we use an Adam optimizer with different learning rates for the network components, keeping the hyper-parameters of the baseline model (Yang et al., 2023).
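The embedding sizes above are consistent with a NeRF-style positional encoding that concatenates the raw input with sine and cosine features at several frequencies. The sketch below is one encoding that reproduces these sizes, assuming 10 frequency bands for the 3D mean and 6 for the scalar time. The band counts, the frequency scaling $2^k$, and the function name are assumptions inferred from the dimensions and are not stated in the text.

```python
import numpy as np

def positional_embedding(x, num_freqs):
    """NeRF-style encoding: [x, sin(2^k x), cos(2^k x)] for k = 0..num_freqs-1.

    x: (n, d) array. Returns an (n, d * (1 + 2 * num_freqs)) array.
    """
    feats = [x]
    for k in range(num_freqs):
        feats.append(np.sin((2.0 ** k) * x))
        feats.append(np.cos((2.0 ** k) * x))
    return np.concatenate(feats, axis=-1)

mean_emb = positional_embedding(np.zeros((5, 3)), num_freqs=10)
time_emb = positional_embedding(np.zeros((5, 1)), num_freqs=6)
print(mean_emb.shape, time_emb.shape)  # (5, 63) (5, 13)
```

The arithmetic is $3 \cdot (1 + 2 \cdot 10) = 63$ for the mean and $1 \cdot (1 + 2 \cdot 6) = 13$ for time.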

In the case of the adaptive-combination prior we select $k$ based on a hyper-parameter search between 1 and 35. The optimal value for most scenes ranges between 5 and 15, though the number also depends on the selected composition of priors. For example, a single volume-preserving class can supervise multiple moving objects, as opposed to a single rigid deformation class. We use the ReMatching loss weight $\lambda = 0.001$. When supplementing the ReMatching loss with an additional entropy loss, we use 0.0001 as the entropy loss weight.

In calculating the partial derivatives of $\psi_t$, we note that the input dimension for predicting $\psi_t$ is relatively small (1 for time or $d$ for spatial coordinates) compared to the output dimension, which can be $n \times d$ for spatial ReMatching or $H \times W$ for image-space ReMatching. Given this, forward-mode automatic differentiation is more efficient than backward-mode differentiation for this specific computation, in both compute and memory. Consequently, we use forward-mode autodiff to compute the partial derivatives of $\psi_t$ required for the ReMatching loss. Once the ReMatching loss is incorporated, we employ backward-mode autodiff to compute the gradient of the overall loss with respect to the model parameters.
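To illustrate why forward mode suits this setting, the following self-contained sketch implements forward-mode differentiation with dual numbers in pure Python. A single forward pass with the time input seeded by a unit tangent returns the time derivative of every output at once, regardless of how many outputs there are. The `Dual` class and the `toy_positions` function are hypothetical stand-ins for illustration and are not the paper's implementation, which would rely on an autodiff framework.

```python
import math

class Dual:
    """Number val + tan * eps with eps^2 = 0; `tan` carries the derivative."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__

    def __mul__(self, other):
        o = Dual.lift(other)
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)
    __rmul__ = __mul__

def dual_sin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.tan)

def toy_positions(t, n=4):
    """Hypothetical stand-in for psi_t: n scalar outputs from one time input."""
    return [dual_sin(t * float(i + 1)) + 0.5 * t for i in range(n)]

outputs = toy_positions(Dual(0.3, 1.0))   # seed the tangent dt/dt = 1
velocities = [o.tan for o in outputs]     # all n time derivatives from one pass
```

Backward mode would instead need one pass per output to obtain the same derivatives, which is why it is reserved for the scalar overall loss.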

8.2.3 REMATCHING RENDERED IMAGE

The reconstruction model architecture in the case of image ReMatching is the same as for the other experiments. At initialization we select a fixed viewpoint for the evaluation of the image-space loss, which is kept throughout training. At every iteration we sample a random time and evaluate the ReMatching loss from the fixed viewpoint.

For approximating equation 10, we calculate a sample by choosing points whose image value is close to 0 after applying the following transformation:

f(x) = 0.1 ln(1 x ) sign(x) (74)

on the image.

Next, we compute the image gradient using automatic differentiation and use our single class div-free solver to reconstruct the flow and calculate the loss.

8.3 ADDITIONAL EVALUATION

8.3.1 3D GEOMETRY-AWARE DEFORMABLE GAUSSIANS

We provide an additional set of experiments that show the versatility of our framework. Our evaluations in Tables 1 and 2 were made by applying our framework to a reconstruction model inspired by Deformable 3D Gaussians (Yang et al., 2023). In this evaluation, we apply the same framework to a baseline following the 3D Geometry-aware Deformable Gaussians (Lu et al., 2024). Table 3 shows our results compared to the baseline GA3D.

8.3.2 DYNAMIC SCENES DATASET

To further evaluate the efficacy of the ReMatching framework in practical applications, we consider the Dynamic Scenes dataset (Yoon et al., 2020), which captures forward-facing views of real-world scenes exhibiting complex dynamics. We selected 4 scenes from the Human, Interaction, and Vehicle categories, consisting of monocular videos with approximately 80-180 frames for training and an additional 20 frames reserved for testing. Figure 8 shows a qualitative comparison to the D3G (Yang et al., 2023) baseline, highlighting patterns of improvement similar to those observed in earlier experiments. Specifically, our approach better preserves fine details, such as the truck's front lights (Truck) and the bottom teeth (Dynamic Face). Additionally, it reduces reconstruction motion artifacts, as evident in the humans in motion (Jumping) and the legs and head of the dinosaur (Balloon). Table 4 presents a quantitative evaluation, comparing two variants from our prior classes, PIII and PIV, to D3G. These results correlate with the qualitative improvements discussed above.

Figure 7: Qualitative comparison of the ReMatching loss applied in the image space. Each group of 3 shows Ground-Truth (left), Ours (center), and D3G (right).

8.4 RECONSTRUCTION FLOW EVALUATION

In this section, we evaluate the ability of the ReMatching framework to recover the underlying reconstruction flow $\phi_t$. Since the reconstruction flow is generally unknown, we use the following simple flow to generate training data:

$$\phi_t(x) = R(t)x \tag{75}$$

where $R^T(t)R(t) = I_d$. We evaluate our framework in its two settings: i) equation 9, corresponding to $V = \mathbb{R}^{n \times d}$; and, ii) equation 10, corresponding to $V = C^1(\mathbb{R}^d)$.
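For a flow of this form, the velocity of a point $y = \phi_t(x)$ is $\frac{\partial}{\partial t}\phi_t(x) = \dot{R}(t)R^T(t)\,y$, and the matrix $\dot{R}R^T$ is anti-symmetric, which is what makes a rigid-motion prior class a natural fit here. The sketch below checks this numerically with central differences. The specific rotation about the z-axis, the step size, and the helper names are illustrative assumptions.

```python
import numpy as np

def rotation_z(t):
    """Rotation about the z-axis by angle 2*pi*t (an illustrative choice of R(t))."""
    c, s = np.cos(2 * np.pi * t), np.sin(2 * np.pi * t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def velocity_generator(R, t, h=1e-5):
    """Central-difference estimate of A(t) = dR/dt @ R(t)^T."""
    dR = (R(t + h) - R(t - h)) / (2.0 * h)
    return dR @ R(t).T

t0, h = 0.3, 1e-5
A = velocity_generator(rotation_z, t0, h)
x0 = np.array([1.0, 2.0, 3.0])
y = rotation_z(t0) @ x0                                   # point on the trajectory
v_fd = (rotation_z(t0 + h) @ x0 - rotation_z(t0 - h) @ x0) / (2.0 * h)
```

Here `A @ y` matches the finite-difference velocity `v_fd`, and `A` is anti-symmetric up to discretization error, so the ground-truth velocity field has the form $u(y) = Ay$ with $A = -A^T$.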

The $V = \mathbb{R}^{n \times d}$ case. We evaluate these settings for $d = 3$, following a similar approach to the dynamic image model based on Gaussian Splatting described in Section 5. To construct the training set, we use a reference scene of a colored box and apply the ground-truth flow $\phi_t$ to create a dynamic scene consisting of multi-view captures over 12 time stamps. For the flow $\phi_t$, we set

$$R(t) = \begin{bmatrix} \cos 2\pi t & -\sin 2\pi t & 0 \\ \sin 2\pi t & \cos 2\pi t & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

Figure 9 shows selected images from the training set. Two training procedures are considered: (i) a baseline approach using only the reconstruction loss, similar to the D3G model; and (ii) our approach, where both the reconstruction loss and the ReMatching loss are optimized. For the ReMatching loss, we employ the global rigid motion prior class PII. Figure 10 compares ground-truth images from time-stamps unseen during training with the model's predicted renderings. Notably, the ReMatching loss allows the model to generalize in alignment with the ground-truth flow $\phi_t$. This is a result of the ReMatching objective's ability to converge to matched priors $u_t$ that accurately recover the ground-truth velocities $\frac{\partial}{\partial t}\phi_t$. To further support this claim, Figure 11 illustrates the velocities of the dynamic Gaussian centers, $\{\frac{d}{dt}\mu_i(t)\}$. These results demonstrate that the ReMatching loss effectively controls $\{\frac{d}{dt}\mu_i(t)\}$, resulting in solutions $\{\frac{d}{dt}\mu_i(t)\}$ that match the prior class PII and recover the ground-truth velocities $\frac{\partial}{\partial t}\phi_t$.

| Method | Bouncing Balls | | | Hell Warrior | | | Hook | | | Jumping Jacks | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | LPIPS | PSNR | SSIM | LPIPS | PSNR | SSIM | LPIPS | PSNR | SSIM | LPIPS | PSNR | SSIM |
| GA3D (Lu et al., 2024) | 0.0063 | 43.42 | 0.9960 | 0.0175 | 32.02 | 0.9827 | 0.0076 | 36.88 | 0.9915 | 0.0092 | 37.83 | 0.9924 |
| Ours (PIII) | 0.0061 | 43.58 | 0.9961 | 0.0165 | 32.15 | 0.9834 | 0.0075 | 36.92 | 0.9914 | 0.0093 | 37.84 | 0.9927 |
| Ours (PIV or PV) | 0.0058 | 43.62 | 0.9962 | 0.0163 | 32.21 | 0.9835 | 0.0077 | 37.01 | 0.9916 | 0.0078 | 38.18 | 0.9931 |

| Method | Lego | | | Mutant | | | Stand Up | | | T-Rex | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | LPIPS | PSNR | SSIM | LPIPS | PSNR | SSIM | LPIPS | PSNR | SSIM | LPIPS | PSNR | SSIM |
| GA3D (Lu et al., 2024) | 0.0328 | 25.41 | 0.9471 | 0.0029 | 41.56 | 0.9969 | 0.0028 | 42.40 | 0.9967 | 0.0055 | 38.65 | 0.9950 |
| Ours (PIII) | 0.0316 | 25.53 | 0.9489 | 0.0029 | 41.41 | 0.9968 | 0.0028 | 42.24 | 0.9967 | 0.0052 | 39.12 | 0.9952 |
| Ours (PIV or PV) | 0.0320 | 25.50 | 0.9487 | 0.0028 | 41.65 | 0.9970 | 0.0027 | 42.44 | 0.9968 | 0.0049 | 39.18 | 0.9953 |

Table 3: Image quality evaluation on unseen frames for the D-NeRF dataset with our framework applied on top of the 3D Geometry-aware Deformable Gaussians (Lu et al., 2024). Image resolution is scaled down to 400x400 pixels and the background is white, maintaining the settings of the GA3D paper evaluation.

| Method | Balloon | | | Truck | | | Jumping | | | Dynamic Face | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | LPIPS | PSNR | SSIM | LPIPS | PSNR | SSIM | LPIPS | PSNR | SSIM | LPIPS | PSNR | SSIM |
| D3G (Yang et al., 2023) | 0.1584 | 26.79 | 0.9349 | 0.2922 | 26.01 | 0.9046 | 0.2726 | 23.12 | 0.8958 | 0.0806 | 29.22 | 0.9756 |
| Ours (PIII) | 0.1592 | 26.96 | 0.9348 | 0.2782 | 25.49 | 0.9071 | 0.2720 | 22.89 | 0.8971 | 0.0794 | 29.30 | 0.9761 |
| Ours (PIV) | 0.1578 | 26.95 | 0.9356 | 0.2533 | 26.66 | 0.9197 | 0.2501 | 23.63 | 0.9037 | 0.0793 | 29.23 | 0.9754 |

Table 4: Unseen frames evaluation for the Dynamic Scenes dataset (Yoon et al., 2020).

Figure 8: Qualitative comparison of our method to D3G (Yang et al., 2023) on the Dynamic Scenes dataset (Yoon et al., 2020).

Figure 9: Training frames from the rotating colored box scene.

Figure 10: Evaluation of unseen timestamps in the rotating colored box scene.

Figure 11: Visualization of $\{\mu_i(t), \frac{d}{dt}\mu_i(t)\}$ from multiple timestamps.

Figure 12: Visualization of the training set made up of three functions $\psi_{GT}(\cdot, t_i)$.

The $V = C^1(\mathbb{R}^d)$ case. We evaluate these settings for $d = 2$. Using the flow described above, we define the following ground-truth scalar function, $\psi_{GT}: \mathbb{R}^2 \to \mathbb{R}$, as:

$$\psi_{GT}(x, t) = \big\|\phi_t^{-1}(S^{-1}x)\big\| - b. \tag{76}$$

The training data is constructed using three time-stamps, specifically $t \in \{0.0, 0.25, 0.5\}$. The parameter choices for this procedure are: $b = 0.2$, $S = \operatorname{diag}\{0.6, 1.4\}$, and $R(t) = \begin{bmatrix} \cos \pi t & -\sin \pi t \\ \sin \pi t & \cos \pi t \end{bmatrix}$.

Figure 12 visualizes the three distinct functions $\psi_{GT}(\cdot, t)$ that constitute the training set. To model $\psi_t$, we employ a multi-layer perceptron (MLP) architecture, as described in Section 8.2.1, with the only modification being a scalar output dimension in the final layer. For the reconstruction loss, we adopt the standard L1 loss:

$$L_{REC} = \sum_{i=1}^{3} \mathbb{E}\,\big|\psi(x, t_i) - \psi_{GT}(x, t_i)\big|. \tag{77}$$

For the ReMatching loss, we employ the global rigid motion prior class PII. Two training procedures are considered: (1) a baseline approach where only the reconstruction loss is used as the optimization objective, and (2) our suggested approach where both the reconstruction loss and the ReMatching loss are optimized. Figure 13 displays the results of the trained models for times $t \in \{0, 0.0625, 0.125, 0.1875, 0.25, 0.3125, 0.375, 0.4375\}$. Among these, only $t \in \{0, 0.25, 0.5\}$, shown in the leftmost column, correspond to the training set frames. While both the baseline and our approach perform similarly on the training frames, the results for unseen frames clearly demonstrate the benefits of incorporating the ReMatching loss. Specifically, the ReMatching loss allows the model to recover the ground-truth flow $\phi_t$, avoiding the unrealistic distortions observed in the baseline results. To further illustrate this, Figure 14 depicts the matched priors $u_t$ (white arrows) obtained by solving equation 5, alongside the velocity field of the ground-truth flow, $\frac{\partial}{\partial t}\phi_t$ (green arrows). These comparisons show that the ReMatching loss converges to matched priors $u_t$ that closely approximate the ground truth. In contrast, the matched priors $u_t$ obtained with the baseline approach (without the ReMatching loss) deviate significantly from the ground truth. This emphasizes the importance of the reprojection procedure. Not all velocity fields in the prior class are suitable for guiding the optimization process, but the ReMatching loss ensures convergence to an appropriate prior, enabling an accurate recovery of the underlying flow.

8.5 RUNTIME AND CONVERGENCE ANALYSIS

We note that our framework is applied solely during the training phase of the algorithm, leaving inference times unaffected. To evaluate computational efficiency, we measured the average time (seconds) for a forward and backward pass over 100 iterations for varying sizes $n$ of Gaussian sets.

Figure 13: Comparisons of ψt converged solutions between the baseline and our approach, displayed in order of increasing time (left to right, top to bottom).

Figure 15 presents the results, comparing the computation time of the ReMatching framework to the D3G baseline. The runtime analysis was conducted on a single NVIDIA RTX A6000.

To examine the convergence of the reconstruction model, we compare the loss convergence curves of the D3G (Yang et al., 2023) model and our model. Figure 16 shows that the addition of the ReMatching loss does not affect the convergence behavior of the optimization. We also show the loss curve of the ReMatching loss itself in Figure 17. It is important to note that for the ReMatching formulation, the optimal solution does not necessarily achieve 0 loss, similarly to the APM procedure (Deutsch, 1992). Instead, it achieves the lowest loss possible given the reconstruction task and selected prior.

8.6 ABLATION OF FRAMEWORK HYPERPARAMETERS

Figure 14: Comparisons of the converged $u_t$ (white arrows) between the baseline and our approach that uses the ReMatching loss. The ground-truth velocities, $\frac{\partial}{\partial t}\phi_t$, are shown as green arrows. When the matched $u_t$ aligns with the ground truth, the green arrows become indistinguishable.

Figure 15: Average time (seconds) for a combined forward and backward pass, for varying set sizes n.

In this section, we present an ablation study on the key hyperparameters introduced by the Re Matching framework: i) the weight of the Re Matching loss, λ, as defined in equation 8; ii) the maximum number of parts, k, for the adaptive prior class; and iii) the weight of the entropy loss (equation 29), used to optimize the learned part assignments when employing an adaptive prior class.

Figure 16: Loss curves of our model and D3G (Yang et al., 2023) over 40k training iterations.

Figure 17: Loss curve of the Re Matching loss, shown as a running average with a window size of 20.

Re Matching loss weight. We first note that a single value of λ = 0.001 was used across all scenes in the experiments of section 6, which already indicates the robustness of this parameter. To test this further, we conducted experiments on the Hell Warrior and Lego scenes from the D-NeRF dataset, evaluating how different λ values influence solution quality. Figure 18 shows these findings. For the Hell Warrior scene we employed the PIV prior class, while the Lego scene used the PV class. The results indicate stable improvement within the range λ ∈ [5e-4, 5e-3], while small values, λ ≤ 5e-5, align with the baseline. Larger values, λ ≥ 1e-2, may compete with the reconstruction loss, leading to suboptimal solutions.

Figure 18: Effect of the weight parameter λ on the PSNR evaluation metric for the Hell Warrior scene (left) and the Lego scene (right).
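The ablations in this section are reported in terms of PSNR. For reference, a minimal sketch of the standard PSNR definition is given below, assuming images flattened to sequences of intensities in [0, 1]. The function name `psnr` and the `max_value` parameter are introduced for illustration and do not refer to the evaluation code used for the reported numbers.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both inputs are flat sequences of pixel intensities; max_value is
    the largest possible intensity (1.0 for normalized images).
    """
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gain of 1 dB corresponds to a reduction of the mean squared error by roughly 21%.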

Maximum number of parts selection. To assess the impact of the hyperparameter k, we selected the Mutant scene, which aligns with the adaptive prior class PIV, and the Lego scene, which corresponds to the adaptive prior class PV, and evaluated how varying k affects solution quality. Figure 19 presents the results of this analysis. The findings suggest relatively stable performance within the range k = 5 to k = 15, which leaves flexibility to choose k from prior knowledge about the expected number of moving parts in the scene. As a special case, for k = 1 the adaptive prior PV corresponds to PI. Training with Re Matching using this prior on the three selected scenes gives the following quantitative results: Lego (PSNR: 24.94, SSIM: 0.9536, LPIPS: 0.0452), Hell Warrior (PSNR: 40.77, SSIM: 0.9917, LPIPS: 0.0285), Mutant (PSNR: 41.88, SSIM: 0.9965, LPIPS: 0.0069).

Entropy loss weight. Similar to the λ hyperparameter, the entropy loss weight was kept fixed across all experiments in section 6. To further examine its impact, we evaluated its influence on performance with varying weight values for the Hell Warrior scene. Figure 20 presents the results of this experiment, demonstrating stable performance within the range [1e-4, 1e-3].
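As an illustration of what an entropy term on part assignments can look like, the sketch below computes the mean Shannon entropy of per-point soft assignments over k parts. Equation 29 is not reproduced in this section, so the exact form used in the framework may differ. The function name `assignment_entropy` and the `eps` stabilizer are assumptions.

```python
import math


def assignment_entropy(weights, eps=1e-12):
    """Mean Shannon entropy of soft part assignments.

    `weights` is a list of rows, one per point, each holding k
    non-negative values that sum to 1. Assumed form only; it is not
    taken from equation 29.
    """
    total = 0.0
    for row in weights:
        # eps avoids log(0) for hard (one-hot) assignments
        total += -sum(w * math.log(w + eps) for w in row)
    return total / len(weights)
```

Under this assumed form, the term is close to 0 when every point is assigned to a single part and reaches log k when assignments are uniform, so minimizing it pushes the learned assignments towards hard ones.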

8.7 ADDITIONAL QUALITATIVE EVALUATION

To further support the qualitative results presented in Figures 2 and 3, the supplementary material includes additional evidence showcasing novel-view video reconstruction results. These comparisons highlight the performance of our model relative to baseline approaches, providing a more comprehensive validation of its efficacy.

Figure 19: Impact of varying k values on the PSNR evaluation metric for the Mutant scene (left) and the Lego scene (right).

Figure 20: Impact of varying entropy loss weights on the PSNR evaluation metric for the Hell Warrior scene.