# Reinforcement Learning with Neural Radiance Fields

Danny Driess (TU Berlin), Ingmar Schubert (TU Berlin), Pete Florence (Google), Yunzhu Li (MIT), Marc Toussaint (TU Berlin)

Equal contribution. Correspondence: danny.driess@gmail.com

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

**Abstract.** It is a long-standing problem to find effective representations for training reinforcement learning (RL) agents. This paper demonstrates that learning state representations with supervision from Neural Radiance Fields (NeRFs) can improve the performance of RL compared to other learned representations or even low-dimensional, hand-engineered state information. Specifically, we propose to train an encoder that maps multiple image observations to a latent space describing the objects in the scene. The decoder built from a latent-conditioned NeRF serves as the supervision signal to learn the latent space. An RL algorithm then operates on the learned latent space as its state representation. We call this NeRF-RL. Our experiments indicate that NeRF supervision leads to a latent space better suited for downstream RL tasks involving robotic object manipulation, such as hanging mugs on hooks, pushing objects, or opening doors. Video: https://dannydriess.github.io/nerf-rl

## 1 Introduction

The sample efficiency of reinforcement learning (RL) algorithms crucially depends on the representation of the underlying system state they operate on [1, 2, 3, 4, 5, 6, 7]. Sometimes, a low-dimensional (direct) representation of the state, such as the positions of the objects in the environment, is considered to make the resulting RL problem most efficient [2]. However, such low-dimensional, direct state representations can have several disadvantages. On the one hand, a perception module, e.g., pose estimation, is necessary in the real world to obtain the representation from raw observations, which is often difficult to achieve in practice with sufficient robustness. On the other hand, if the goal is to learn policies that generalize over different object shapes [8], using a low-dimensional state representation is often impractical. Such scenarios, while challenging for RL, are common, e.g., in robotic manipulation tasks.

Therefore, there is a long history of approaches that consider RL directly from raw, high-dimensional observations like images (e.g., [9, 10]). Typically, an encoder takes the high-dimensional input and maps it to a low-dimensional latent representation of the state. The RL algorithm (e.g., the Q-function or the policy network) then operates on the latent vector as state input. This way, no separate perception module is necessary, the framework can extract information from the raw observations that is relevant to the task, and the RL agent, in principle, may generalize over challenging environments in which, e.g., object shapes are varied. While these are advantages in principle, jointly training encoders capable of processing high-dimensional inputs from the RL signal alone is challenging. To address this, one approach is to pretrain the encoder on a different task, e.g., image reconstruction [1, 4, 11], multi-view consistency [6], or a time-contrastive task [3]. Alternatively, an auxiliary loss on the latent encoding can be added during the RL procedure [5]. In both cases, the choice of the actual (auto-)encoder architecture and associated (auxiliary) loss function has a significant influence on the usefulness of the resulting latent space for the downstream RL task.
Especially for image data, convolutional neural networks (CNNs) are commonly used for the encoder [12]. However, 2D CNNs have a 2D (equivariance) bias, while for many RL tasks, the 3D structure of our world is essential. Architectures like Vision Transformers [13, 14] process images with no such direct 2D bias, but they often require large-scale data, which might be challenging in RL applications. Additionally, although multiple uncalibrated 2D image inputs can be used with generic image encoders [15], they do not benefit from 3D inductive biases, which may help, for example, in resolving ambiguities in 2D images such as occlusions and object permanence.

Recently, Neural Radiance Fields (NeRFs) [16] have shown great success in learning to represent scenes with a neural network that enables rendering the scene from novel viewpoints, and have sparked broad interest in computer vision [17]. NeRFs exhibit a strong 3D inductive bias, leading to better scene reconstruction capabilities than methods composed of generic image encoders (e.g., [18]). In the present work, we investigate whether incorporating these 3D inductive biases of NeRFs into learning a state representation can benefit RL. Specifically, we propose to train an encoder that maps multiple RGB image views of the scene to a latent representation through an auto-encoder structure, where a (compositional) NeRF decoder provides the self-supervision signal using an image reconstruction loss for each view. In the experiments, we show for multiple environments that supervision from NeRF leads to a latent representation that makes the downstream RL procedure more sample efficient compared to supervision via a 2D CNN decoder, a contrastive loss on the latent space, or even hand-engineered, perfect low-level state information given as keypoints. Whereas RL is commonly trained on environments where the objects have the same shape, our environments include hanging mugs on hooks, pushing objects on a table, and a door-opening scenario. In all of these, the object shapes are not fixed, and we require the agent to generalize over all shapes from a distribution.

To summarize our main contributions: (i) we propose to train state representations for RL with NeRF supervision, and (ii) we empirically demonstrate that an encoder trained with a latent-conditioned NeRF decoder, especially with an object-compositional NeRF decoder, leads to increased RL performance relative to standard 2D CNN auto-encoders, contrastive learning, or expert keypoints.

## 2 Related Work

**Neural Scene/Object Representations in Computer Vision, and Applications.** To our knowledge, the present work is the first to explore whether neural scene representations like NeRFs can benefit RL. Outside of RL, however, there has been very active research in neural scene representations, both in the representations themselves [19, 20, 21, 22] and their applications; see [23, 24, 17] for recent reviews. Within the family of NeRFs and related methods, major thrusts of research have included improving modeling formulations [25, 26], modeling larger scenes [26, 27], addressing (re-)lighting [28, 29, 30], and, in an especially active area, improving the speed of both training and inference-time rendering [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41].
In our case, we are not constrained by inference-time computation issues, since we do not need to render images and only have to run our latent-space encoder (with a runtime of approx. 7 ms on an RTX 3090). Of particular relevance, various methods have developed latent-conditioned [42, 43, 44] or compositional/object-oriented [45, 46, 47, 48, 49, 50, 51, 52, 53] approaches for NeRFs, although neither they nor, to our knowledge, any other NeRF-style methods have been applied to RL. Neural scene representations have found application across many fields (e.g., augmented reality and medical imaging [54]), and both NeRFs [55, 56, 57, 58] and other neural scene approaches [59, 60, 61, 62] have started to be used for various problems in robotics, including pose estimation [55], trajectory planning [56], visual foresight [11, 53], grasping [59, 57], and rearrangement tasks [60, 61, 58].

**Learning State Representations for Reinforcement Learning.** One of the key enabling factors for the success of deep RL is its ability to find effective representations of the environment from high-dimensional observation data [10, 63]. Extensive research has gone into investigating different ways to learn better state representations using various auxiliary objective functions. Contrastive learning is a common objective and has shown success in unsupervised representation learning in computer vision applications [64, 65]. Researchers have built on this success, showing that such learning objectives can lead to better performance and sample efficiency in deep RL [66, 67], where the contrastive signal can come from time alignment [68, 3], camera viewpoints [69], and different sensory modalities [70], with applications in real-world robotic tasks [6, 71]. Extensive efforts have investigated the role of representation learning in RL [72], provided a detailed analysis of the importance of different visual representation pretraining methods [73], and shown how to improve training stability in the face of multiple auxiliary losses [74]. There is also a range of additional explorations on pretraining methods with novel objective functions (e.g., bisimulation metrics [75] and temporal cycle-consistency loss [76]) and less-explored data sources (e.g., in-the-wild images [77] and action-free videos [78]). See the survey [79] for more related work in this direction. Our method is different in that we explicitly utilize a decoder that includes strong 3D inductive biases provided by NeRFs, which we empirically show improves RL for tasks that depend on the geometry of the objects.

## 3 Background

### 3.1 Reinforcement Learning

This work considers decision problems that can be described as discrete-time Markov Decision Processes (MDPs) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, \gamma, R, P_0)$. $\mathcal{S}$ and $\mathcal{A}$ are the sets of all states and actions, respectively. The transition probability (density) from $s$ to $s'$ using an action $a$ is $T(s' \mid s, a)$. The agent receives a real-valued reward $R(s, a, s')$ after each step. The discount factor $\gamma \in [0, 1)$ trades off immediate and future rewards. $P_0 : \mathcal{S} \to \mathbb{R}^+_0$ is the distribution of the start state. RL algorithms try to find the optimal policy $\pi^* : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^+_0$, where

$$\pi^* = \operatorname*{argmax}_{\pi} \sum_{t=0}^{\infty} \gamma^t \, \mathbb{E}_{\,s_{t+1} \sim T(\cdot \mid s_t, a_t),\; a_t \sim \pi(\cdot \mid s_t),\; s_0 \sim P_0} \left[ R(s_t, a_t, s_{t+1}) \right].$$
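To make this objective concrete, the following minimal sketch (plain Python; the helper name is our own) computes the discounted return of a single rollout, i.e., the quantity whose expectation the optimal policy maximizes:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * R(s_t, a_t, s_{t+1}) for one rollout."""
    g, total = 1.0, 0.0
    for r in rewards:
        total += g * r
        g *= gamma
    return total

# A sparse-reward rollout that only succeeds at the last step:
print(discounted_return([0, 0, 0, 1]))  # 0.99**3 ~= 0.970
```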
Importantly, in this work, we consider RL problems where the state $s$ encodes both the position and the shape of the objects in the scene. We require the RL agent to generalize over all of these shapes at test time. We can therefore think of the state as a tuple $s = (s_p, s_s)$, where $s_p$ encodes positional information and $s_s$ encodes the shapes involved. We focus the experiments on sparse reward settings, meaning $R(s, a, s') = R_0 > 0$ for $s' \in \mathcal{S}_g$ and $R(s, a, s') = 0$ for $s' \in \mathcal{S} \setminus \mathcal{S}_g$, where the volume of $\mathcal{S}_g \subset \mathcal{S}$ is much smaller than the volume of $\mathcal{S}$.

The state space $\mathcal{S}$ is usually low-dimensional or a minimal description of the degrees of freedom of the system. In this work, we consider that the RL algorithm has access only to a (high-dimensional) observation $y \in \mathcal{Y}$ of the scene (e.g., RGB images). In particular, this means that the policy takes observations as input, $a \sim \pi(\cdot \mid y)$. Since we assume that the underlying state $s = (s_p, s_s)$ is fully observable from $y$, we can treat $y$ like a state for an MDP.

**Reinforcement Learning with Learned Latent Scene Representations.** The general idea of RL with learned latent scene representations is to learn an encoder $\Omega$ that maps an observation $y \in \mathcal{Y}$ to a $k$-dimensional latent vector $z = \Omega(y) \in \mathcal{Z} \subseteq \mathbb{R}^k$ of the scene. The actual RL components, e.g., the Q-function or the policy network, then operate on $z$ as their state description. For a policy $\pi$, this means that the action $a \sim \pi(\cdot \mid z) = \pi(\cdot \mid \Omega(y))$ is conditioned on the latent vector $z$ instead of the observation $y$ directly. The dimension $k$ of the latent vector is typically (much) smaller than that of the observation space $\mathcal{Y}$, but larger than that of the state space $\mathcal{S}$.

### 3.2 Neural Radiance Fields (NeRFs)

The general idea of NeRF, originally proposed by [16], is to learn a function $f = (\sigma, c)$ that predicts the emitted RGB color value $c(x) \in \mathbb{R}^3$ and volume density $\sigma(x) \in \mathbb{R}_{\geq 0}$ at any 3D world coordinate $x \in \mathbb{R}^3$. Based on $f$, an image from an arbitrary view and camera parameters can be rendered by computing the color $C(r) \in \mathbb{R}^3$ of each pixel along its corresponding camera ray $r(\alpha) = r(0) + \alpha d$ through the volumetric rendering relation

$$C(r) = \int_{\alpha_n}^{\alpha_f} T_f(r, \alpha)\, \sigma(r(\alpha))\, c(r(\alpha))\, d\alpha \quad \text{with} \quad T_f(r, \alpha) = \exp\!\left( -\int_{\alpha_n}^{\alpha} \sigma(r(u))\, du \right). \tag{1}$$

Here, $r(0) \in \mathbb{R}^3$ is the camera origin, $d \in \mathbb{R}^3$ the pixel-dependent direction of the ray, and $\alpha_n, \alpha_f \in \mathbb{R}$ the near and far bounds within which objects are expected, respectively. The camera rays are determined from the camera matrix $K$ (intrinsics and extrinsics) describing the desired view.
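In practice, the integral in Eq. (1) is approximated by sampling points along each ray and applying the standard NeRF quadrature rule [16]. The sketch below (NumPy; variable names are our own) renders a single pixel from density and color samples along one ray:

```python
import numpy as np

def render_pixel(sigmas, colors, alphas):
    """Quadrature approximation of Eq. (1) for one ray.

    sigmas: (N,) densities sigma(r(alpha_i)); colors: (N, 3) RGB values
    c(r(alpha_i)); alphas: (N,) sample depths between the near/far bounds.
    """
    deltas = np.diff(alphas, append=alphas[-1] + 1e10)  # distances between samples
    absorb = np.exp(-sigmas * deltas)                   # exp(-sigma_i * delta_i)
    trans = np.cumprod(np.concatenate([[1.0], absorb]))[:-1]  # T_i up to sample i
    weights = trans * (1.0 - absorb)                    # T_i * (1 - exp(-sigma_i delta_i))
    return (weights[:, None] * colors).sum(axis=0)      # predicted pixel color C(r)
```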
## 4 Learning State Representations for RL with NeRF Supervision

This section describes our proposed framework, in which we use a latent state space for RL that is learned from NeRF supervision. For learning the latent space, we use an encoder-decoder structure where the decoder is a latent-conditioned NeRF, which may be either a global [42, 43, 44] or a compositional [53] NeRF decoder. To our knowledge, no prior work has used such NeRF-derived supervision for RL. Sec. 4.1 describes this proposition, Sec. 4.2 provides an overview of the encoder-decoder training, and Secs. 4.3 and 4.4 introduce options for the NeRF decoder and encoder, respectively.

*Figure 1: State representation learning for RL with NeRFs. First, the encoder and NeRF decoder are trained with supervision from a multi-view reconstruction loss on an offline dataset. Then, the encoder's weights are frozen, and the latent space is used as state input to train a policy with RL. Masks of individual objects are only required for the compositional variant of our encoder.*

### 4.1 Using Latent-Conditioned NeRF for RL

We propose the state representation $z$ on which an RL algorithm operates to be a latent vector produced by an encoder that maps images from multiple views to a latent $z$, trained with a (compositional) latent-conditioned NeRF decoder. As will be verified in the experiments, we hypothesize that this framework is beneficial for the downstream RL task, as it produces latent vectors that represent the actual 3D geometry of the objects in the scene, can handle multiple objects well, and fuses multiple views consistently to deal with occlusions by providing shape completion, all of which is relevant for solving tasks where geometry is important. There are two steps to our framework, as shown in Fig. 1. First, we train the encoder and decoder from a dataset collected by random interactions with the environment, i.e., we do not yet need a trained policy. Second, we take the encoder trained in the first step, which we leave frozen, and use the latent space to train an RL policy. Note that we investigate two variants of the auto-encoder framework: a global one, where the whole scene is represented by a single latent vector, and a compositional one, where each object is represented by its own latent vector. For the latter, objects are identified by masks in the views.

### 4.2 Overview: Auto-Encoder with Latent-Conditioned NeRF Decoder

Assume that an observation $y = (I^{1:V}, K^{1:V}, M^{1:V})$ of the scene consists of RGB images $I^i \in \mathbb{R}^{3 \times h \times w}$, $i = 1, \ldots, V$, taken from $V$ camera views, their respective camera projection matrices $K^i \in \mathbb{R}^{3 \times 4}$ (including both intrinsics and extrinsics), and per-view image masks $M^{1:V}$. For a global NeRF decoder, these are global non-background masks $M^i_{\text{tot}} \in \{0, 1\}^{h \times w}$; for a compositional NeRF decoder as in [53], these are sets of binary masks $M^i_j \in \{0, 1\}^{h \times w}$ that identify the objects $j = 1, \ldots, m$ in the scene in view $i$. The global case is equivalent to $m = 1$, $M^i_{j=1} = M^i_{\text{tot}}$. The encoder $\Omega$ maps these posed image observations from the multiple views into a set of latent vectors $z_{1:m}$, where in the compositional case each $z_j$ represents one object in the scene separately, and in the global case the single $z_1$ represents all objects. This is achieved by querying $\Omega$ on the masks $M^{1:V}_j$, i.e.,

$$z_j = \Omega\left(I^{1:V}, K^{1:V}, M^{1:V}_j\right) \in \mathbb{R}^k \tag{2}$$

for object $j$. The supervision signal to train the encoder is the image reconstruction loss

$$\mathcal{L}_i = \left\| I^i \odot M^i_{\text{tot}} - D\!\left(\Omega\left(I^{1:V}, K^{1:V}, M^{1:V}_{1:m}\right), K^i\right) \right\|_2^2 \tag{3}$$

on the input view $i$, where the decoder $D$ renders an image $\tilde{I} = D(z_{1:m}, K)$ for arbitrary views specified by the camera matrix $K$ from the set of latent vectors $z_{1:m}$. Both the encoder and decoder are trained end-to-end at the same time. The target images for the decoder are the same in both the global and compositional case: the globally masked image $I^i \odot M^i_{\text{tot}}$ ($\odot$ is the element-wise product). In the compositional case, this can be computed with $M^i_{\text{tot}} = \bigvee_{j=1}^{m} M^i_j$. By fusing the information from the multiple views of the objects into the latent vectors, from which the decoder has to be able to render the scene from multiple views, this auto-encoder framework can learn latent vectors that represent the 3D configurations (shape and pose) of the objects in the scene.
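As a rough sketch of how the training objective in Eq. (3) might be wired up (PyTorch-style pseudocode; `encoder` and `decoder` stand in for the architectures of Secs. 4.3 and 4.4, and the tensor layout is our own assumption, not the authors' exact implementation):

```python
def autoencoder_loss(encoder, decoder, images, cam_mats, masks, mask_tot):
    """Multi-view reconstruction loss, cf. Eq. (3).

    images: (V, 3, h, w); cam_mats: (V, 3, 4); masks: (m, V, 1, h, w)
    per-object masks; mask_tot: (V, 1, h, w) union of the object masks.
    """
    # One latent per object, each fused from all V masked views (Eq. 2).
    latents = [encoder(images, cam_mats, masks[j]) for j in range(masks.shape[0])]
    loss = 0.0
    for i in range(images.shape[0]):
        recon = decoder(latents, cam_mats[i])  # render view i from z_{1:m}
        target = images[i] * mask_tot[i]       # globally masked target image
        loss = loss + ((recon - target) ** 2).mean()
    return loss
```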
### 4.3 Latent-Conditioned NeRF Decoder Details

**Global.** The original NeRF formulation [16] learns a fully connected network $f$ that represents one single scene (Sec. 3.2). In order to create a decoder from NeRFs within an auto-encoder to learn a latent space, we condition the NeRF $f(\cdot, z)$ on the latent vector $z \in \mathbb{R}^k$ [42, 43, 44]. While approaches such as [42, 43, 44] use the latent code to represent factors such as lighting or category-level generalization, in our case the latent code is intended to represent the scene variation, i.e., the shape and configuration of objects, such that a downstream RL agent may use it as a state representation.

**Compositional.** In the compositional case, the encoder produces a set of latent vectors $z_{1:m}$ describing each object $j = 1, \ldots, m$ individually. This leads to $m$ NeRFs $(\sigma_j(x), c_j(x)) = f_j(x) = f(x, z_j)$, $j = 1, \ldots, m$, with their associated volume densities $\sigma_j$ and color values $c_j$. Note that while one could use different networks $f_j$ with their own weights for each object, we use a single network $f$ for all objects. This means that the object's pose as well as its shape and type are represented through the latent code $z_j$. In order to force these conditioned NeRFs to learn the 3D configuration of each object separately, we compose them into a global NeRF model with the composition formulas (proposed, e.g., by [80, 81])

$$\sigma(x) = \sum_{j=1}^{m} \sigma_j(x), \qquad c(x) = \frac{1}{\sigma(x)} \sum_{j=1}^{m} \sigma_j(x)\, c_j(x).$$

As this composition happens in 3D space, the latent vectors will be learned such that they correctly represent the actual shape and pose of the objects in the scene with respect to the other objects, which we hypothesize may be useful for the downstream RL agent.
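At every 3D query point, this composition amounts to summing the per-object densities and density-weighting the per-object colors. A minimal sketch (NumPy; the epsilon guard against division by zero in empty space is our own addition):

```python
import numpy as np

def compose_nerfs(sigmas, colors, eps=1e-8):
    """Compose m object NeRFs at a batch of query points.

    sigmas: (m, N) per-object densities sigma_j(x); colors: (m, N, 3)
    per-object colors c_j(x). Returns the composed (sigma(x), c(x))
    that enters the rendering relation of Eq. (1).
    """
    sigma = sigmas.sum(axis=0)  # sum of densities
    color = (sigmas[..., None] * colors).sum(axis=0) / (sigma[..., None] + eps)
    return sigma, color
```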
### 4.4 Encoder Details

The encoder $\Omega$ operates by fusing multiple views together to estimate the latent vector for the RL task. Since the scientific question of this work is whether a decoder built from NeRFs to train the encoder end-to-end is beneficial for RL, we consider two different encoder architectures. The first is a 2D CNN that averages feature encodings from the different views, where each encoding is additionally conditioned on the camera matrix of that view. The second is based on a learned 3D neural vector field that incorporates 3D biases by fusing the different camera views in 3D space through 3D convolutions and camera projection. This way, we are able to distinguish between the importance of 3D priors incorporated into the encoder versus the decoder.

**Per-image CNN Encoder ("Image encoder").** For the global version, we utilize the network architecture from [11] as the encoder. In order to work with multiple objects in the compositional case, we modify the architecture from [11] by taking the object masks into account as follows. For each object $j$, the 2D CNN encoder computes

$$z_j = \Omega_{\text{CNN}}\left(I^{1:V}, K^{1:V}, M^{1:V}_j\right) = h_{\text{MLP}}\!\left( \frac{1}{V} \sum_{i=1}^{V} g_{\text{MLP}}\!\left( E_{\text{CNN}}\!\left(I^i \odot M^i_j\right), K^i \right) \right). \tag{4}$$

$E_{\text{CNN}}$ is a ResNet-18 [82] CNN feature extractor that determines a feature from the masked input image $I^i \odot M^i_j$ of object $j$ for each view $i$, which is then concatenated with the (flattened) camera matrix. The output of the network $g_{\text{MLP}}$ is hence the encoding of each view, including the camera information, which is averaged and then processed with $h_{\text{MLP}}$ to produce the final latent vector. Note that in the global case, we set $m = 1$, $M^i_{j=1} = M^i_{\text{tot}}$, such that $\Omega_{\text{CNN}}$ produces a single latent vector.

**Neural Field 3D CNN Encoder ("Field encoder").** Several authors [43] have considered incorporating 3D biases into an encoder by computing pixel-aligned features at queried 3D locations of the scene, fusing the information from the different camera views directly in 3D space. We utilize the encoder architecture from [53], where the idea is to learn a neural vector field $\varphi\left(I^{1:V}, M^{1:V}_j\right) : \mathbb{R}^3 \to \mathbb{R}^E$ over 3D space, conditioned on the input views and masks. The features of $\varphi$ are computed by projecting the query point into the camera coordinate system of the respective view. To turn $\varphi$ into a latent vector, it is queried on a workspace set $X_h$, a 3D grid of size $d_X \times h_X \times w_X$, and then processed by a 3D convolutional network, i.e., $z_j = E_{\text{3D CNN}}\left(\varphi\left(I^{1:V}, M^{1:V}_j\right)(X_h)\right)$. This method differs from [43, 83, 60] by computing a latent vector from the pixel-aligned features.

## 5 Baselines / Alternative State Representations

In this section, we briefly describe alternative ways of training an encoder for RL, which we investigate in the experiments as baselines and ablations. For details, refer to the appendix.

**Conv. Autoencoder.** This baseline uses a standard CNN decoder based on deconvolutions instead of NeRF to reconstruct the image from the latent representation, similar to [1]. With this baseline, we therefore investigate the influence of the NeRF decoder relative to CNN decoders. We follow the architecture of [11] for the deconvolution part in the global case. In the compositional case, we modify the architecture to deal with a set of individual latent vectors instead of a single, global one. The image $\tilde{I} = D_{\text{deconv}}\left(g_{\text{MLP}}\left(\frac{1}{m}\sum_{j=1}^{m} z_j\right), K\right)$ is rendered from $z_{1:m}$ by first averaging the latent vectors and then processing the averaged vector with a fully connected network $g_{\text{MLP}}$, leading to an aggregated feature. This aggregated feature is concatenated with the (flattened) camera matrix $K$ describing the desired view and then rendered into the image with $D_{\text{deconv}}$. In the experiments, we utilize this decoder as the supervision signal to train the latent space produced by the 2D CNN encoder from Sec. 4.4. In the compositional version, the 2D CNN encoder (Eq. 4) uses the same object masks as the compositional NeRF-RL variant.

**Contrastive Learning.** As an alternative to learning an encoder via a reconstruction loss, the idea of contrastive learning [84] is to define a loss function directly on the latent space that pulls latent vectors describing the same configuration together (positive samples) while pushing those representing different system states apart (negative samples). A popular way to achieve this is the InfoNCE loss [85, 64]. Let $y_i$ and $\tilde{y}_i$ be two different observations of the same state, where $\tilde{\cdot}$ denotes a perturbed/augmented version of the observation. For a mini-batch of observations $\{(y_i, \tilde{y}_i)\}_{i=1}^{n}$, after encoding those into their respective latent vectors $z_i = \Omega(y_i)$, $\tilde{z}_i = \Omega(\tilde{y}_i)$ with the encoder $\Omega$, the loss for that batch uses $(z_i, \tilde{z}_i)$ as a positive pair and $(z_i, \tilde{z}_{j \neq i})$ as negative pairs, or some similar variation. A crucial question in contrastive learning is how the observation $y$ is perturbed/augmented into $\tilde{y}$ to generate positive and negative training pairs, as described in the following; a sketch of the InfoNCE loss itself is given below.
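For concreteness, a common instantiation of this loss can be sketched as follows (PyTorch). The bilinear similarity with a learned matrix $W$ follows the public CURL implementation [5]; other variants use cosine similarity instead, so treat these details as illustrative rather than as the exact loss used here:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z, z_aug, W):
    """InfoNCE over a batch: z, z_aug are (n, k) latents of paired
    observations; W is a learned (k, k) similarity matrix as in CURL [5].
    Row i's positive is z_aug[i]; all other rows serve as negatives."""
    logits = z @ W @ z_aug.T                                   # (n, n) similarities
    logits = logits - logits.max(dim=1, keepdim=True).values   # numerical stability
    labels = torch.arange(z.shape[0])                          # positives on diagonal
    return F.cross_entropy(logits, labels)
```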
**CURL.** In CURL [5], the input image is randomly cropped to generate $y$ and $\tilde{y}$. We closely follow the hyperparameters and design of [5]. CURL operates on a single input view, and we choose for this baseline the view from which the state of the environment can be inferred as well as possible (Fig. 17).

**Multi-View CURL.** This baseline investigates whether the neural field 3D encoder (Sec. 4.4) can be trained with a contrastive loss. As this encoder operates on multiple input views, we double the number of available camera views. Half of the views are the same as in the other experiments; the other half are captured from slightly perturbed camera angles. We use the same loss as CURL, but the contrastive pairs come from different views rather than from augmentation, in the style of TCN [68]: positive pairs come from different views at the same moment in time, while negative pairs come from different times. Therefore, this baseline can be seen as a multi-view adaptation of CURL [5].

**Direct State / Keypoint Representations.** Finally, we also consider a direct, low-dimensional representation of the state. Since we are interested in generalizing over different object shapes, we consider multiple 3D keypoints that are attached at relevant locations of the objects by expert knowledge and observed with a perfect keypoint detector [8]. See Fig. 2b for a visualization of those keypoints. The keypoints provide information about both the object's shape and its pose. Furthermore, as seen in Fig. 2b, they have been chosen to reflect those locations in the environment that are relevant to solving the task. Additionally, we report results where the state is represented by the poses of the objects; as this cannot represent object shape, we use a constant object shape for training and testing in this case.

## 6 Experiments

We evaluate our proposed method on different environments where the geometry of the objects in the scene is important for solving the task successfully. Please also refer to the video at https://dannydriess.github.io/nerf-rl. Commonly, RL is trained and evaluated on a single environment, where only the poses are changed, but the involved object shapes are kept constant. Since latent-conditioned NeRFs have been shown to be capable of generalizing over geometry [43], we consider experiments where we require the RL agent to generalize over object shapes within some distribution. Answering the scientific question of this work requires environments with multi-view observations and, for the compositional versions, object masks as well. These are not provided in standard RL benchmarks, which is the reason for choosing the environments investigated in this work. We use PPO [86] as the RL algorithm and four camera views in all experiments. Refer to the appendix for more details about our environments, parameter choices, network architectures, and training times.

### 6.1 Environments

**Mug on Hook.** In this environment, adopted from [87] and visualized in Fig. 2b, the task is to hang a mug on a hook. Both the mug and the hook shape are randomized. The actions are small 3D translations applied to the mug. This environment is challenging, as we require the RL agent to generalize over mug and hook shapes, and the tolerance between the handle opening and the hook is relatively small. Further, the agent receives a sparse reward only if the mug has been hung stably. This reward is calculated by virtually simulating a mug drop after each action: if the mug does not fall onto the ground from the current state, a reward of one is assigned, otherwise zero.
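Conceptually, this reward computation amounts to the following sketch, where `sim`, `release_mug`, `physics_step`, and `on_ground` are hypothetical placeholders for a simulator API rather than the authors' code:

```python
def mug_hang_reward(sim, n_settle_steps=200):
    """Sparse reward: virtually release the mug and let physics settle;
    reward 1 if the mug did not end up on the ground, else 0."""
    state = sim.save_state()        # remember the actual environment state
    sim.release_mug()               # hypothetical: detach the mug so it can drop
    for _ in range(n_settle_steps):
        sim.physics_step()          # hypothetical: advance the simulation
    reward = 0.0 if sim.on_ground("mug") else 1.0
    sim.restore_state(state)        # the check must not disturb the episode
    return reward
```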
**Planar Pushing.** The task in this environment, shown in Fig. 3b, is to push yellow box-shaped objects into the left region of the table and blue objects into the right region with the red pusher, which can move in the plane, i.e., the action is two-dimensional. This is the same environment as in [53], with the same four camera views. Each run contains a single object on the table (plus the pusher). If the box has been pushed inside its respective region, a sparse reward of one is received, otherwise zero. The boxes in the environment have different sizes and two colors and are randomly initialized. In this environment, we cannot use keypoints for the multi-shape setting, as the reward depends on the object color; we evaluate the keypoints baseline only in the single-shape case (Appendix).

**Door Opening.** Fig. 4b shows the door environment, where the task is to open a sliding door with the red end-effector, which can be translated in 3 DoFs as the action. To solve this task, the agent has to push on the door handle. As the handle position and size are randomized, the agent has to learn to interact with the handle geometry accordingly. Interestingly, as can be seen in the video in the supplementary material, the agent often chooses to push on the handle only at the beginning, as afterwards it is sufficient to push the door itself at its side. The agent receives a sparse reward if the door has been opened sufficiently; otherwise, zero reward is assigned.

### 6.2 Results

Figs. 2a, 3a, and 4a show success rates (averaged over 6 independent experiment repetitions and over 30 test rollouts per repetition per timestep) as a function of training steps, together with 68% confidence intervals. These success rates have been evaluated using randomized object shapes and initial conditions, and therefore reflect the agent's ability to generalize over these. In all these experiments, a latent space trained with compositional NeRF supervision as the decoder consistently outperformed all other learned representations, both in terms of sample efficiency and asymptotic performance. Furthermore, our proposed framework with compositional NeRF even outperforms the expert keypoint representation. For the door environment, the 3D neural field encoder plus NeRF decoder (NeRF-RL comp. + field) reaches nearly perfect success rates. For the other two environments, the compositional 2D CNN encoder plus NeRF decoder (NeRF-RL comp. + image) was slightly, though not significantly, better than the neural field encoder variant. This shows that it is the decoder built from compositional NeRF that is relevant for the performance, not so much the choice of the encoder. Training the 3D neural field encoder with a contrastive loss that uses different camera views as positive/negative training pairs (Multi-CURL) is not able to achieve significant learning progress in these scenarios. However, the other contrastive baseline, CURL, which has a different encoder and uses image cropping as data augmentation instead of additional camera views, achieves decent performance and sample efficiency on the door environment, but not on the pushing environment. In the mug environment, CURL initially makes learning progress comparable to our framework, but never reaches a success rate above 59% and then becomes unstable. Similarly, the global CNN autoencoder baseline shows decent learning progress initially in the mug and pushing scenarios (not for the door), but then becomes unstable (mug) or never surpasses a 50% success rate (pushing).

| Method | Encoder | Decoder | Loss |
| --- | --- | --- | --- |
| NeRF-RL comp. + field (ours) | 3D CNN | comp. 3D NeRF | image reconstr.: L2 |
| NeRF-RL comp. + image (ours) | 2D CNN | comp. 3D NeRF | image reconstr.: L2 |
| NeRF-RL global + image (ours) | 2D CNN | global 3D NeRF | image reconstr.: L2 |
| Conv. Autoencoder, c | 2D CNN | comp. 2D CNN | image reconstr.: L2 |
| Conv. Autoencoder, g | 2D CNN | 2D CNN | image reconstr.: L2 |
| CURL | 2D CNN | - | contrast: InfoNCE |
| Multi-CURL | 3D CNN | - | contrast: InfoNCE |
| Keypoints | chosen by expert knowledge and perfect extraction | | |

*Table 1: Overview of the different state representation learning frameworks.*
*Figure 2: Mug-on-hook environment. (a) Learning curves: success rate over the number of experienced transitions for the methods of Table 1. (b) Left: blue coordinate frames denote the four camera poses; red points are the expert keypoints. Right: NeRF renderings of different scenes.*

*Figure 3: Pushing environment. (a) Learning curves: success rate over the number of experienced transitions. (b) Left: blue coordinate frames denote the four camera poses; yellow and blue areas are the goal regions the objects should be pushed to. Right: NeRF renderings for different scenes.*

*Figure 4: Door environment. (a) Learning curves: success rate over the number of experienced transitions. (b) Left: blue coordinate frames denote the four camera poses. Right: NeRF renderings for different scenes.*

Such variations in performance or unstable learning across the different environments have not been observed with our method, which is stable in all cases. The compositional variant (NeRF-RL comp.) of our framework achieves the highest performance. Since the compositional conv. autoencoder baseline performs worse than its global variant, compositionality alone is not the sole reason for the better performance of our state representation. Indeed, the global NeRF-RL + image variant in the pushing environment is also better than all other baselines. In Appendix Sec. A.1, we find a positive correlation between NeRF reconstruction quality and RL performance. Furthermore, it turns out that the performance of our framework is not significantly affected when we pretrain the encoder with less data (Sec. A.2). In Sec. A.3, we investigate the influence of the number of input views on the RL performance. In the pushing scenario, only two or even one input view is sufficient for good performance. However, for tasks that require more 3D understanding, such as the mug scenario, we observe a drop in performance when reducing the number of views from 4 to 2.

## 7 Discussion

**Why NeRF provides better supervision.** The NeRF training objective (1) strongly forces each $f(\cdot, z_j)$ to represent each object in its actual 3D configuration relative to the other objects in the scene (compositional case), including its shape. This implies that the latent vectors $z_j$ have to contain this information, i.e., they are trained to determine the object type, shape, and pose in the scene. In the global case, $z_1$ has to represent the geometry of the whole scene. As the tasks we consider require policies to take the geometry of the objects into account, we hypothesize that a latent vector that is capable of parameterizing a NeRF to reconstruct the scene in 3D space contains enough of the relevant 3D information about the objects for the policy to be successful.
**Masks.** In order for the auto-encoder framework to be compositional, it requires object masks. We believe that instance segmentation has reached a level of maturity [88] that makes this a fair assumption. As we also utilize the individual masks for the compositional conv. autoencoder and the multi-view CURL baselines, which do not show good performance, the masks are not the main reason that our state representation achieves higher performance. This is further supported by the fact that the global NeRF-RL variant, which does not rely on individual object masks, achieved a performance higher than all baselines in the pushing scenario; i.e., masks increase the performance of NeRF-RL insofar as they enable the compositional version, but they do not seem essential.

**Offline/Online.** In this work, we focused on pretraining the latent representation offline from a dataset collected by random actions. During RL, the encoder is fixed, and only the policy networks are learned. This has the advantage that the same representation can be used for different RL tasks, and the dataset to train the representation does not necessarily have to come from the same distribution. However, if a policy is needed to explore reasonable regions of the state space, collecting a dataset offline to learn a latent space that covers the state space sufficiently might be more challenging. This was not an issue for our experiments, where data collection with random actions was sufficient. Indeed, we show generalization over different starting states of the same environment and with respect to different shapes (within distribution). Future work could investigate NeRF supervision in an online setup. Note that the reconstruction loss via NeRF is computationally more demanding than via a 2D CNN deconvolution decoder or a contrastive term, making NeRF supervision as an auxiliary loss at each RL training step costly. One potential solution is to apply the auxiliary loss not at every RL training step, but at a lower frequency. Regarding computational efficiency, this is where contrastive learning has an advantage over our proposed NeRF-based decoder: the encoding with CURL can be trained within half a day, whereas the NeRF auto-encoder took up to two days to train for our environments. However, when using the encoder for RL, there is no difference in inference time.

**Multi-View.** The auto-encoder framework we propose can fuse the information from multiple camera views into a latent vector describing an object in the scene. This way, occlusions can be addressed, and the agent can gain a better 3D understanding of the scene from the different camera angles. Having access to multiple camera views and their camera matrices is an additional assumption we make, although we believe the capability to utilize this information is an advantage of our method.

## 8 Conclusion

In this work, we have proposed the idea of utilizing Neural Radiance Fields (NeRFs) to train latent spaces for RL. Our environments focus on tasks where the geometry of the objects in the scene is relevant for successfully solving the task. Training RL agents with the pretrained encoder, which maps multiple views of the scene to a latent space, consistently outperformed other ways of learning a state representation and even keypoints chosen by expert knowledge. Our results show that the 3D prior present in compositional NeRF as the decoder is more important than priors in the encoder.
**Broader Impacts.** Our main contribution is a method to learn representations that improve the efficiency of vision-based RL, which could impact automation. As such, our work inherits the general ethical risks of AI, such as the question of how to address the potential for increased automation in society.

**Acknowledgments.** The authors thank Russ Tedrake for initial discussions; Jonathan Tompson and Jon Barron for feedback on drafts; and Vincent Vanhoucke for encouraging latent NeRFs. This research has been supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy, EXC 2002/1 "Science of Intelligence", project number 390523135. Danny Driess thanks the International Max-Planck Research School for Intelligent Systems (IMPRS-IS) for the support. Ingmar Schubert acknowledges support by the German Academic Scholarship Foundation. Yunzhu Li acknowledges support by Amazon.com Services LLC, PO# 2D-06310236, and the Wistron Corporation.

## References

[1] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 512-519. IEEE, 2016.

[2] R. Jonschkowski, R. Hafner, J. Scholz, and M. Riedmiller. PVEs: Position-velocity encoders for unsupervised learning of structured state representations. arXiv preprint arXiv:1705.09805, 2017.

[3] D. Dwibedi, J. Tompson, C. Lynch, and P. Sermanet. Learning actionable representations from visual observations. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1577-1584. IEEE, 2018.

[4] T. D. Kulkarni, A. Gupta, C. Ionescu, S. Borgeaud, M. Reynolds, A. Zisserman, and V. Mnih. Unsupervised learning of object keypoints for perception and control. Advances in Neural Information Processing Systems, 32, 2019.

[5] M. Laskin, A. Srinivas, and P. Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pages 5639-5650. PMLR, 2020.

[6] L. Manuelli, Y. Li, P. Florence, and R. Tedrake. Keypoints into the future: Self-supervised correspondence in model-based reinforcement learning. arXiv preprint arXiv:2009.05085, 2020.

[7] M. Vecerik, J.-B. Regli, O. Sushkov, D. Barker, R. Pevceviciute, T. Rothörl, C. Schuster, R. Hadsell, L. Agapito, and J. Scholz. S3K: Self-supervised semantic keypoints for robotic manipulation via multi-view consistency. arXiv preprint arXiv:2009.14711, 2020.

[8] L. Manuelli, W. Gao, P. Florence, and R. Tedrake. kPAM: Keypoint affordances for category-level robotic manipulation. arXiv preprint arXiv:1903.06684, 2019.

[9] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.

[11] Y. Li, S. Li, V. Sitzmann, P. Agrawal, and A. Torralba. 3D neural scene representations for visuomotor control. In Conference on Robot Learning, pages 112-123. PMLR, 2022.

[12] S. Lange and M. Riedmiller. Deep auto-encoder neural networks in reinforcement learning. In The 2010 International Joint Conference on Neural Networks (IJCNN), 2010.

[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin.
Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[15] I. Akinola, J. Varley, and D. Kalashnikov. Learning precise 3D manipulation from multiple uncalibrated cameras. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 4616-4622. IEEE, 2020.

[16] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, pages 405-421. Springer, 2020.

[17] F. Dellaert and L. Yen-Chen. Neural volume rendering: NeRF and beyond, 2021.

[18] S. A. Eslami, D. Jimenez Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204-1210, 2018.

[19] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 165-174, 2019.

[20] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460-4470, 2019.

[21] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5939-5948, 2019.

[22] V. Sitzmann, M. Zollhöfer, and G. Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. Advances in Neural Information Processing Systems, 32, 2019.

[23] Y. Xie, T. Takikawa, S. Saito, O. Litany, S. Yan, N. Khan, F. Tombari, J. Tompkin, V. Sitzmann, and S. Sridhar. Neural fields in visual computing and beyond. arXiv preprint arXiv:2111.11426, 2021.

[24] A. Tewari, J. Thies, B. Mildenhall, P. Srinivasan, E. Tretschk, Y. Wang, C. Lassner, V. Sitzmann, R. Martin-Brualla, S. Lombardi, et al. Advances in neural rendering. arXiv preprint arXiv:2111.05849, 2021.

[25] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855-5864, 2021.

[26] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. arXiv preprint arXiv:2111.12077, 2021.

[27] M. Tancik, V. Casser, X. Yan, S. Pradhan, B. Mildenhall, P. P. Srinivasan, J. T. Barron, and H. Kretzschmar. Block-NeRF: Scalable large scene neural view synthesis. arXiv preprint arXiv:2202.05263, 2022.

[28] M. Boss, R. Braun, V. Jampani, J. T. Barron, C. Liu, and H. Lensch. NeRD: Neural reflectance decomposition from image collections. https://arxiv.org/abs/2012.03918, 2020.

[29] P. Srinivasan, B. Deng, X. Zhang, M. Tancik, B. Mildenhall, and J. T. Barron. NeRV: Neural reflectance and visibility fields for relighting and view synthesis. https://arxiv.org/abs/2012.03927, 2020.

[30] X. Zhang, P. P. Srinivasan, B. Deng, P. Debevec, W. T. Freeman, and J. T.
Barron. NeRFactor: Neural factorization of shape and reflectance under an unknown illumination. https://arxiv.org/abs/2106.01970, 2021.

[31] L. Liu, J. Gu, K. Z. Lin, T.-S. Chua, and C. Theobalt. Neural sparse voxel fields. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, 2020.

[32] D. Lindell, J. Martel, and G. Wetzstein. AutoInt: Automatic integration for fast neural volume rendering. https://arxiv.org/abs/2012.01714, 2020.

[33] D. Rebain, W. Jiang, S. Yazdani, K. Li, K. M. Yi, and A. Tagliasacchi. DeRF: Decomposed radiance fields. https://arxiv.org/abs/2011.12490, 2020.

[34] T. Neff, P. Stadlbauer, M. Parger, A. Kurz, J. H. Mueller, C. R. A. Chaitanya, A. S. Kaplanyan, and M. Steinberger. DONeRF: Towards real-time rendering of compact neural radiance fields using depth oracle networks. Computer Graphics Forum, 40(4), 2021. ISSN 1467-8659. doi: 10.1111/cgf.14340. URL https://doi.org/10.1111/cgf.14340.

[35] S. J. Garbin, M. Kowalski, M. Johnson, J. Shotton, and J. Valentin. FastNeRF: High-fidelity neural rendering at 200fps. https://arxiv.org/abs/2103.10380, 2021.

[36] C. Reiser, S. Peng, Y. Liao, and A. Geiger. KiloNeRF: Speeding up neural radiance fields with thousands of tiny MLPs, 2021.

[37] A. Yu, R. Li, M. Tancik, H. Li, R. Ng, and A. Kanazawa. PlenOctrees for real-time rendering of neural radiance fields. In arXiv, 2021.

[38] S. Lombardi, T. Simon, G. Schwartz, M. Zollhoefer, Y. Sheikh, and J. Saragih. Mixture of volumetric primitives for efficient neural rendering, 2021.

[39] A. Yu, S. Fridovich-Keil, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa. Plenoxels: Radiance fields without neural networks. arXiv preprint arXiv:2112.05131, 2021.

[40] V. Sitzmann, S. Rezchikov, W. T. Freeman, J. B. Tenenbaum, and F. Durand. Light field networks: Neural scene representations with single-evaluation rendering. In arXiv, 2021.

[41] T. Müller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1-102:15, July 2022. doi: 10.1145/3528223.3530127. URL https://doi.org/10.1145/3528223.3530127.

[42] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth. NeRF in the wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7210-7219, 2021.

[43] A. Yu, V. Ye, M. Tancik, and A. Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578-4587, 2021.

[44] Q. Wang, Z. Wang, K. Genova, P. P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser. IBRNet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690-4699, 2021.

[45] K. Zhang, G. Riegler, N. Snavely, and V. Koltun. NeRF++: Analyzing and improving neural radiance fields. https://arxiv.org/abs/2010.07492, 2020.

[46] M. Niemeyer and A. Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. https://arxiv.org/abs/2011.12100, 2020.

[47] M. Guo, A. Fathi, J. Wu, and T. Funkhouser. Object-centric neural scene rendering. https://arxiv.org/abs/2012.08503, 2020.

[48] W. Yuan, Z. Lv, T. Schmidt, and S. Lovegrove. STaR: Self-supervised tracking and reconstruction of rigid objects in motion with neural rendering.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13144-13152, 2021.

[49] Z. Wang, T. Bagautdinov, S. Lombardi, T. Simon, J. Saragih, J. Hodgins, and M. Zollhöfer. Learning compositional radiance fields of dynamic human heads. https://arxiv.org/abs/2012.09955, 2020.

[50] J. Ost, F. Mannan, N. Thuerey, J. Knodt, and F. Heide. Neural scene graphs for dynamic scenes. https://arxiv.org/abs/2011.10379, 2020.

[51] H.-X. Yu, L. J. Guibas, and J. Wu. Unsupervised discovery of object radiance fields. arXiv preprint arXiv:2107.07905, 2021.

[52] B. Yang, Y. Zhang, Y. Xu, Y. Li, H. Zhou, H. Bao, G. Zhang, and Z. Cui. Learning object-compositional neural radiance field for editable scene rendering. In International Conference on Computer Vision (ICCV), October 2021.

[53] D. Driess, Z. Huang, Y. Li, R. Tedrake, and M. Toussaint. Learning multi-object dynamics with compositional neural radiance fields. arXiv preprint arXiv:2202.11855, 2022.

[54] H. Zhang, R. Wang, J. Zhang, C. Li, G. Yang, P. Spincemaille, T. Nguyen, and Y. Wang. NeRD: Neural representation of distribution for medical image segmentation. arXiv preprint arXiv:2103.04020, 2021.

[55] L. Yen-Chen, P. Florence, J. T. Barron, A. Rodriguez, P. Isola, and T.-Y. Lin. iNeRF: Inverting neural radiance fields for pose estimation. IROS, 2021.

[56] M. Adamkiewicz, T. Chen, A. Caccavale, R. Gardner, P. Culbertson, J. Bohg, and M. Schwager. Vision-only robot navigation in a neural radiance world. IEEE Robotics and Automation Letters, 7(2):4606-4613, 2022.

[57] J. Ichnowski, Y. Avigal, J. Kerr, and K. Goldberg. Dex-NeRF: Using a neural radiance field to grasp transparent objects. arXiv preprint arXiv:2110.14217, 2021.

[58] L. Yen-Chen, P. Florence, J. T. Barron, T.-Y. Lin, A. Rodriguez, and P. Isola. NeRF-Supervision: Learning dense object descriptors from neural radiance fields. In IEEE Conference on Robotics and Automation (ICRA), 2022.

[59] K. Karunratanakul, J. Yang, Y. Zhang, M. J. Black, K. Muandet, and S. Tang. Grasping field: Learning implicit representations for human grasps. In 2020 International Conference on 3D Vision (3DV), pages 333-344. IEEE, 2020.

[60] J.-S. Ha, D. Driess, and M. Toussaint. Learning neural implicit functions as object representations for robotic manipulation. arXiv preprint arXiv:2112.04812, 2021.

[61] A. Simeonov, Y. Du, A. Tagliasacchi, J. B. Tenenbaum, A. Rodriguez, P. Agrawal, and V. Sitzmann. Neural descriptor fields: SE(3)-equivariant object representations for manipulation. arXiv preprint arXiv:2112.05124, 2021.

[62] Y. Wi, P. Florence, A. Zeng, and N. Fazeli. VIRDO: Visio-tactile implicit representations of deformable objects. arXiv preprint arXiv:2202.00868, 2022.

[63] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

[64] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597-1607. PMLR, 2020.

[65] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729-9738, 2020.

[66] A. Srinivas, M. Laskin, and P. Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning.
arXiv preprint arXiv:2004.04136, 2020.

[67] B. You, O. Arenz, Y. Chen, and J. Peters. Integrating contrastive learning with dynamic models for reinforcement learning from images. Neurocomputing, 2022.

[68] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1134-1141. IEEE, 2018.

[69] A. Kinose, M. Okada, R. Okumura, and T. Taniguchi. Multi-view dreaming: Multi-view world model with contrastive learning. arXiv preprint arXiv:2203.11024, 2022.

[70] K. Chen, Y. Lee, and H. Soh. Multi-modal mutual information (MuMMI) training for robust self-supervised deep reinforcement learning. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4274-4280. IEEE, 2021.

[71] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.

[72] A. Stooke, K. Lee, P. Abbeel, and M. Laskin. Decoupling representation learning from reinforcement learning. In International Conference on Machine Learning, pages 9870-9879. PMLR, 2021.

[73] S. Parisi, A. Rajeswaran, S. Purushwalkam, and A. Gupta. The unsurprising effectiveness of pre-trained vision models for control. arXiv preprint arXiv:2203.03580, 2022.

[74] D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus. Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741, 2019.

[75] A. Zhang, R. McAllister, R. Calandra, Y. Gal, and S. Levine. Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742, 2020.

[76] K. Zakka, A. Zeng, P. Florence, J. Tompson, J. Bohg, and D. Dwibedi. XIRL: Cross-embodiment inverse reinforcement learning. In Conference on Robot Learning, pages 537-546. PMLR, 2022.

[77] T. Xiao, I. Radosavovic, T. Darrell, and J. Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022.

[78] Y. Seo, K. Lee, S. James, and P. Abbeel. Reinforcement learning with action-free pre-training from videos. arXiv preprint arXiv:2203.13880, 2022.

[79] T. Lesort, N. Díaz-Rodríguez, J.-F. Goudou, and D. Filliat. State representation learning for control: An overview. Neural Networks, 108:379-392, 2018.

[80] M. Niemeyer and A. Geiger. GIRAFFE: Representing scenes as compositional generative neural feature fields. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2021.

[81] K. Stelzner, K. Kersting, and A. R. Kosiorek. Decomposing 3D scenes into objects via unsupervised volume segmentation. arXiv preprint arXiv:2104.01148, 2021.

[82] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[83] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2304-2314, 2019.

[84] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735-1742. IEEE, 2006.

[85] A. van den Oord, Y. Li, and O. Vinyals.
Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[86] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[87] D. Driess, J.-S. Ha, M. Toussaint, and R. Tedrake. Learning models as functionals of signed-distance fields for manipulation planning. In Conference on Robot Learning (CoRL), 2021.

[88] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961-2969, 2017.

[89] A. Raffin, A. Hill, M. Ernestus, A. Gleave, A. Kanervisto, and N. Dormann. Stable Baselines3. https://github.com/DLR-RM/stable-baselines3, 2019.