Representing Spatial Trajectories as Distributions

Dídac Surís, Columbia University, didac.suris@columbia.edu
Carl Vondrick, Columbia University, vondrick@cs.columbia.edu

We introduce a representation learning framework for spatial trajectories. We represent partial observations of trajectories as probability distributions in a learned latent space, which characterize the uncertainty about unobserved parts of the trajectory. Our framework allows us to obtain samples from a trajectory for any continuous point in time, both interpolating and extrapolating. Our flexible approach supports directly modifying specific attributes of a trajectory, such as its pace, as well as combining different partial observations into single representations. Experiments show our method's advantage over baselines in prediction tasks. See trajectories.cs.columbia.edu for video results and code.

1 Introduction

The visual world is full of objects moving around in predictable ways. Examples of these spatial trajectories include human motion, such as people dancing or exercising; objects moving, such as a ball rolling; trajectories of cars and bicycles; or animal migration patterns. Evidence suggests that the human perceptual system encodes motion into high-level neural codes that represent the motion holistically, going beyond the specific input observations [22]. Humans use this abstract representation for downstream tasks like inferring intention [6]. Computer vision systems likely need similar mechanisms to encode trajectories and motions into global representations. Representation learning has been transformative in other domains, such as images and text, for its ability to produce high-level representations that reorganize the information in the input and perform better on downstream tasks than the original signals. A global representation of trajectories would allow us to evaluate a trajectory at any point in time, even ones not yet observed.

However, modeling trajectories presents a series of challenges for representation learning. First, in real-time scenarios, the future of the trajectory is never observed. Second, temporal and spatial occlusions may impede observing part of a trajectory. Third, trajectories are by nature continuous in time. And finally, a trajectory-level metric is usually not well defined and is application-dependent. We propose a representation learning framework for trajectories that deals with all these challenges in a unified way. Our key contribution is the representation of a partial observation of a trajectory as a probability distribution in a learned latent space, which represents all the possible trajectories the observation could have been sampled from. Our framework's simplicity and generality make it flexible: it does not constrain the input-space metric, accepts observations of different lengths and at any (irregularly sampled) point in time, can be implemented using different families of latent space distributions, and is capable of performing inference-time tasks for which it has not been explicitly trained. Our experiments on human movement datasets show that our method can accurately predict the past and future of a trajectory segment, as well as the interpolation between two different segments, outperforming autoregressive baselines. Additionally, it can do so for any continuous point in time. We also show how we can modify given trajectories by manipulating their representations.
Figure 1: Predictions on figure skating data (FisV). Our model is capable of predicting the future (top row), past, and interpolation (bottom row) of a trajectory given partial observations, at any continuous time. The inputs to the model are the keypoints in the images. See more examples in Fig. 4.

2.1 Framework, Definitions and Notation

The input to our framework is a sequence of samples obtained from a (continuous in time and infinite) spatial trajectory u, which we define as the continuous temporal evolution of a set of spatial coordinates. We call each sample a point x, which lives in the input space R^K. We call the sequence of points, together with the times t at which they were sampled, a segment s, which can be understood as a partial observation of u. We define a distance metric δ between points x in the input space. Our goal is to transform these measurements of motion s into a representation z that will be useful for downstream tasks.

We define a latent space R^N of trajectories z. Each z in this space represents the full extent of a trajectory, both in time and in space. We use Q to represent probability distributions over trajectories z in the latent space. We define a distance function D between distributions of trajectories, which assumes an underlying distance function d between trajectories z. We use an encoder Θ to encode every segment s to a probability distribution Q(·; s) = Θ(s) over trajectories, where Q(z; s) represents the probability that s was sampled from the trajectory represented by z. Additionally, we can decode a trajectory z at a specific time t by using a decoder Φ, obtaining a point x = Φ(z, t). Φ takes any continuous t as input. See Fig. 2 for a schematic.

2.2 Representation Learning

When observing a segment s, one may have some uncertainty about the specific trajectory it was sampled from. For instance, a segment showing a person jumping may correspond to a trajectory that continues with the person falling, or to a trajectory that proceeds with them doing a backflip and landing on their feet, but it will not belong to a trajectory of a person swimming. Therefore, we represent the segment as a distribution over trajectories, where Q(z; s) represents the likelihood of a trajectory given the segment.

During training, the goal is to learn this mapping from the input space (segments of trajectories) to the latent space (distributions over trajectories). Concretely, given two segments s_a, s_b that have been obtained from the same underlying trajectory, we want some z to exist such that its likelihood under the distributions Q_a and Q_b representing each of the segments is high. To encourage this, we train the model to maximize the overlap between the distributions Q_a and Q_b. Similarly, we minimize the overlap between (the distribution representations of) segments sampled from different trajectories, under the assumption that no trajectory z exists that contains both segments. Specifically, we minimize a self-supervised triplet loss:

$$\sum_{(i,\,k^+,\,k^-) \in \mathcal{T}} \max\left[ D\left(Q_i, Q_{k^+}\right) - D\left(Q_i, Q_{k^-}\right) + \alpha,\; 0 \right] \qquad (1)$$

where α is a margin hyperparameter, and \mathcal{T} is a set of triplets: for every segment i in the dataset, we define several triplets by sampling pairs consisting of a positive segment k^+ (such that (i, k^+) is a positive pair) and a negative segment k^- (such that (i, k^-) is a negative pair).
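As a minimal sketch of how Eq. (1) can be implemented, the snippet below assumes that the distances D between encoded segment distributions have already been computed for a batch of triplets; the function name and signature are illustrative, not the released code.

```python
import torch

def triplet_loss(d_pos: torch.Tensor, d_neg: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Hinge loss of Eq. (1): for each triplet, push the distance to the positive
    below the distance to the negative by at least the margin alpha.

    d_pos: D(Q_i, Q_{k+}) for a batch of triplets, shape [B]
    d_neg: D(Q_i, Q_{k-}) for the same triplets, shape [B]
    """
    return torch.clamp(d_pos - d_neg + alpha, min=0.0).sum()

# Illustrative usage: Q_i, Q_pos, Q_neg are encoded segment representations and
# D is one of the comparisons of Sec. 2.4 (symmetric distance or 1 - P(A|B)).
# loss = triplet_loss(D(Q_i, Q_pos), D(Q_i, Q_neg), alpha)
```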
In addition to learning representations of trajectories, we also wish to be able to decode them back to input-space points. To achieve this, we train a decoder Φ that allows us to obtain the specific value of any trajectory at any continuous time t.

Figure 2: Schematic of our framework. We show the input space R^K, the latent space R^N, and the mappings between the two (encoder Θ and decoder Φ). A segment s belonging to a trajectory u is encoded into a distribution Q, from which a trajectory z is sampled and decoded at a time t, to get x̂_t.

In order to train the decoder Φ, we sample trajectories z ∼ Q(·; s) from each segment representation, and decode them at specific time-steps t which were contained in s, obtaining a prediction x̂_t = Φ(z, t) for which we have ground truth x_t. There is no uncertainty in this prediction, as x_t was part of the segment s in the first place; the decoder is only explicitly trained for reconstruction, not extrapolation. We train the decoder via regression, using the point-wise distance δ. Note that we never explicitly define a trajectory-level distance in the input space; it is implicitly learned by the model. The reconstruction loss is defined as:

$$\sum_{i=1}^{N} \mathbb{E}_{z \sim \Theta(s_i)} \left[ \sum_{t=1}^{T_i} \delta\left( \Phi(z, t),\, x^i_t \right) \right] \qquad (2)$$

where N is the number of segments in the dataset, and T_i is the number of points in segment s_i. We minimize Eqs. (1) and (2) jointly and end-to-end. We implement the encoder Θ using a Transformer Encoder architecture [54], and the decoder Φ using a ResNet [17]. See Appendix C for more details.

2.3 Creating Positive and Negative Pairs

In order to define positive and negative pairs for Eq. (1), we use the following:

Input-space relationships. The simplest way is to take segments from the same trajectory as positives and segments from other random trajectories as negatives. The initial segments can have different relationships, such as precedence, containment, or overlap [3]. In our experiments, we sample three segments for every trajectory: a past segment (P in Fig. 3), a future segment (F) whose starting time comes right after the end of the past segment, and a combination segment (C), which contains both the past and the future segments.

Intersection. An intersection I of two distributions Q in the latent space will represent all the trajectories that have a high likelihood for both intersected segments. Note that an intersection in the latent space is a union in the input space: the intersection constrains the possible trajectories to those that are consistent simultaneously with the two segments. Similarly, an intersection in the input space (assuming an overlap between segments) is a union in the latent space. In the latent space, the intersection of the past and future segments should be equal to the representation of the combination segment, and therefore the pair (C, I) is a positive one.

Re-encoding. Given a trajectory z, we can decode it into any set of times t, obtaining a new segment. This segment can be (re-)encoded using Θ, and a representation Q can be obtained for it, resulting in a new positive or negative for other segment representations. For example, when given the past we randomly sample a possible future, the resulting segment (FP, future given past) will generally differ from the ground-truth future, so the pair (F, FP) is treated as a negative. We exemplify a combination of these possibilities in Fig. 3.
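A minimal sketch of the reconstruction objective in Eq. (2) for the Gaussian latent family is given below, assuming an encoder that returns the mean and standard deviation of Q and a decoder that takes a sampled z and a continuous time; all names are illustrative placeholders rather than the released implementation.

```python
import torch

def reconstruction_loss(encoder, decoder, segments, delta):
    """Sketch of Eq. (2): for each segment, sample z ~ Theta(s_i) with the
    reparameterization trick [24] and regress the decoded points against the
    observed points of that same segment (reconstruction, not extrapolation)."""
    total = 0.0
    for points, times in segments:                   # points: [T_i, K], times: [T_i]
        mu, sigma = encoder(points, times)           # parameters of Q(.; s_i)
        z = mu + sigma * torch.randn_like(sigma)     # differentiable sample
        preds = torch.stack([decoder(z, t) for t in times])   # [T_i, K]
        total = total + sum(delta(p, x) for p, x in zip(preds, points))
    return total
```

During training this term would simply be added to the triplet loss of Eq. (1), since both are minimized jointly and end-to-end.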
In order to determine which pairs of segments are positive and which are negative, the rule is always the same: if they can belong to the same trajectory they are positives; otherwise they are negatives. For example, looking at Fig. 3 it is clear that, as discussed above, there is no trajectory that can contain both F and FP. We list all negative and positive pairs in Appendix C.1.

Figure 3: Examples of segments. We illustrate how spatial trajectories (left) are ideally encoded into the latent space (right). The intersection between two segment representations (boxes in the figure) represents the trajectories that contain the two segments. "Future given past" represents a segment decoded at a future time, from a trajectory sampled from the past representation. It is effectively a sample of a possible future given the past. Other segments are defined similarly. For clarity, we do not show other options like "past given past", which would be the same box as past P. Best viewed in color.

2.4 Comparing Distributions

Eq. (1) uses the distance function D to compare distributions of trajectories. In this section, we introduce two different ways of designing D, resulting in different intuitions about the latent space.

Symmetric Distance. If two segments can belong to the same trajectory, the distributions Q representing each segment should be similar and close to each other (positives), and the (symmetric) distance D between them should be small. For example, in Fig. 3, the representations of the past P and future F segments belonging to the same trajectory are treated as positives.

Conditional. Instead of computing a distance or a similarity, we compute the probability that a segment s_a belongs to the same trajectory as another segment s_b. We model this as a conditional probability P(Q_a | Q_b). There are four possibilities:

1. P(Q_a | Q_b) = 1, when s_b includes s_a, like the combination segment C including the past P.
2. 0 < P(Q_a | Q_b) < 1, when Q_a is possible but not certain given Q_b, like P and F in Fig. 3.
3. 0 < P(Q_b | Q_a) < 1, defined in a similar manner.
4. P(Q_a | Q_b) = 0, for unrelated segments, like C and O.

We treat the first three cases as positives, and the last case as a negative. Because pairs belonging to the first case have a stricter correspondence than those belonging to the second and third cases, we sample them more often during training. Note that under this interpretation, a past P and a future F from the same trajectory do not have a strong correspondence (first case), but a softer one (second and third cases): one does not fully define the other. This approach results in probability values that we either maximize (positives) or minimize (negatives), so we define D(A, B) = 1 − P(A | B).

The previous approaches require a way of computing either a distance between the distributions Q, or a conditional probability between them. In the next section, we show two families of distributions for which these can be defined.
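As a concrete illustration of the symmetric-distance option, the sketch below assumes segment representations that are diagonal (uncorrelated) Gaussians, one of the two families introduced in the next section; the symmetrized KL divergence and product-based intersection are standard closed forms, and the function names are ours.

```python
import torch

def sym_kl(mu_a, var_a, mu_b, var_b):
    """Symmetrized KL divergence between diagonal Gaussians, summed over dimensions.
    Can serve as the symmetric distance D of Sec. 2.4 (not a proper metric)."""
    kl_ab = 0.5 * (var_a / var_b + (mu_b - mu_a) ** 2 / var_b - 1 + torch.log(var_b / var_a)).sum(-1)
    kl_ba = 0.5 * (var_b / var_a + (mu_a - mu_b) ** 2 / var_a - 1 + torch.log(var_a / var_b)).sum(-1)
    return kl_ab + kl_ba

def gaussian_intersection(mu_a, var_a, mu_b, var_b):
    """Intersection of two segment representations: the (renormalized) product of
    two uncorrelated Gaussians, which is again an uncorrelated Gaussian."""
    var = 1.0 / (1.0 / var_a + 1.0 / var_b)
    mu = var * (mu_a / var_a + mu_b / var_b)
    return mu, var
```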
2.5 Trajectory Segments as Distributions

In order to obtain Q, the encoder Θ predicts the parameters of a distribution family. The conditions for the distribution family are: 1) we can sample from it in a differentiable way, 2) we can parameterize it, 3) we can compute, in closed form, an intersection that returns a distribution from the same family, and 4) we can compute either a similarity function or a conditional probability, or both (see Sec. 2.4). Next, we introduce two distribution families that meet these criteria.

Normal distributions. We use uncorrelated multivariate normal distributions, and parameterize them with a mean µ and a standard deviation σ. We compute the intersection as the product of two normal distributions, which remains normal when the dimensions are uncorrelated (see Appendix D). We use the symmetrized Kullback-Leibler (KL) divergence between distributions as a distance function. This distance is not a proper metric; alternatives are discussed in Appendix D. Normal distributions assume an underlying Euclidean distance metric d between trajectories z.

Box embeddings. Box embeddings [55] represent objects with high-dimensional products of intervals (or boxes), parameterized by their two extreme vertices, a lower corner z and an upper corner z̄. The intersection between box embeddings is well defined and results in another box embedding. This makes them a natural choice to represent conditional probabilities, which can be computed as P(A|B) = Vol(A ∩ B)/Vol(B), where Vol(A) = ∏_{i=1}^{N} max(z̄_i − z_i, 0) is the volume of the box, and ∩ represents the intersection operation. These operations are straightforward to compute. Boxes are not actual distributions, as they need not integrate to one. However, they are easily normalized by dividing by their volume, and therefore they can be treated as distributions for all the practical purposes required in our framework (i.e., sampling, where we approximate the boxes with a uniform distribution). Symmetric distance functions can also be defined on box embeddings; we define a few in Appendix D.2.

In both cases, we use the reparameterization trick [24] in order to sample from the distributions while keeping gradient information. We found the best-performing option was using box embeddings under the conditional scenario; the values reported in Section 3.2 use this setting.

2.6 Inference

Once trained, our decoder Φ is able to decode a trajectory at any continuous time t, including times that were not part of the input. For example, our framework can decode a future segment given an input past segment, by sampling from its representation and evaluating that sample at some future times. This future segment will not necessarily be equal to the ground truth future segment (in case it exists), because a single past can have multiple futures. Overall, our framework is capable of doing 1) future and past prediction, by decoding a segment at times outside of its range; 2) continuous reconstruction given a discrete input, by decoding at any continuous time t; 3) interpolation between two segments, by decoding trajectories in their latent-space intersection; and 4) modifying existing trajectories, by manipulating the latent space. All the previous tasks are possible without explicitly training to do any of them. We show examples in Sec. 3.
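The box-embedding operations above are simple enough to sketch directly. The snippet below uses hard (non-smoothed) boxes for clarity, whereas practical implementations typically smooth the max operations (see the box-embedding literature cited in this paper); all names are illustrative.

```python
import torch

def box_intersection(lo_a, hi_a, lo_b, hi_b):
    """Intersection of two axis-aligned boxes: per-dimension max of lower corners
    and min of upper corners. The result is again a box (possibly empty)."""
    return torch.maximum(lo_a, lo_b), torch.minimum(hi_a, hi_b)

def box_volume(lo, hi):
    """Vol(A) = prod_i max(hi_i - lo_i, 0); zero if the box is empty."""
    return torch.clamp(hi - lo, min=0.0).prod(-1)

def conditional_prob(lo_a, hi_a, lo_b, hi_b, eps=1e-8):
    """P(A|B) = Vol(A ∩ B) / Vol(B), used for the conditional comparison of Sec. 2.4."""
    lo_i, hi_i = box_intersection(lo_a, hi_a, lo_b, hi_b)
    return box_volume(lo_i, hi_i) / (box_volume(lo_b, hi_b) + eps)

def sample_from_box(lo, hi):
    """Approximate the box with a uniform distribution and draw a differentiable
    (reparameterized) sample of a trajectory z."""
    return lo + torch.rand_like(lo) * (hi - lo)
```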
3 Experiments

3.1 Datasets

For our experiments, we selected data adhering to the following criteria. First, there has to be uncertainty in the trajectory when given just a segment (for instance, the future is not fully specified given the past). Second, the prediction should not require external contextual information. Context can be seamlessly added to our architecture, but it involves additional task-specific engineering decisions, and we want our evaluation to be orthogonal to them. Similarly, we avoid trajectories that require highly-engineered point-level distances δ. Finally, we prefer our trajectories to be obtained from real-world data.

For all the previous reasons, we implement our framework on human movement datasets. Specifically, we extract keypoints from human action datasets using OpenPose [10]. For every video, we keep the most salient human trajectories. This results in sequences of dimension [L, 25, 2], where L is the number of frames in the trajectory, 25 is the number of joints in a human skeleton extracted by OpenPose, and 2 corresponds to the number of spatial coordinates for every joint. We refer to the whole skeleton at every time-step (the combination of all joints, resulting in a K = 50-dimensional vector) as a point. As a distance function δ between points (i.e., skeletons) we use the per-joint l2 distance, averaged across all visible joints. We extract human movement trajectories from the FineGym [45], Diving48 [29] and FisV [57] datasets, which correspond to gymnastics, diving and figure skating, respectively. For each of the datasets, we experiment with short sequences (up to 10 time-steps, or slightly over one second) and long ones (up to 30 time-steps, representing slightly under four seconds of the trajectory), and report results for both. We provide more details on the dataset creation in Appendix B.

Table 1: Prediction results. We report the mean squared error (the lower the better) across keypoints, after normalizing each trajectory to be contained in a region of size 100 × 100. F, P and I stand for future, past and interpolation, respectively. Values are obtained over 10 runs with different test-time random seeds (changes include sampled segments and sampled z). An extended table with standard deviations is in Appendix E.

(a) Long sequences

|                          | FineGym F | FineGym P | FineGym I | Diving48 F | Diving48 P | Diving48 I | FisV F | FisV P | FisV I |
|--------------------------|-----------|-----------|-----------|------------|------------|------------|--------|--------|--------|
| VRNN [13]                | 15.85     | 15.93     | 16.10     | 23.51      | 27.97      | 25.66      | 14.95  | 15.03  | 15.08  |
| Trajectron++ uni. [44]   | 9.54      | 9.98      | 9.73      | 11.67      | 16.52      | 11.98      | 11.42  | 11.85  | 11.68  |
| Trajectron++ [44]        | 9.72      | 10.01     | 9.89      | 11.59      | 16.23      | 12.68      | 11.41  | 11.71  | 11.63  |
| TrajRep (ours, ablation) | 8.82      | 9.07      | 7.57      | 10.00      | 11.74      | 10.06      | 10.62  | 11.27  | 9.70   |
| + re-encoding (ours)     | 8.50      | 8.83      | 7.11      | 9.81       | 12.00      | 9.58       | 10.32  | 10.77  | 9.22   |

(b) Short sequences

|                          | FineGym F | FineGym P | FineGym I | Diving48 F | Diving48 P | Diving48 I | FisV F | FisV P | FisV I |
|--------------------------|-----------|-----------|-----------|------------|------------|------------|--------|--------|--------|
| VRNN [13]                | 12.77     | 13.20     | 13.40     | 18.36      | 20.14      | 19.86      | 13.26  | 13.44  | 13.45  |
| Trajectron++ uni. [44]   | 7.80      | 8.28      | 7.48      | 9.05       | 10.36      | 8.29       | 9.23   | 9.68   | 8.86   |
| Trajectron++ [44]        | 7.26      | 7.93      | 6.94      | 8.74       | 11.35      | 8.31       | 8.70   | 9.28   | 8.28   |
| TrajRep (ours, ablation) | 6.49      | 6.59      | 5.15      | 6.94       | 6.99       | 5.00       | 7.83   | 8.17   | 6.01   |
| + re-encoding (ours)     | 6.20      | 6.36      | 4.88      | 6.76       | 6.85       | 5.04       | 7.54   | 7.78   | 5.88   |
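Before the quantitative results, here is a minimal sketch of the point-wise distance δ described above, together with one plausible implementation of the 100 × 100 normalization used when reporting Table 1; keypoints are assumed to be stored as [25, 2] tensors with an optional per-joint visibility mask (the exact encoding of missing joints is our assumption, not specified in the text).

```python
import torch

def delta(pred, gt, visible=None):
    """Point-wise distance δ between two skeletons: per-joint l2 distance,
    averaged over the visible joints.
    pred, gt: [25, 2] keypoint tensors; visible: optional [25] boolean mask."""
    per_joint = (pred - gt).norm(dim=-1)          # [25]
    if visible is None:
        return per_joint.mean()
    return per_joint[visible].mean()

def normalize_trajectory(traj):
    """One plausible way to rescale a [L, 25, 2] trajectory so that it fits in a
    100 x 100 region, as assumed when reporting the errors of Table 1."""
    flat = traj.reshape(-1, 2)
    lo, hi = flat.min(0).values, flat.max(0).values
    scale = 100.0 / torch.clamp((hi - lo).max(), min=1e-8)
    return (traj - lo) * scale
```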
3.2 Quantitative Experiments

Baselines and ablations. As baselines, we select trajectory-modeling methods that are capable of encoding uncertainty about the future. Variational RNNs [13] extend recurrent neural networks (RNNs) to the non-deterministic case, by modeling every step with a variational auto-encoder (VAE) [24]. Trajectron++ [44] is a state-of-the-art trajectory-modeling framework which also builds on top of RNNs and (conditional) VAEs [23]. Uncertainty is modeled as a Gaussian mixture model (GMM). We adapt Trajectron++ to our data, making the encoding and decoding as similar to our setting as possible (for fairness), while keeping the core of the framework intact. We train two Trajectron++ versions: one with uniformly-sampled inputs and outputs ("Trajectron++ uni."), and a second one with non-uniform sampling, following the setup of our models ("Trajectron++"). We also ablate our model, and report results with and without training with re-encoded segments.

Tasks and metric. We evaluate our framework on three different tasks: future prediction, past prediction, and interpolation between two segments. Future prediction consists of predicting points from a future segment given a past segment. Past prediction is defined symmetrically. In the interpolation task, we input two separated segments (past and future) from a trajectory, and predict the segment in between them. In our model, we do so by decoding from the latent-space intersection of the two input segments. Baselines (which are autoregressive) are not capable of doing this combination, so we only use the past segment as input. Baselines are also not capable of directly performing past prediction, so we predict the future of the reversed trajectory instead.

Figure 4: Predictions on gymnastics data (FineGym). We show examples of past, future, and interpolation predictions. The inputs to our model are (irregularly in time) sampled keypoints obtained from human movement datasets, and the outputs are predictions of the trajectories at different continuous times (past, future, or in between the inputs). The only input to the model is the keypoints, not the images. Results show our model's capabilities for modeling trajectories well outside of the input's temporal range, for dealing with spatial and temporal occlusions, and for doing so at a large temporal resolution. See Section 3.3 for a deeper analysis.
(a) Future prediction. We show an example of a future prediction, where the input is eight irregularly sampled frames spanning one second (we only show three of them), and we predict up to three seconds into the future.
(b) Past prediction. We show an example of a past prediction, where the input is eight irregularly sampled frames spanning one second (we only show three of them), and we predict up to three seconds into the past.
(c) Interpolation. We provide the model with skeleton keypoints coming from two separate segments, and sample points at times in between these two segments, from the intersection of the two segments in latent space. The model produces sensible interpolations that are not simply a linear interpolation at the joint level: in the first example, the gymnast first stands, then turns; in the second example, the gymnast swings right and left, just in time to end up meeting the future segment at the right position.

Figure 5: Temporal editing. (a) Speed change. (b) Temporal offset. We decode, for the same times t, different trajectories z in the latent space. These trajectories have been obtained by encoding the segment in the top row, and moving in small increments in the latent space along directions that represent speed (a) or temporal offset (b). We highlight in green and pink some correspondences between different decoded trajectories, to emphasize that the spatial trajectories are the same but with variations in some time-related attribute, such as speed or temporal offset.
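As a sketch of how the interpolation task can be run at inference time with the box-embedding family, using the helper functions sketched in Sec. 2.5 (all names are illustrative):

```python
import torch

def interpolate(encoder, decoder, past, future, mid_times, n_samples=10):
    """Encode the past and future segments, intersect their representations in
    latent space, sample trajectories z from the intersection, and decode them
    at the in-between times. Returns several plausible in-between segments."""
    lo_p, hi_p = encoder(*past)                     # past   = (points, times)
    lo_f, hi_f = encoder(*future)                   # future = (points, times)
    lo_i = torch.maximum(lo_p, lo_f)                # latent-space intersection
    hi_i = torch.minimum(hi_p, hi_f)
    segments = []
    for _ in range(n_samples):
        z = lo_i + torch.rand_like(lo_i) * (hi_i - lo_i)        # uniform sample
        segments.append(torch.stack([decoder(z, t) for t in mid_times]))
    return segments
```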
As a metric, we use the average of the l2-norm distance across joints, which is used by all methods during training, and report the best out of M = 10 samples, to account for multiple modes and uncertainty in the prediction. Note that our model has never been explicitly trained to perform any of the previous tasks.

We show results in Tables 1a and 1b, for long and short trajectories respectively. Our model outperforms the baselines in all the tasks, which demonstrates its value and flexibility. We also show how creating more challenging negatives by re-encoding decoded trajectories results in more accurate predictions. However, our method performs well even without re-encoding.

3.3 Qualitative Experiments

We show examples of our model's inputs and outputs in Fig. 4. Specifically, we show future, past and interpolation predictions. The results reflect that the model learns to predict sequences up to four times longer than the input. They also show the large temporal resolution of our model: the model predictions evolve smoothly and sensibly for time-steps separated by a few hundredths of a second (Figs. 4a and 4b). When not all joints are present in the input (first frame in Fig. 4a), our model is still capable of reconstructing the full spatial extent of the position. Finally, note how the model can take irregularly sampled time-steps as inputs (Fig. 4b), which makes it adaptable to temporal occlusions. Baselines are only capable of predicting the future, and are restricted to predicting uniform time-steps.

Figure 6: Multiple futures. Given a few input frames, our model is capable of predicting the future. It does so by modeling a distribution over the possible trajectories: by sampling from this distribution, we can obtain different plausible futures given the input (past) segment. In this figure we show, for specific inputs, the ground truth future, as well as two different futures sampled randomly from the input segment's distribution. This figure shows that our model is indeed capturing the multi-modal nature of the trajectories under uncertainty.

Temporal editing. The segments are directly tied to the temporal span they represent. For example, two segments with the exact same coordinates and evolution across time, but starting at different times, will result in different (albeit similar) representations. The speed of a movement and the time in the trajectory at which the movement is performed are important attributes of that movement, and the representation should not be invariant to them: they belong to different trajectories. However, because these trajectories are very similar, our model learns to represent them close in the latent space. In Fig. 5, we show that the model encodes different temporal variations. Moving along specific directions in the latent space results in progressively faster trajectories (Fig. 5a), or in trajectories with an increasing temporal offset with respect to the original one (Fig. 5b). See Appendix E for details.

Representing multiple futures. A crucial aspect of our formulation is the assumption that the future is uncertain, and that our model has to be capable of modeling the different modes of the trajectory distribution. In Fig. 6 we show examples of multiple predicted futures given a single past segment, showing that our model captures the multi-modal nature of the trajectories under uncertainty.
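A minimal sketch of the best-of-M evaluation described above, assuming a list of M sampled predictions for the same target segment and the point-wise distance δ sketched in Sec. 3.1 (names are illustrative):

```python
import torch

def best_of_m_error(candidates, gt, delta):
    """Given M predicted segments for one target (each a [T, 25, 2] tensor),
    report the error of the best sample, to account for the multi-modal nature
    of the prediction (M = 10 in the tables)."""
    errors = []
    for cand in candidates:
        per_point = torch.stack([delta(p, x) for p, x in zip(cand, gt)])
        errors.append(per_point.mean())
    return torch.stack(errors).min()
```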
4 Related work

Modeling trajectories. Spatial trajectories are usually modeled in the literature in an autoregressive (AR) fashion [47, 33, 26, 2, 49, 20, 44, 30, 58, 52, 25, 19], where trajectories are defined conditioned on previous time-steps. Despite their success, AR models are incapable of dealing with some of the challenges stated in Section 1: most notably, they do not represent time as a continuous variable, they cannot model the full extent of a trajectory (simultaneously both past and future), and no learned trajectory-level metric can be obtained from them. Some of them model the uncertainty in the prediction [52, 25, 19, 20, 44, 13, 49], and we use two representative ones [13, 44] as our baselines. A different line of work focuses on representing segments of trajectories (not just points) as points in a latent space [62, 65, 60, 61, 28, 64, 9, 31]. However, these methods are not capable of modeling the full extent of a trajectory outside the limits of the considered segment. Additionally, the segment-level metric is either unstructured [62], or is explicitly given [65, 63, 64].

Continuous time. Modeling time as a continuous signal has gained traction recently in fields such as graphics [56, 53, 39, 46] or physics modeling [8, 11], because it accurately represents the underlying (continuous) world being modeled. In the graphics neural implicit functions literature, time is used to condition the prediction of the network. We adopt the same approach in our decoder. We encode the set of continuous times in a segment by using a Transformer network [54], which by construction is permutation invariant, but allows temporal embeddings to be concatenated with the input, both discrete [5, 4] and continuous [53, 51].

Self-supervised representation learning. Finding self-supervised representations for temporal data has been the subject of a large amount of work in domains such as trajectories [31], video [35, 16, 40], or audio processing [21, 42, 15]. Most methods, however, represent segments as simple points in a Euclidean space. Structured representations for temporal data [50, 36] allow the latent space to follow certain inductive biases, like our framework's idea that segments compose trajectories. We model segments as either normal distributions or box embeddings [55]. The latter have been used to represent hierarchical relationships in domains such as text [38, 34], knowledge bases [1], or images [41]. We use them to represent temporal information. In recent work, Park et al. [36] also model segments using normal distributions, where trajectories are weighted sums of the segment representations.

Acknowledgments and Disclosure of Funding

We thank Arjun Mani and Mia Chiquier for helpful feedback. This research is based on work partially supported by the NSF NRI Award #2132519 and the DARPA MCS program. DS is supported by the Microsoft PhD Fellowship.

References

[1] Ralph Abboud, Ismail Ceylan, Thomas Lukasiewicz, and Tommaso Salvatori. BoxE: A box embedding model for knowledge base completion. Advances in Neural Information Processing Systems (NeurIPS), 33:9649–9661, 2020.
[2] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and Silvio Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–971, 2016.
[3] James F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 1983.
[4] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021.
[5] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In Proceedings of the International Conference on Machine Learning (ICML), July 2021.
[6] Sarah-Jayne Blakemore and Jean Decety. From the perception of action to the understanding of intention. Nature Reviews Neuroscience, 2(8):561–567, 2001.
[7] Paul Bromiley. Products and convolutions of Gaussian probability density functions. Tina-Vision Memo, 3(4):1, 2003.
[8] Michael Bronstein. Beyond message passing: a physics-inspired paradigm for graph neural networks. The Gradient, 2022.
[9] Judith Butepage, Michael J Black, Danica Kragic, and Hedvig Kjellstrom. Deep representation learning for human motion prediction and classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6158–6166, 2017.
[10] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[11] Ben Chamberlain, James Rowbottom, Maria I Gorinova, Michael Bronstein, Stefan Webb, and Emanuele Rossi. GRAND: Graph neural diffusion. In International Conference on Machine Learning, pages 1407–1418. PMLR, 2021.
[12] Tejas Chheda, Purujit Goyal, Trang Tran, Dhruvesh Patel, Michael Boratko, Shib Sankar Dasgupta, and Andrew McCallum. Box embeddings: An open-source library for representation learning using geometric structures. In Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2021.
[13] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems (NeurIPS), volume 28, 2015.
[14] Shib Sankar Dasgupta, Michael Boratko, Dongxu Zhang, Luke Vilnis, Xiang Lorraine Li, and Andrew McCallum. Improving local identifiability in probabilistic box embeddings. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
[15] Eduardo Fonseca, Diego Ortego, Kevin McGuinness, Noel E. O'Connor, and Xavier Serra. Unsupervised contrastive learning of sound event representations. In ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 371–375. IEEE, 2021.
[16] Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2019.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[18] John R Hershey and Peder A Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), volume 4, pages IV-317. IEEE, 2007.
[19] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. Advances in Neural Information Processing Systems, 31, 2018.
[20] Boris Ivanovic and Marco Pavone. The Trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2375–2384, 2019.
[21] Aren Jansen, Manoj Plakal, Ratheet Pandya, Daniel PW Ellis, Shawn Hershey, Jiayang Liu, R Channing Moore, and Rif A Saurous. Unsupervised learning of semantic audio representations. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 126–130. IEEE, 2018.
[22] Gunnar Johansson. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14(2):201–211, 1973.
[23] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. Advances in Neural Information Processing Systems, 27, 2014.
[24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.
[25] Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and Ingmar Posner. Sequential attend, infer, repeat: Generative modelling of moving objects. Advances in Neural Information Processing Systems, 31, 2018.
[26] Philipp Kratzer, Marc Toussaint, and Jim Mainprice. Prediction of human full-body movements with motion optimization and recurrent neural networks. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 1792–1798. IEEE, 2020.
[27] Xiang Li, Luke Vilnis, Dongxu Zhang, Michael Boratko, and Andrew McCallum. Smoothing the geometry of probabilistic box embeddings. In International Conference on Learning Representations (ICLR), 2019.
[28] Xiucheng Li, Kaiqi Zhao, Gao Cong, Christian S Jensen, and Wei Wei. Deep representation learning for trajectory similarity computation. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 617–628. IEEE, 2018.
[29] Yingwei Li, Yi Li, and Nuno Vasconcelos. RESOUND: Towards action recognition without representation bias. In European Conference on Computer Vision (ECCV), pages 513–528, 2018.
[30] Yunzhu Li, Jiajun Wu, Jun-Yan Zhu, Joshua B Tenenbaum, Antonio Torralba, and Russ Tedrake. Propagation networks for model-based control under partial observation. In 2019 International Conference on Robotics and Automation (ICRA), pages 1205–1211. IEEE, 2019.
[31] Xiang Liu, Xiaoying Tan, Yuchun Guo, Yishuai Chen, and Zhe Zhang. CSTRM: Contrastive self-supervised trajectory representation model for trajectory similarity computation. Computer Communications, 2022.
[32] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
[33] Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2891–2900, 2017.
[34] Yasumasa Onoe, Michael Boratko, Andrew McCallum, and Greg Durrett. Modeling fine-grained entity types with box embeddings. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 2051–2064, 2021.
[35] Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, and Wei Liu. VideoMoCo: Contrastive video representation learning with temporally adversarial examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11205–11214, 2021.
[36] Jungin Park, Jiyoung Lee, Ig-Jae Kim, and Kwanghoon Sohn.
Probabilistic representations for video contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[37] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems (NeurIPS), pages 8024–8035. Curran Associates, Inc., 2019.
[38] Dhruvesh Patel, Shib Sankar Dasgupta, Michael Boratko, Xiang Li, Luke Vilnis, and Andrew McCallum. Representing joint hierarchies with box embeddings. In Automated Knowledge Base Construction, 2020.
[39] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10318–10327, 2021.
[40] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. Spatiotemporal contrastive video representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6964–6974, 2021.
[41] Anita Rau, Guillermo Garcia-Hernando, Danail Stoyanov, Gabriel J Brostow, and Daniyar Turmukhambetov. Predicting visual overlap of images through interpretable non-metric box embeddings. In European Conference on Computer Vision (ECCV), pages 629–646. Springer, 2020.
[42] Aaqib Saeed, David Grangier, and Neil Zeghidour. Contrastive learning of general-purpose audio representations. In ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3875–3879. IEEE, 2021.
[43] Antoine Salmona, Julie Delon, and Agnès Desolneux. Gromov-Wasserstein distances between Gaussian distributions. arXiv preprint arXiv:2104.07970, 2021.
[44] Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In European Conference on Computer Vision (ECCV), pages 683–700. Springer, 2020.
[45] Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. FineGym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2616–2625, 2020.
[46] Vincent Sitzmann, Julien N.P. Martel, Alexander W. Bergman, David B. Lindell, and Gordon Wetzstein. Implicit neural representations with periodic activation functions. arXiv, 2020.
[47] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[48] Nishant Subramani, Samuel Bowman, and Kyunghyun Cho. Can unconditional language models recover arbitrary sentences? Advances in Neural Information Processing Systems, 32, 2019.
[49] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, and Kevin Murphy. Stochastic prediction of multi-agent interactions from partial observations. International Conference on Learning Representations (ICLR), 2019.
[50] Dídac Surís, Ruoshi Liu, and Carl Vondrick.
Learning the predictability of the future. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[51] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 7537–7547, 2020.
[52] Charlie Tang and Russ R Salakhutdinov. Multiple futures prediction. Advances in Neural Information Processing Systems, 32, 2019.
[53] Basile Van Hoorick, Purva Tendulkar, Dídac Surís, Dennis Park, Simon Stent, and Carl Vondrick. Revealing occlusions with 4D neural fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), volume 30, 2017.
[55] Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. Probabilistic embedding of knowledge graphs with box lattice measures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 263–272, 2018.
[56] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, and Changil Kim. Space-time neural irradiance fields for free-viewpoint video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9421–9431, 2021.
[57] Chengming Xu, Yanwei Fu, Bing Zhang, Zitian Chen, Yu-Gang Jiang, and Xiangyang Xue. Learning to score figure skating sport videos. IEEE Transactions on Circuits and Systems for Video Technology, 30(12):4578–4590, 2019.
[58] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[59] Sijie Yan, Yuanjun Xiong, Jingbo Wang, and Dahua Lin. MMSkeleton. https://github.com/open-mmlab/mmskeleton, 2019.
[60] Di Yao, Gao Cong, Chao Zhang, and Jingping Bi. Computing trajectory similarity in linear time: A generic seed-guided neural metric learning approach. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1358–1369. IEEE, 2019.
[61] Di Yao, Gao Cong, Chao Zhang, Xuying Meng, Rongchang Duan, and Jingping Bi. A linear time approach to computing time series similarity based on deep metric learning. IEEE Transactions on Knowledge and Data Engineering, 2020.
[62] Di Yao, Chao Zhang, Zhihua Zhu, Jianhui Huang, and Jingping Bi. Trajectory clustering via deep representation learning. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 3880–3887. IEEE, 2017.
[63] Guan Yuan, Penghui Sun, Jie Zhao, Daxing Li, and Canwei Wang. A review of moving object trajectory clustering algorithms. Artificial Intelligence Review, 47(1):123–144, 2017.
[64] Wenyuan Zeng, Shenlong Wang, Renjie Liao, Yun Chen, Bin Yang, and Raquel Urtasun. DSDNet: Deep structured self-driving network. In European Conference on Computer Vision (ECCV), pages 156–172. Springer, 2020.
[65] Hanyuan Zhang, Xingyu Zhang, Qize Jiang, Baihua Zheng, Zhenbang Sun, Weiwei Sun, and Changhu Wang. Trajectory similarity learning with auxiliary supervision and optimal matching. International Joint Conference on Artificial Intelligence (IJCAI), 2020.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] See Sections 2 and 3.
   (b) Did you describe the limitations of your work? [Yes] We discuss them in Appendix A.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] We discuss them in Appendix A.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplementary material or as a URL)? [Yes] They are in the supplementary material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix C.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Table 4.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix C.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] We cite the datasets being used.
   (b) Did you mention the license of the assets? [No] The licenses are mentioned in the respective dataset and library papers, which we cite.
   (c) Did you include any new assets either in the supplementary material or as a URL? [Yes] Data and code are provided.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]